
Voice Pipeline

🚧 Coming Soon: This feature is planned but not yet implemented.

The STT and TTS providers listed below are interface-only stubs. The provider interfaces are defined, but the implementations are not yet connected to actual speech services.

Triggerfish's voice pipeline is designed for speech interaction through wake word detection, push-to-talk, and text-to-speech responses on macOS, iOS, and Android.

Architecture

Voice pipeline: Wake Word Detection → STT → Agent Processing → TTS → Voice Output

Audio flows through the same agent processing pipeline as text. Voice input is transcribed, enters the session as a classified message, passes through policy hooks, and the response is synthesized back to speech.
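
The flow described above can be sketched in TypeScript as a single voice turn. This is only an illustration under stated assumptions: the provider shapes mirror the `SttProvider` and `TtsProvider` interfaces from the Provider Registry section (with the optional `options` argument omitted for brevity), while `handleUtterance`, `AgentTurn`, and the stub providers are hypothetical names, not Triggerfish's actual API.

```typescript
// Illustrative sketch only: handleUtterance, AgentTurn, and the stubs below
// are hypothetical names, not part of Triggerfish's actual API.
interface SttProvider {
  transcribe(audio: Uint8Array): Promise<string>;
}

interface TtsProvider {
  synthesize(text: string): Promise<Uint8Array>;
}

// Hypothetical stand-in for agent processing (classification, policy hooks, reply).
type AgentTurn = (message: string) => Promise<string>;

// One voice turn: audio in -> transcript -> agent reply -> audio out.
async function handleUtterance(
  audio: Uint8Array,
  stt: SttProvider,
  agent: AgentTurn,
  tts: TtsProvider,
): Promise<Uint8Array> {
  const transcript = await stt.transcribe(audio); // speech-to-text
  const reply = await agent(transcript); // same path a typed message would take
  return tts.synthesize(reply); // text-to-speech
}

// Stub providers so the sketch runs without any speech service.
const echoStt: SttProvider = {
  transcribe: async (audio) => `heard ${audio.length} bytes`,
};

const byteTts: TtsProvider = {
  // Stores one character code per byte; a real provider would return audio data.
  synthesize: async (text) =>
    Uint8Array.from(text.split("").map((c) => c.charCodeAt(0) & 0xff)),
};

const upperAgent: AgentTurn = async (message) => message.toUpperCase();
```

Because the agent step only sees a string, the same `AgentTurn` function could serve typed and spoken input alike, which is the point of routing voice through the text pipeline.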

Voice Modes

| Mode | Description | Platform |
| --- | --- | --- |
| Voice Wake | Always-on listening for a configurable wake word | macOS, iOS, Android |
| Push-to-Talk | Manual activation via button or keyboard shortcut | macOS (menu bar), iOS, Android |
| Talk Mode | Continuous conversational speech | All platforms |

STT Providers

Speech-to-text converts your voice into text for the agent to process.

| Provider | Type | Notes |
| --- | --- | --- |
| Whisper | Local | Default. Runs on-device, no cloud dependency. Best for privacy. |
| Deepgram | Cloud | Low-latency streaming transcription. |
| OpenAI Whisper API | Cloud | High accuracy, requires API key. |

TTS Providers

Text-to-speech converts agent responses into spoken audio.

| Provider | Type | Notes |
| --- | --- | --- |
| ElevenLabs | Cloud | Default. Natural-sounding voices with voice cloning options. |
| OpenAI TTS | Cloud | High quality, multiple voice options. |
| System Voices | Local | OS-native voices. No cloud dependency. |

Provider Registry

Triggerfish uses a provider registry pattern for both STT and TTS. You can plug in any compatible provider by implementing the corresponding interface:

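```typescript
interface SttProvider {
  // Converts captured audio into a text transcript.
  transcribe(audio: Uint8Array, options?: SttOptions): Promise<string>;
}

interface TtsProvider {
  // Converts response text into audio for playback.
  synthesize(text: string, options?: TtsOptions): Promise<Uint8Array>;
}
```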
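
As a rough illustration of the registry pattern, the sketch below registers providers by name and looks them up at runtime. The `ProviderRegistry` class, its `register` and `get` methods, and the `SilentTts` stub are assumptions made for this example; the source specifies only the two interfaces, so `TtsOptions` is treated here as an opaque placeholder.

```typescript
// Hypothetical sketch: ProviderRegistry and SilentTts are illustrative names,
// not Triggerfish's actual API. TtsOptions is an opaque placeholder.
type TtsOptions = Record<string, unknown>;

interface TtsProvider {
  synthesize(text: string, options?: TtsOptions): Promise<Uint8Array>;
}

// Generic name -> provider map, usable for both STT and TTS providers.
class ProviderRegistry<T> {
  private readonly providers = new Map<string, T>();

  register(name: string, provider: T): void {
    if (this.providers.has(name)) {
      throw new Error(`Provider already registered: ${name}`);
    }
    this.providers.set(name, provider);
  }

  get(name: string): T {
    const provider = this.providers.get(name);
    if (provider === undefined) {
      throw new Error(`Unknown provider: ${name}`);
    }
    return provider;
  }
}

// Stub provider: returns one zero byte per input character instead of real audio.
class SilentTts implements TtsProvider {
  async synthesize(text: string): Promise<Uint8Array> {
    return new Uint8Array(text.length);
  }
}

const ttsRegistry = new ProviderRegistry<TtsProvider>();
ttsRegistry.register("silent", new SilentTts());
```

Failing loudly on an unknown name is a design choice for this sketch: a misspelled provider in configuration would surface at startup rather than at the first spoken response.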

Configuration

Configure voice settings in triggerfish.yaml:

```yaml
voice:
  stt:
    provider: whisper # whisper | deepgram | openai
    model: base # Whisper model size (tiny, base, small, medium, large)
  tts:
    provider: elevenlabs # elevenlabs | openai | system
    voice_id: "your-voice" # Provider-specific voice identifier
  wake_word: "triggerfish" # Custom wake word
  push_to_talk:
    shortcut: "Ctrl+Space" # Keyboard shortcut (macOS)
```
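
One way to catch typos in these settings early is to validate the parsed configuration against the values listed in the YAML comments. The sketch below assumes the YAML has already been parsed into a plain object; `VoiceConfig` and `validateVoiceConfig` are hypothetical names and do not describe Triggerfish's actual schema or loader.

```typescript
// Hypothetical sketch: VoiceConfig and validateVoiceConfig are illustrative,
// not Triggerfish's actual schema. Allowed values come from the YAML comments.
const STT_PROVIDERS = ["whisper", "deepgram", "openai"];
const TTS_PROVIDERS = ["elevenlabs", "openai", "system"];
const WHISPER_MODELS = ["tiny", "base", "small", "medium", "large"];

interface VoiceConfig {
  stt: { provider: string; model?: string };
  tts: { provider: string; voice_id?: string };
  wake_word?: string;
  push_to_talk?: { shortcut: string };
}

// Returns a list of human-readable problems; an empty list means no issues found.
function validateVoiceConfig(config: VoiceConfig): string[] {
  const errors: string[] = [];
  if (STT_PROVIDERS.indexOf(config.stt.provider) === -1) {
    errors.push(`unknown stt.provider: ${config.stt.provider}`);
  }
  // Model sizes are only checked for the local Whisper provider.
  if (
    config.stt.provider === "whisper" &&
    config.stt.model !== undefined &&
    WHISPER_MODELS.indexOf(config.stt.model) === -1
  ) {
    errors.push(`unknown stt.model: ${config.stt.model}`);
  }
  if (TTS_PROVIDERS.indexOf(config.tts.provider) === -1) {
    errors.push(`unknown tts.provider: ${config.tts.provider}`);
  }
  if (config.wake_word !== undefined && config.wake_word.trim() === "") {
    errors.push("wake_word must not be empty");
  }
  return errors;
}

// The values from the example above, as they would look after YAML parsing.
const exampleConfig: VoiceConfig = {
  stt: { provider: "whisper", model: "base" },
  tts: { provider: "elevenlabs", voice_id: "your-voice" },
  wake_word: "triggerfish",
  push_to_talk: { shortcut: "Ctrl+Space" },
};
```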

Security Integration

Voice data follows the same classification rules as text:

  • Voice input is classified the same as text input. Transcribed speech enters the session and may escalate taint just like a typed message.
  • TTS output passes through the PRE_OUTPUT hook before synthesis. If the policy engine blocks the response, it is never spoken.
  • Voice sessions carry taint just like text sessions. Switching to voice mid-session does not reset taint.
  • Wake word detection runs locally. No audio is sent to the cloud for wake word matching.
  • Audio recordings (if retained) are classified at the session's taint level.
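
The first three rules above can be sketched as follows. Everything here is an assumption for illustration: `VoiceSession`, `PreOutputHook`, `ingestMessage`, `speakIfAllowed`, and the numeric taint scale are hypothetical and do not reflect Triggerfish's actual types or classification levels.

```typescript
// Hypothetical sketch: VoiceSession, PreOutputHook, and the numeric taint scale
// are assumptions for illustration, not Triggerfish's actual types.
interface VoiceSession {
  taint: number; // higher = more sensitive; never lowered within a session
}

// Returns true to allow the response, false to block it.
type PreOutputHook = (response: string, session: VoiceSession) => boolean;

type Synthesize = (text: string) => Promise<Uint8Array>;

// Transcribed speech escalates taint the same way a typed message would.
function ingestMessage(session: VoiceSession, messageTaint: number): void {
  session.taint = Math.max(session.taint, messageTaint);
}

// Runs the hook before synthesis; a blocked response is never spoken.
async function speakIfAllowed(
  response: string,
  session: VoiceSession,
  preOutput: PreOutputHook,
  synthesize: Synthesize,
): Promise<Uint8Array | null> {
  if (!preOutput(response, session)) {
    return null; // blocked: synthesize is never called
  }
  return synthesize(response);
}
```

Placing the hook check ahead of the `synthesize` call means a blocked response never reaches a TTS provider at all, which matters when that provider is a cloud service.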

The voice pipeline will integrate with Buoy companion apps on iOS and Android, enabling push-to-talk and voice wake from mobile devices. Buoy is not yet available.