Voice Agents Deep Dive
A deep dive into production voice agent pipelines: STT/TTS provider tradeoffs, latency budgets, interruption handling, telephony integration, and building a complete voice pipeline with Deepgram + ElevenLabs.
Quick Reference
- Voice pipeline: speech-to-text -> LLM reasoning -> text-to-speech. Each stage must stream to hit the 500ms latency target.
- STT tradeoffs: Deepgram Nova-2 (fastest, ~100ms), Whisper (most accurate, ~300ms), AssemblyAI (best speaker diarization).
- TTS tradeoffs: ElevenLabs (most natural voice), OpenAI TTS (simplest API, good quality), PlayHT (best voice cloning).
- Barge-in handling: detect user speech during agent output, cancel TTS playback, cancel in-flight LLM generation, restart pipeline.
- Telephony: Twilio Media Streams for phone calls, LiveKit for WebRTC-based real-time audio in web/mobile apps.
- Latency budget: ~100ms STT + ~200ms LLM first token + ~150ms TTS first audio + ~50ms network = ~500ms total.
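The barge-in steps above hinge on being able to cancel all in-flight work for a turn at once. A minimal sketch of that pattern using `asyncio` task cancellation (the `VoiceTurn` class and its methods are illustrative, not any SDK's API):

```python
import asyncio

class VoiceTurn:
    """Tracks the in-flight LLM and TTS work for one agent turn so a
    barge-in (user speech detected during agent output) can cancel it all."""

    def __init__(self):
        self.tasks: list[asyncio.Task] = []

    def run(self, coro):
        # Wrap each pipeline stage (LLM generation, TTS playback) in a task
        # we can cancel later.
        task = asyncio.create_task(coro)
        self.tasks.append(task)
        return task

    async def cancel(self):
        # On barge-in: cancel everything, then wait for the cancellations
        # to settle before restarting the pipeline on the new utterance.
        for task in self.tasks:
            task.cancel()
        await asyncio.gather(*self.tasks, return_exceptions=True)
        self.tasks.clear()
```

In practice the VAD callback that detects user speech during playback would call `await turn.cancel()` before feeding the new audio back into STT.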
The 500ms Latency Budget
Human conversation has a natural turn-taking gap of 200-500ms. Anything above 700ms feels laggy and unnatural. Your entire voice pipeline — from the moment the user stops speaking to the moment they hear the first syllable of the response — must fit within this window. Every millisecond counts.
| Stage | Target Latency | What Happens |
|---|---|---|
| VAD + Endpointing | 50-100ms | Detect that the user stopped speaking. Too fast = clips words. Too slow = adds dead air. |
| Speech-to-Text | 80-150ms | Transcribe final audio segment. Streaming STT sends partial results; final transcript triggers agent. |
| LLM First Token | 150-300ms | Time to first token from the model. Use streaming; do not wait for full response. |
| TTS First Audio | 100-200ms | Synthesize first sentence into audio. Sentence-level chunking is critical here. |
| Network Round-Trip | 20-80ms | Transport audio between client and server. Co-locate services to minimize hops. |
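Summing the midpoints of the table's ranges shows why the stages cannot simply run back to back. A quick arithmetic check (values are the midpoints of the ranges above, not measurements):

```python
# Midpoint latency per stage, in milliseconds, taken from the table ranges.
stages = {
    "vad_endpointing": 75,    # midpoint of 50-100ms
    "stt_final": 115,         # midpoint of 80-150ms
    "llm_first_token": 225,   # midpoint of 150-300ms
    "tts_first_audio": 150,   # midpoint of 100-200ms
    "network_rtt": 50,        # midpoint of 20-80ms
}

serial_total = sum(stages.values())
print(f"serial total: {serial_total}ms")  # 615ms: over the 500ms budget if run serially
```

Run serially, the midpoints already total 615ms, which is why the overlap described next is not an optimization but a requirement.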
The total budget adds up fast. In practice, you need every stage to overlap with the next. Start STT transcription while audio is still arriving. Start LLM inference the moment the final transcript is ready. Start TTS synthesis on the first complete sentence, not the full response. This pipelining is what makes sub-500ms responses achievable.
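The "start TTS on the first complete sentence" step can be sketched as a small generator that buffers streaming LLM tokens and emits sentences as soon as they are complete (the sentence-boundary regex here is a deliberately simple assumption; production systems need smarter segmentation for abbreviations, numbers, etc.):

```python
import re

# Naive sentence boundary: terminal punctuation followed by whitespace.
SENTENCE_END = re.compile(r"([.!?])\s")

def sentence_chunks(token_stream):
    """Yield complete sentences from a stream of LLM text tokens, so TTS
    can start synthesizing the first sentence before the response ends."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while (match := SENTENCE_END.search(buffer)):
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():
        # Flush whatever remains when the stream closes.
        yield buffer.strip()

# Tokens arrive incrementally from a streaming LLM response.
tokens = ["Hel", "lo there. ", "How can I ", "help you today?"]
print(list(sentence_chunks(tokens)))
# -> ['Hello there.', 'How can I help you today?']
```

Feeding each yielded sentence straight into a streaming TTS request is what gets first audio out in the 100-200ms window instead of waiting for the full LLM response.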