
Voice Agents Deep Dive

Deep dive into production voice agent pipelines. STT/TTS provider tradeoffs, latency budgets, interruption handling, telephony integration, and building a complete voice pipeline with Deepgram + ElevenLabs.

Quick Reference

  • Voice pipeline: speech-to-text -> LLM reasoning -> text-to-speech. Each stage must stream to hit the 500ms latency target.
  • STT tradeoffs: Deepgram Nova-2 (fastest, ~100ms), Whisper (most accurate, ~300ms), AssemblyAI (best speaker diarization).
  • TTS tradeoffs: ElevenLabs (most natural voice), OpenAI TTS (simplest API, good quality), PlayHT (best voice cloning).
  • Barge-in handling: detect user speech during agent output, cancel TTS playback, cancel in-flight LLM generation, restart pipeline.
  • Telephony: Twilio Media Streams for phone calls, LiveKit for WebRTC-based real-time audio in web/mobile apps.
  • Latency budget: ~100ms STT + ~200ms LLM first token + ~150ms TTS first audio + ~50ms network = ~500ms total.
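The barge-in step above is mostly bookkeeping: track the in-flight LLM and TTS tasks so they can be cancelled the moment the VAD detects user speech. Here is a minimal asyncio sketch; the class and method names are made up for illustration, and a real agent would also flush any queued audio on the client.

```python
import asyncio


class BargeInController:
    """Minimal sketch of barge-in handling: when the user starts
    speaking mid-response, cancel the in-flight LLM generation and
    TTS playback tasks so the pipeline can restart cleanly."""

    def __init__(self) -> None:
        self.llm_task: asyncio.Task | None = None
        self.tts_task: asyncio.Task | None = None

    def start_response(self, llm_coro, tts_coro) -> None:
        # Launch LLM generation and TTS playback as cancellable tasks.
        self.llm_task = asyncio.ensure_future(llm_coro)
        self.tts_task = asyncio.ensure_future(tts_coro)

    def on_user_speech_detected(self) -> None:
        # VAD fired while the agent was speaking: cancel both stages.
        for task in (self.llm_task, self.tts_task):
            if task is not None and not task.done():
                task.cancel()
```

Wiring this to a real VAD means calling `on_user_speech_detected()` from the STT provider's speech-start event rather than waiting for a final transcript.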

The 500ms Latency Budget

Why 500ms matters

Human conversation has a natural turn-taking gap of 200-500ms. Anything above 700ms feels laggy and unnatural. Your entire voice pipeline — from the moment the user stops speaking to the moment they hear the first syllable of the response — must fit within this window. Every millisecond counts.

| Stage | Target Latency | What Happens |
| --- | --- | --- |
| VAD + Endpointing | 50-100ms | Detect that the user stopped speaking. Too fast = clips words. Too slow = adds dead air. |
| Speech-to-Text | 80-150ms | Transcribe the final audio segment. Streaming STT sends partial results; the final transcript triggers the agent. |
| LLM First Token | 150-300ms | Time to first token from the model. Use streaming; do not wait for the full response. |
| TTS First Audio | 100-200ms | Synthesize the first sentence into audio. Sentence-level chunking is critical here. |
| Network Round-Trip | 20-80ms | Transport audio between client and server. Co-locate services to minimize hops. |
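Summing the stage budgets from the table is a useful sanity check: even the best serial case lands at 400ms, and the worst case blows well past the target. A quick sketch (stage keys are just labels invented here):

```python
# Stage budgets in milliseconds, (best, worst), taken from the table above.
BUDGET_MS = {
    "vad_endpointing": (50, 100),
    "stt_final": (80, 150),
    "llm_first_token": (150, 300),
    "tts_first_audio": (100, 200),
    "network_rtt": (20, 80),
}

best_case = sum(lo for lo, _ in BUDGET_MS.values())   # 400 ms
worst_case = sum(hi for _, hi in BUDGET_MS.values())  # 830 ms

print(f"serial best case:  {best_case} ms")
print(f"serial worst case: {worst_case} ms")
```

Run serially, the worst case is 830ms, which is why the stages must overlap rather than execute one after another.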

The total budget adds up fast. In practice, you need every stage to overlap with the next. Start STT transcription while audio is still arriving. Start LLM inference the moment the final transcript is ready. Start TTS synthesis on the first complete sentence, not the full response. This pipelining is what makes sub-500ms responses achievable.
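The sentence-level chunking mentioned above is the key trick on the TTS side: instead of waiting for the full LLM response, accumulate streamed tokens and hand each complete sentence to the synthesizer as soon as it closes. A sketch, assuming a simple punctuation-based boundary (a production system might use a smarter segmenter):

```python
import re

# Sentence boundary: terminal punctuation followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(token_stream):
    """Accumulate streamed LLM tokens and yield each complete sentence
    as soon as it ends, so TTS can start synthesizing the first
    sentence while the model is still generating the rest."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Everything except the last part is a complete sentence.
        yield from parts[:-1]
        buffer = parts[-1]
    if buffer.strip():
        yield buffer  # flush whatever remains at end of stream
```

Feeding each yielded sentence straight into the TTS request queue is what gets time-to-first-audio down to the latency of synthesizing one sentence rather than the whole reply.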