Integrations/Real-Time AI
Advanced15 min

Voice Agents Deep Dive

Deep dive into production voice agent pipelines in 2026. Covers the pipeline-vs-Realtime-API architecture decision, updated STT and TTS provider choices (Deepgram Nova-3, ElevenLabs Flash v2.5, Cartesia Sonic 3), production-grade barge-in handling, cost modeling, and when to use LiveKit Agents Framework instead of rolling your own pipeline.

Quick Reference

  • Architecture fork: STT→LLM→TTS pipeline (provider control, debuggable) vs. OpenAI Realtime API (speech-to-speech, ~200ms latency, vendor lock-in). Pick one early.
  • For the pipeline path, default to LiveKit Agents Framework — VoicePipelineAgent handles VAD, barge-in, provider swapping, and WebSocket lifecycle.
  • Voice pipeline: each stage must stream AND overlap — STT starts while audio arrives, LLM starts on first partial, TTS starts on first sentence.
  • STT: Deepgram Nova-3 (~100ms, lowest latency), GPT-4o transcription (most accurate), AssemblyAI Universal-3-Pro (real-time speaker diarization).
  • TTS: ElevenLabs Flash v2.5 (~75ms, best quality), Cartesia Sonic 3 (~90ms, lowest latency), gpt-4o-mini-tts (steerability, per-token pricing).
  • Barge-in: detect speech, cancel TTS within 50ms, cancel LLM generation, save partial to conversation history, resume listening.
  • TTS cost scales per-character: compute turns × chars/response × rate before committing to a provider at any volume.

Should You Build a Voice Agent?

Voice adds real complexity — not just audio I/O

A voice interface imposes a latency budget (sub-500ms), a cost multiplier (TTS is expensive at scale), a reliability tax (WebSocket reconnection, audio format conversion, telephony edge cases), and a UX constraint (no markdown, no tables, no URLs in responses). Before building, confirm that voice interaction actually improves the product — not just that it's cool.

  • Voice adds value: phone-based customer support (users are on a call), hands-free workflows (driving, cooking, medical), accessibility (users who can't type), natural language over complex UIs.
  • Voice is likely wrong: form filling, structured data review, B2B tools where users are at a desk with a keyboard, anything where the agent response contains tables, code, or lists.
  • The latency constraint is non-negotiable: users notice anything above ~700ms. Every optimization decision downstream flows from this.
  • The cost is dominated by TTS, not LLM: ElevenLabs Flash v2.5 charges per character. A chatty agent with 10 turns and 150 chars per response costs ~$0.45/call at $0.30/1k chars — real money at scale.