Advanced15 min
Voice Agents Deep Dive
Deep dive into production voice agent pipelines in 2026. Covers the pipeline-vs-Realtime-API architecture decision, updated STT and TTS provider choices (Deepgram Nova-3, ElevenLabs Flash v2.5, Cartesia Sonic 3), production-grade barge-in handling, cost modeling, and when to use LiveKit Agents Framework instead of rolling your own pipeline.
Quick Reference
- →Architecture fork: STT→LLM→TTS pipeline (provider control, debuggable) vs. OpenAI Realtime API (speech-to-speech, ~200ms latency, vendor lock-in). Pick one early.
- →For the pipeline path, default to LiveKit Agents Framework — VoicePipelineAgent handles VAD, barge-in, provider swapping, and WebSocket lifecycle.
- →Voice pipeline: each stage must stream AND overlap — STT starts while audio arrives, LLM starts on first partial, TTS starts on first sentence.
- →STT: Deepgram Nova-3 (~100ms, lowest latency), GPT-4o transcription (most accurate), AssemblyAI Universal-3-Pro (real-time speaker diarization).
- →TTS: ElevenLabs Flash v2.5 (~75ms, best quality), Cartesia Sonic 3 (~90ms, lowest latency), gpt-4o-mini-tts (steerability, per-token pricing).
- →Barge-in: detect speech, cancel TTS within 50ms, cancel LLM generation, save partial to conversation history, resume listening.
- →TTS cost scales per-character: compute turns × chars/response × rate before committing to a provider at any volume.
Should You Build a Voice Agent?
Voice adds real complexity — not just audio I/O
A voice interface imposes a latency budget (sub-500ms), a cost multiplier (TTS is expensive at scale), a reliability tax (WebSocket reconnection, audio format conversion, telephony edge cases), and a UX constraint (no markdown, no tables, no URLs in responses). Before building, confirm that voice interaction actually improves the product — not just that it's cool.
- ▸Voice adds value: phone-based customer support (users are on a call), hands-free workflows (driving, cooking, medical), accessibility (users who can't type), natural language over complex UIs.
- ▸Voice is likely wrong: form filling, structured data review, B2B tools where users are at a desk with a keyboard, anything where the agent response contains tables, code, or lists.
- ▸The latency constraint is non-negotiable: users notice anything above ~700ms. Every optimization decision downstream flows from this.
- ▸The cost is dominated by TTS, not LLM: ElevenLabs Flash v2.5 charges per character. A chatty agent with 10 turns and 150 chars per response costs ~$0.45/call at $0.30/1k chars — real money at scale.