Voice Agents Deep Dive
A deep dive into production voice agent pipelines: STT/TTS provider tradeoffs, latency budgets, interruption handling, telephony integration, and building a complete voice pipeline with Deepgram + ElevenLabs.
Quick Reference
- Voice pipeline: speech-to-text -> LLM reasoning -> text-to-speech. Each stage must stream to hit the 500ms latency target.
- STT tradeoffs: Deepgram Nova-2 (fastest, ~100ms), Whisper (most accurate, ~300ms), AssemblyAI (best speaker diarization).
- TTS tradeoffs: ElevenLabs (most natural voice), OpenAI TTS (simplest API, good quality), PlayHT (best voice cloning).
- Barge-in handling: detect user speech during agent output, cancel TTS playback, cancel in-flight LLM generation, restart pipeline.
- Telephony: Twilio Media Streams for phone calls, LiveKit for WebRTC-based real-time audio in web/mobile apps.
- Latency budget: ~100ms STT + ~200ms LLM first token + ~150ms TTS first audio + ~50ms network = ~500ms total.
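The barge-in steps above hinge on being able to cancel all in-flight work for a turn at once. A minimal sketch of that pattern using `asyncio` task cancellation (the `VoiceTurn` class and its methods are illustrative, not any SDK's API):

```python
import asyncio

class VoiceTurn:
    """Tracks the in-flight LLM and TTS work for one agent turn so a
    barge-in (user speech detected during agent output) can cancel it all."""

    def __init__(self):
        self.tasks: list[asyncio.Task] = []

    def run(self, coro):
        # Wrap each pipeline stage (LLM generation, TTS playback) in a task
        # we can cancel later.
        task = asyncio.create_task(coro)
        self.tasks.append(task)
        return task

    async def cancel(self):
        # On barge-in: cancel everything, then wait for the cancellations
        # to settle before restarting the pipeline on the new utterance.
        for task in self.tasks:
            task.cancel()
        await asyncio.gather(*self.tasks, return_exceptions=True)
        self.tasks.clear()
```

In practice the VAD callback that detects user speech during playback would call `await turn.cancel()` before feeding the new audio back into STT.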
The 500ms Latency Budget
Human conversation has a natural turn-taking gap of 200-500ms. Anything above 700ms feels laggy and unnatural. Your entire voice pipeline — from the moment the user stops speaking to the moment they hear the first syllable of the response — must fit within this window. Every millisecond counts.
| Stage | Target Latency | What Happens |
|---|---|---|
| VAD + Endpointing | 50-100ms | Detect that the user stopped speaking. Too fast = clips words. Too slow = adds dead air. |
| Speech-to-Text | 80-150ms | Transcribe final audio segment. Streaming STT sends partial results; final transcript triggers agent. |
| LLM First Token | 150-300ms | Time to first token from the model. Use streaming; do not wait for full response. |
| TTS First Audio | 100-200ms | Synthesize first sentence into audio. Sentence-level chunking is critical here. |
| Network Round-Trip | 20-80ms | Transport audio between client and server. Co-locate services to minimize hops. |
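Summing the midpoints of the table's ranges shows why the stages cannot simply run back to back. A quick arithmetic check (values are the midpoints of the ranges above, not measurements):

```python
# Midpoint latency per stage, in milliseconds, taken from the table ranges.
stages = {
    "vad_endpointing": 75,    # midpoint of 50-100ms
    "stt_final": 115,         # midpoint of 80-150ms
    "llm_first_token": 225,   # midpoint of 150-300ms
    "tts_first_audio": 150,   # midpoint of 100-200ms
    "network_rtt": 50,        # midpoint of 20-80ms
}

serial_total = sum(stages.values())
print(f"serial total: {serial_total}ms")  # 615ms: over the 500ms budget if run serially
```

Run serially, the midpoints already total 615ms, which is why the overlap described next is not an optimization but a requirement.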
The total budget adds up fast. In practice, you need every stage to overlap with the next. Start STT transcription while audio is still arriving. Start LLM inference the moment the final transcript is ready. Start TTS synthesis on the first complete sentence, not the full response. This pipelining is what makes sub-500ms responses achievable.
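The "start TTS on the first complete sentence" step can be sketched as a small generator that buffers streaming LLM tokens and emits sentences as soon as they are complete (the sentence-boundary regex here is a deliberately simple assumption; production systems need smarter segmentation for abbreviations, numbers, etc.):

```python
import re

# Naive sentence boundary: terminal punctuation followed by whitespace.
SENTENCE_END = re.compile(r"([.!?])\s")

def sentence_chunks(token_stream):
    """Yield complete sentences from a stream of LLM text tokens, so TTS
    can start synthesizing the first sentence before the response ends."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while (match := SENTENCE_END.search(buffer)):
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():
        # Flush whatever remains when the stream closes.
        yield buffer.strip()

# Tokens arrive incrementally from a streaming LLM response.
tokens = ["Hel", "lo there. ", "How can I ", "help you today?"]
print(list(sentence_chunks(tokens)))
# -> ['Hello there.', 'How can I help you today?']
```

Feeding each yielded sentence straight into a streaming TTS request is what gets first audio out in the 100-200ms window instead of waiting for the full LLM response.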