Voice Agents
Voice agents cost 10-50x more per interaction than text agents and introduce failure modes that don't exist in chat. This article helps you decide whether voice is worth the complexity, choose the right architecture, understand the real costs, anticipate production failures, and evaluate whether your voice agent actually works.
Quick Reference
- Three architectures: cascade (STT+LLM+TTS, ~$0.09/5min), speech-to-speech (OpenAI Realtime, ~$0.75/5min), framework (LiveKit/Pipecat, cascade cost + managed infra).
- The ~500ms latency budget: VAD+endpointing (~75ms) + STT (~100ms) + LLM first token (~200ms) + TTS first audio (~100ms) + network (~50ms). These estimates sum to ~525ms, so no stage has slack to spare.
- Sentence-level TTS chunking is the single biggest latency optimization: time to first audio (TTFA) drops from seconds to hundreds of milliseconds.
- LiveKit Agents and Pipecat handle VAD, STT, LLM, TTS, barge-in, and WebRTC transport out of the box. Start here, not from scratch.
- TTS dominates cost at scale: Cartesia Sonic 3 ($0.03/min) vs ElevenLabs Flash v2.5 (~$0.24/min) vs OpenAI Realtime audio output (~$0.24/min).
- Voice-specific failure modes: STT mishearing triggers wrong tool calls, a TTS outage means silence, and latency spikes cause barge-in loops.
- Every voice agent needs a text fallback path: when audio quality drops below usable, offer to switch to chat.
- Evaluate with TTFA (p50 < 500ms, p95 < 800ms), word error rate (WER) on your domain (< 10%), task completion rate, and interruption rate.
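To make the sentence-level chunking point concrete, here is a minimal sketch of the idea: buffer the LLM's streamed tokens and hand each sentence to TTS as soon as it is complete, rather than waiting for the full response. The function name `sentence_chunks` and the boundary regex are illustrative assumptions, not the API of LiveKit, Pipecat, or any TTS provider, and a production splitter would need to handle abbreviations and other edge cases.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ., ! or ? followed by whitespace.
# This keeps "3.30" intact but would wrongly split after "Dr. ".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of LLM text tokens.

    Hypothetical helper: each yielded sentence can be sent to TTS
    immediately, so audio starts before the LLM finishes generating.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit every sentence that is now complete in the buffer.
        while (match := _SENTENCE_END.search(buffer)) is not None:
            sentence = buffer[: match.start()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Your", " appointment", " is", " at", " 3", ".", "30",
              " pm", ".", " Any", "thing", " else", "?"]
    for chunk in sentence_chunks(stream):
        print(chunk)
```

With this approach the first sentence reaches TTS as soon as the token that begins the second sentence arrives, which is where the TTFA saving comes from. The final sentence is only emitted when the stream ends.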
Should You Build a Voice Agent?
Voice agents add latency engineering, interruption handling, audio format conversion, VAD tuning, and STT/TTS costs on top of everything you already manage for a text agent. The question isn't whether you can build one — it's whether voice is the right interface for your use case. Most things that seem like they need voice actually work fine with chat.
| Signal | Voice Likely Worth It | Stick With Chat |
|---|---|---|
| Hands-free requirement | User is driving, cooking, exercising — typing is impossible | User is at a desk and could type faster than speaking |
| Response complexity | Short, factual exchanges: scheduling, quick lookups, form-filling | Multi-step reasoning, code output, tables — voice can't render these |
| Channel | Replacing a phone support line, accessibility requirement | Web or mobile app where text chat is already standard |
| Emotional tone | Empathy matters — customer complaints, healthcare intake | Technical or analytical — users prefer reading carefully |
| Error tolerance | User can confirm or correct easily mid-conversation | High-stakes actions where STT mishearing could cause damage |
| Cost sensitivity | Budget for 10-50x more per interaction than text | Cost is a primary constraint — voice rarely pays off at low volume |
A text agent interaction (Claude Haiku, ~500 tokens in + 200 out) costs under $0.001. A 5-minute voice conversation with Deepgram Nova-3 + Claude Haiku + Cartesia Sonic 3 costs ~$0.09. At 1,000 conversations/day that is about $90/day, or $2,700 over a 30-day month, in inference costs alone, before hosting, monitoring, or telephony. Run the math before committing.
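The projection above is simple enough to rerun with your own numbers. The sketch below assumes a 30-day month and uses the ~$0.09 per-conversation figure quoted above. The function name is hypothetical, and the result covers inference only.

```python
def monthly_inference_cost(
    cost_per_conversation: float,
    conversations_per_day: int,
    days_per_month: int = 30,  # assumption: 30-day month
) -> float:
    """Project monthly inference spend from a per-conversation cost."""
    return round(cost_per_conversation * conversations_per_day * days_per_month, 2)


if __name__ == "__main__":
    # Cascade stack figure from the article: ~$0.09 per 5-minute conversation.
    print(monthly_inference_cost(0.09, 1_000))
    # Same stack at a lower volume, to see how spend scales.
    print(monthly_inference_cost(0.09, 100))
```

Swapping in the ~$0.75 speech-to-speech figure from the Quick Reference shows how quickly the architecture choice changes the budget at the same volume.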
If voice passes the decision gate, see the Voice Agents Deep Dive for provider comparisons, latency budget detail, interruption handling code, and telephony integration (Twilio / LiveKit). This article covers the architecture decision, cost math, failure modes, and evaluation.