Voice Agents

Voice agents cost 10-50x more per interaction than text agents and introduce failure modes that don't exist in chat. This article helps you decide whether voice is worth the complexity, choose the right architecture, understand the real costs, anticipate production failures, and evaluate whether your voice agent actually works.

Quick Reference

→Three architectures: cascade (STT+LLM+TTS, ~$0.09/5min), speech-to-speech (OpenAI Realtime, ~$0.75/5min), framework (LiveKit/Pipecat, cascade cost + managed infra).
→The 500ms budget: VAD+endpointing (~75ms) + STT (~100ms) + LLM first token (~200ms) + TTS first audio (~100ms) + network (~50ms). These estimates sum to ~525ms, so the budget has no slack.
→Sentence-level TTS chunking is the single biggest latency optimization — TTFA drops from seconds to hundreds of milliseconds.
→LiveKit Agents and Pipecat handle VAD, STT, LLM, TTS, barge-in, and WebRTC transport out of the box — start here, not from scratch.
→TTS dominates cost at scale: Cartesia Sonic 3 ($0.03/min) vs ElevenLabs Flash v2.5 (~$0.24/min) vs OpenAI Realtime audio output (~$0.24/min).
→Voice-specific failure modes: STT mishearing triggers wrong tool calls, TTS outage = silence, latency spikes cause barge-in loops.
→Every voice agent needs a text fallback path — when audio quality drops below usable, offer to switch to chat.
→Evaluate with TTFA (p50 < 500ms, p95 < 800ms), WER on your domain (< 10%), task completion rate, and interruption rate.
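The sentence-level TTS chunking point above can be sketched as follows. This is a minimal illustration, not the approach of any particular framework: `sentence_chunks` is a hypothetical helper, the token stream is a plain iterable of strings standing in for a streaming LLM response, and the boundary rule (`.`, `!`, or `?` followed by whitespace) is a simplifying assumption. The idea is to hand each sentence to TTS as soon as it completes instead of waiting for the full response.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; emit it now
        # so TTS can start speaking while the LLM keeps generating.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing; keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


# Hypothetical streamed LLM output, split at arbitrary token boundaries.
stream = ["Your appoint", "ment is at 3 PM. ", "Anything", " else?"]
for chunk in sentence_chunks(stream):
    print(chunk)  # in a real pipeline, send each chunk to the TTS client here
```

A regex this simple will also split after abbreviations such as "Dr." when followed by a space, so a production pipeline would likely need a more careful segmenter. The latency benefit is the same either way: the first audio depends on the first sentence, not the whole reply.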

Should You Build a Voice Agent?

Voice agents add latency engineering, interruption handling, audio format conversion, VAD tuning, and STT/TTS costs on top of everything you already manage for a text agent. The question isn't whether you can build one; it's whether voice is the right interface for your use case. Many use cases that seem to need voice work fine with chat.

Signal	Voice Likely Worth It	Stick With Chat
Hands-free requirement	User is driving, cooking, exercising — typing is impossible	User is at a desk and could type faster than speaking
Response complexity	Short, factual exchanges: scheduling, quick lookups, form-filling	Multi-step reasoning, code output, tables — voice can't render these
Channel	Replacing a phone support line, accessibility requirement	Web or mobile app where text chat is already standard
Emotional tone	Empathy matters — customer complaints, healthcare intake	Technical or analytical — users prefer reading carefully
Error tolerance	User can confirm or correct easily mid-conversation	High-stakes actions where STT mishearing could cause damage
Cost sensitivity	Budget for 10-50x more per interaction than text	Cost is a primary constraint — voice rarely pays off at low volume

Voice costs 10-50x more per interaction than chat

A text agent interaction (Claude Haiku, ~500 tokens in + 200 out) costs under $0.001. A 5-minute voice conversation with Deepgram Nova-3 + Claude Haiku + Cartesia Sonic 3 costs ~$0.09. At 1,000 conversations/day that is $90/day, or about $2,700 over a 30-day month, in inference costs alone, before hosting, monitoring, or telephony. Run the math before committing.
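To run that math for your own volume, a small calculator is enough. The sketch below uses only the figures quoted above; the function name, the 30-day month, and the choice to treat $0.001 as an upper bound for one text interaction per conversation are assumptions made for illustration.

```python
def monthly_inference_cost(
    cost_per_conversation: float,
    conversations_per_day: int,
    days_per_month: int = 30,  # assumption: 30-day month
) -> float:
    """Return estimated monthly inference spend in dollars."""
    return cost_per_conversation * conversations_per_day * days_per_month


# Figures quoted in the article.
VOICE_COST_PER_CONVERSATION = 0.09  # 5-minute cascade conversation
TEXT_COST_PER_INTERACTION = 0.001   # stated upper bound ("under $0.001")
CONVERSATIONS_PER_DAY = 1_000

voice = monthly_inference_cost(VOICE_COST_PER_CONVERSATION, CONVERSATIONS_PER_DAY)
text_ceiling = monthly_inference_cost(TEXT_COST_PER_INTERACTION, CONVERSATIONS_PER_DAY)

print(f"voice: ${voice:,.0f}/month")
print(f"text (upper bound): ${text_ceiling:,.0f}/month")
```

Swap in your own per-conversation cost and daily volume. Hosting, monitoring, and telephony are deliberately left out, as in the figure above.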

If voice passes the decision gate, see the Voice Agents Deep Dive for provider comparisons, latency budget detail, interruption handling code, and telephony integration (Twilio / LiveKit). This article covers the architecture decision, cost math, failure modes, and evaluation.
