Voice Agents

Voice agents cost 10-50x more per interaction than text agents and introduce failure modes that don't exist in chat. This article helps you decide whether voice is worth the complexity, choose the right architecture, understand the real costs, anticipate production failures, and evaluate whether your voice agent actually works.

Quick Reference

  • Three architectures: cascade (STT+LLM+TTS, ~$0.09/5min), speech-to-speech (OpenAI Realtime, ~$0.75/5min), framework (LiveKit/Pipecat, cascade cost + managed infra).
  • The 500ms budget: VAD+endpointing (~75ms) + STT (~100ms) + LLM first token (~200ms) + TTS first audio (~100ms) + network (~50ms). These estimates already total ~525ms, so the budget leaves no slack for any stage to run long.
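  • Sentence-level TTS chunking is the single biggest latency optimization: synthesis starts on the first complete sentence instead of waiting for the full LLM response, so time to first audio (TTFA) drops from seconds to hundreds of milliseconds.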
  • LiveKit Agents and Pipecat handle VAD, STT, LLM, TTS, barge-in, and WebRTC transport out of the box — start here, not from scratch.
  • TTS dominates cost at scale: Cartesia Sonic 3 ($0.03/min) vs ElevenLabs Flash v2.5 (~$0.24/min) vs OpenAI Realtime audio output (~$0.24/min).
  • Voice-specific failure modes: STT mishearing triggers wrong tool calls, TTS outage = silence, latency spikes cause barge-in loops.
  • Every voice agent needs a text fallback path — when audio quality drops below usable, offer to switch to chat.
  • Evaluate with TTFA (p50 < 500ms, p95 < 800ms), word error rate (WER) on your domain (< 10%), task completion rate, and interruption rate.
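
To make the sentence-level chunking bullet concrete, here is a minimal sketch in Python. It assumes the LLM response arrives as an iterable of text fragments and that a sentence ends at `.`, `!`, or `?` followed by whitespace. The function name `sentence_chunks` is hypothetical and not part of LiveKit, Pipecat, or any provider SDK, and the boundary rule is deliberately naive (it would split after abbreviations such as "Dr.").

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ., ! or ? followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be mid-sentence, so keep buffering it.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Your appoint", "ment is at 3pm. ", "Anything", " else?"]
    for sentence in sentence_chunks(stream):
        # In a real pipeline each sentence would be handed to TTS here.
        print(sentence)
```

Each yielded sentence can be sent to the TTS provider immediately, so audio for the first sentence can start playing while the LLM is still generating the rest. Frameworks that handle this for you may use a more robust segmenter.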

Should You Build a Voice Agent?

Voice agents add latency engineering, interruption handling, audio format conversion, VAD tuning, and STT/TTS costs on top of everything you already manage for a text agent. The question isn't whether you can build one — it's whether voice is the right interface for your use case. Most things that seem like they need voice actually work fine with chat.

| Signal | Voice Likely Worth It | Stick With Chat |
| --- | --- | --- |
| Hands-free requirement | User is driving, cooking, exercising; typing is impossible | User is at a desk and could type faster than speaking |
| Response complexity | Short, factual exchanges: scheduling, quick lookups, form-filling | Multi-step reasoning, code output, tables; voice can't render these |
| Channel | Replacing a phone support line, accessibility requirement | Web or mobile app where text chat is already standard |
| Emotional tone | Empathy matters: customer complaints, healthcare intake | Technical or analytical; users prefer reading carefully |
| Error tolerance | User can confirm or correct easily mid-conversation | High-stakes actions where STT mishearing could cause damage |
| Cost sensitivity | Budget for 10-50x more per interaction than text | Cost is a primary constraint; voice rarely pays off at low volume |

Voice costs 10-50x more per interaction than chat

A text agent interaction (Claude Haiku, ~500 tokens in + 200 out) costs under $0.001. A 5-minute voice conversation with Deepgram Nova-3 + Claude Haiku + Cartesia Sonic 3 costs ~$0.09. At 1,000 conversations/day that is $2,700/month in inference costs alone, before hosting, monitoring, or telephony. Run the math before committing.
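
The arithmetic behind that monthly figure can be reproduced with a short Python sketch. The per-conversation costs are the approximate figures quoted in this article (~$0.09 for the cascade, ~$0.75 for speech-to-speech per 5-minute conversation). The 30-day month and the function name `monthly_inference_cost` are assumptions made for illustration.

```python
def monthly_inference_cost(
    cost_per_conversation: float,
    conversations_per_day: int,
    days_per_month: int = 30,  # assumption: a 30-day month
) -> float:
    """Return the estimated monthly inference spend in dollars."""
    return cost_per_conversation * conversations_per_day * days_per_month


# Approximate per-conversation costs quoted in this article (5-minute calls).
CASCADE_COST = 0.09            # Deepgram Nova-3 + Claude Haiku + Cartesia Sonic 3
SPEECH_TO_SPEECH_COST = 0.75   # OpenAI Realtime

if __name__ == "__main__":
    for name, cost in [
        ("cascade", CASCADE_COST),
        ("speech-to-speech", SPEECH_TO_SPEECH_COST),
    ]:
        total = monthly_inference_cost(cost, conversations_per_day=1_000)
        print(f"{name}: ${total:,.0f}/month")
```

Substituting your own conversation volume and average call length shows quickly whether voice fits the budget. Hosting, monitoring, and telephony are not included in these numbers.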

If voice passes the decision gate, see the Voice Agents Deep Dive for provider comparisons, latency budget detail, interruption handling code, and telephony integration (Twilio / LiveKit). This article covers the architecture decision, cost math, failure modes, and evaluation.