
Voice Agents

Building voice-based agents with a speech-to-text + LangGraph + text-to-speech pipeline, streaming audio, and interruption handling.

Quick Reference

  • Voice agent pipeline: microphone input → speech-to-text (Whisper, Deepgram) → LangGraph agent → text-to-speech (ElevenLabs, OpenAI TTS) → speaker output
  • Use streaming end-to-end: stream audio chunks to STT, stream LLM tokens to TTS, and stream audio back to the user for minimal perceived latency
  • Implement barge-in (interruption handling): detect when the user starts speaking during agent output, stop TTS playback, and process the new input
  • Model voice agent state with speaking/listening/processing phases and manage turn-taking explicitly in the graph
  • Use Voice Activity Detection (VAD) to segment user speech into utterances and avoid processing silence or background noise
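The speaking/listening/processing phases and barge-in behavior above can be sketched as a small state machine. The class and method names here are illustrative assumptions, not a LangGraph or vendor API:

```python
from enum import Enum, auto

class Phase(Enum):
    LISTENING = auto()   # capturing user audio
    PROCESSING = auto()  # agent is generating a response
    SPEAKING = auto()    # TTS playback in progress

class TurnState:
    """Minimal turn-taking controller (illustrative sketch)."""

    def __init__(self):
        self.phase = Phase.LISTENING

    def utterance_finished(self):
        # VAD detected end of user speech -> hand off to the agent
        if self.phase is Phase.LISTENING:
            self.phase = Phase.PROCESSING

    def response_ready(self):
        # Agent produced a response -> start TTS playback
        if self.phase is Phase.PROCESSING:
            self.phase = Phase.SPEAKING

    def playback_done(self):
        self.phase = Phase.LISTENING

    def user_spoke(self):
        # Barge-in: user speech during playback means cancel TTS immediately
        if self.phase is Phase.SPEAKING:
            self.phase = Phase.LISTENING
            return True  # caller should stop TTS playback now
        return False
```

Making the phases explicit like this keeps turn-taking decisions (especially barge-in) in one place instead of scattered across audio callbacks.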

The Voice Agent Pipeline

Every stage must stream

A voice agent is a pipeline: microphone → STT → LangGraph agent → TTS → speaker. The latency budget for a natural conversation is under 500ms end-to-end. Every stage must stream — batch processing at any point breaks the conversational flow.
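The end-to-end streaming shape can be sketched with asyncio generators. The stubs below stand in for real services (Deepgram/Whisper STT, a streaming LangGraph agent, ElevenLabs/OpenAI TTS); every function name and behavior here is an illustrative assumption, not a vendor API:

```python
import asyncio

async def mic():
    # Stand-in for microphone capture: yields audio chunks as they arrive.
    for chunk in ["hello", "there"]:
        yield chunk

async def stt_stream(audio_chunks):
    # Stand-in for streaming STT: yields a growing partial transcript per chunk.
    words = []
    async for chunk in audio_chunks:
        words.append(chunk)  # pretend each chunk transcribes to one word
        yield " ".join(words)

async def agent_stream(transcript):
    # Stand-in for a LangGraph agent with streaming enabled: yields tokens.
    for token in f"echo: {transcript}".split():
        yield token

async def tts_stream(tokens):
    # Stand-in for streaming TTS: synthesizes audio per token as tokens arrive,
    # rather than waiting for the full response text.
    async for token in tokens:
        yield f"<audio:{token}>"

async def run_pipeline():
    # mic -> STT -> agent -> TTS -> playback, streaming at every stage.
    transcript = None
    async for partial in stt_stream(mic()):
        transcript = partial  # keep the latest partial transcript
    played = []
    async for audio in tts_stream(agent_stream(transcript)):
        played.append(audio)  # "playback": consume chunks as they arrive
    return transcript, played
```

The key property is that no stage waits for its upstream to finish: each consumes an async iterator and yields as soon as it has output, which is what keeps perceived latency low.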

  • VAD (Voice Activity Detection) — segment speech from silence, avoid processing background noise. Use Silero VAD or WebRTC VAD.
  • STT (Speech-to-Text) — transcribe audio to text in real time. Deepgram (lowest latency), Whisper (best accuracy), AssemblyAI (good balance).
  • Agent (LangGraph) — process the transcript, call tools, generate a response. Standard LangGraph with streaming enabled.
  • TTS (Text-to-Speech) — synthesize speech from text. ElevenLabs (most natural), OpenAI TTS (good quality, simple API), Cartesia (lowest latency).
  • Playback — stream audio chunks to the speaker as they arrive. Do not buffer the full response.
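To make the VAD stage concrete, here is a minimal energy-based segmenter. Production systems should use Silero VAD or WebRTC VAD, which are far more robust to noise; this RMS-threshold stand-in and its parameters are illustrative assumptions:

```python
import math

def is_speech(frame, threshold=0.1):
    """Energy-based stand-in for a real VAD: RMS of the frame vs. a threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

def segment_utterances(frames, hangover=2, threshold=0.1):
    """Group consecutive speech frames into utterances.

    Up to `hangover` silent frames are tolerated inside an utterance so a
    brief pause doesn't split one sentence in two; silence between
    utterances is dropped rather than sent to the STT stage.
    """
    utterances, current, silence = [], [], 0
    for frame in frames:
        if is_speech(frame, threshold):
            current.append(frame)
            silence = 0
        elif current:
            silence += 1
            if silence > hangover:
                utterances.append(current)  # utterance ended
                current, silence = [], 0
            else:
                current.append(frame)  # tolerated in-utterance pause
    if current:
        utterances.append(current)
    return utterances
```

The `hangover` idea carries over to real VADs too: ending an utterance on the first silent frame makes the agent interrupt users mid-sentence, while too long a hangover adds latency before the agent responds.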