
Voice AI Latency

Voice AI latency is the total time between the user finishing a sentence and hearing the agent begin to respond — the single most important quality metric for conversational voice AI.

Definition

Voice AI latency is the end-to-end delay between when a user finishes speaking and when they hear the voice AI agent begin to reply. For natural-feeling conversation the target is 500–800ms total, split across STT (~100–200ms), LLM time-to-first-token (~150–400ms), TTS time-to-first-audio (~150–300ms), and network round-trip. Above roughly 1.2 seconds the conversation feels robotic; above 2 seconds the agent feels broken.
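The budget arithmetic above can be sketched as a quick check. The stage names and the specific millisecond figures below are illustrative picks from within the quoted ranges, not measurements:

```python
# Rough latency-budget check for one conversational turn, using
# illustrative values drawn from the component ranges quoted above.
BUDGET_MS = {
    "stt_finalize": 150,     # streaming STT: final transcript after end of speech
    "llm_first_token": 250,  # LLM time-to-first-token
    "tts_first_audio": 200,  # TTS time-to-first-audio
    "network_rtt": 100,      # round trips between user and services
}

def total_latency_ms(budget: dict[str, int]) -> int:
    """Sum per-stage latencies to get the user-perceived pause."""
    return sum(budget.values())

total = total_latency_ms(BUDGET_MS)
print(total)  # 700 — inside the 500–800ms target window
```

Because the stages run largely in sequence, any single slow component (e.g. a 900ms LLM time-to-first-token) blows the whole budget on its own.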

Why it matters

Humans perceive a pause of 800ms+ as a stall. A voice AI agent with 1.5s latency is usable but feels obviously artificial. A voice AI agent with 400ms latency feels like talking to a sharp human — which is the whole point of voice AI in the first place.

How to reduce latency

  • Streaming STT: begin transcribing while the user is still speaking, not after.
  • Streaming LLM: use an LLM that starts emitting tokens within 150ms. Groq-hosted models are among the fastest here.
  • Streaming TTS: pipe tokens into TTS as the LLM generates them, not after. TTFA (time-to-first-audio) <300ms is achievable with providers like ElevenLabs Turbo and Cartesia Sonic.
  • Regional deployment: host as close to your user as possible. For India, Mumbai region is the default.
  • Provider routing: pick the fastest provider for each layer of the stack per workload.
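The streaming handoff described in the first three bullets can be sketched as follows. `fake_llm_stream` and `fake_tts_synthesize` are hypothetical stand-ins for real provider SDK calls; the point is that tokens are flushed to TTS at natural pause points instead of waiting for the full reply:

```python
from typing import Iterator

def fake_llm_stream(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming LLM client yielding tokens as generated."""
    for token in ["Sure,", " I", " can", " help", " with", " that."]:
        yield token

def fake_tts_synthesize(text: str) -> bytes:
    """Stand-in for a streaming TTS call; returns one audio chunk."""
    return text.encode("utf-8")  # pretend these bytes are audio

def respond(prompt: str) -> list[bytes]:
    """Pipe LLM tokens into TTS at clause boundaries, so the first
    audio chunk is ready long before the full reply is generated."""
    audio_chunks, buffer = [], ""
    for token in fake_llm_stream(prompt):
        buffer += token
        # Flush to TTS at natural pause points (commas, sentence ends).
        if buffer.endswith((",", ".", "?", "!")):
            audio_chunks.append(fake_tts_synthesize(buffer))
            buffer = ""
    if buffer:  # flush any trailing text
        audio_chunks.append(fake_tts_synthesize(buffer))
    return audio_chunks
```

With this shape, time-to-first-audio is bounded by the first clause rather than the full response, which is what makes the sub-300ms TTFA figures above reachable.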