Voice AI Agent: Why 800 ms Is the Most Important Number in the Architecture
A voice agent isn't 'a text bot plus TTS'. It's an STT → LLM → TTS chain that has to respond before the human starts talking again. We break down the latency budget, turn-taking, interruption handling, and three typical streaming-pipeline breakdowns.
Advanced · Experimental · 40 min · LiveKit Agents, Deepgram, ElevenLabs, Claude Haiku
1. Voice is latency first, everything else second
In a chatbot a three-second delay is annoying but tolerable. In a voice agent those same three seconds are a catastrophe: the user starts talking again, thinking they weren't heard, and the conversation falls apart.
A voice agent lives inside a latency budget set by human hearing, not engineers. Between the last word and the start of the reply, less than 800 ms must pass — otherwise the listener senses something's off. Under 500 ms — a living conversation. Above 1.5 seconds — walkie-talkie territory.
This budget is the main architectural constraint. Bigger model? Doesn't fit. More context? Doesn't fit. Latency isn't a metric, it's the frame everything else lives inside.
💬 Text chat
- Latency 1–3 s — acceptable
- What matters — answer quality
- Free to choose model and context
🎙️ Voice agent
- Latency <800 ms — or UX breaks
- What matters — conversational rhythm
- Architecture dictated by the budget
Calculate latency first, not last. Before any code, write: STT = X ms, LLM first token = Y ms, TTS first audio = Z ms, total. If the sum exceeds one second, the architecture won't work as voice, and no later optimization will save it.
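That back-of-the-envelope check can literally be a ten-line script. A minimal sketch — the stage numbers below are illustrative assumptions, not measurements; replace them with figures from your own providers:

```python
# Back-of-the-envelope latency budget check for a voice pipeline.
# All numbers are illustrative assumptions -- measure your own stack.
BUDGET_MS = 800

stages = {
    "stt_final_transcript": 300,  # after end of speech, streaming STT
    "llm_first_token": 200,       # TTFT with a small model + prompt cache
    "tts_first_audio": 150,       # time to first audio chunk
    "network_overhead": 100,      # round trips between services
}

total = sum(stages.values())
print(f"total: {total} ms (budget: {BUDGET_MS} ms)")
for name, ms in stages.items():
    print(f"  {name}: {ms} ms")

if total > BUDGET_MS:
    raise SystemExit("over budget: rethink the architecture before writing code")
```

If the script exits with the over-budget message, no amount of later tuning will rescue the design — that's the point of running it before writing any pipeline code.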
2. Three stages, three sources of delay
A voice agent is three models in a chain: STT turns voice into text, the LLM thinks, TTS turns the answer back into voice. Each has its own budget, and the sum decides whether the user feels a living conversation.
STT: usually 200–500 ms after speech ends. Whisper or Deepgram both work; streaming mode cuts perceived latency to ~100 ms via intermediate hypotheses.
LLM: what matters is time-to-first-token (TTFT). This is why model choice is critical: Haiku delivers the first token in ~200 ms, larger models in 600+ ms. Prompt size matters too, so prompt caching here isn't optional, it's required.
TTS: the first words must play while the reply is still generating. ElevenLabs and Cartesia emit audio after the first 40–50 characters from the LLM.
STT — voice to text
budget ~200–500 ms, streaming down to 100 ms
LLM — thinking
TTFT 200–600 ms, depends on model + cache
TTS — text to voice
streams after 40–50 chars of reply
Total: <800 ms to first sound
that is the target budget
Streaming on all three stages isn't an optimization — it's the only way to fit the budget. If even one stage runs in batch mode, the architecture has already lost. Pick only streaming-capable models and providers.
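The shape of a streaming chain is easy to show with async generators. A toy sketch — `llm_tokens` and `speak` are stand-ins for a real streaming LLM and TTS client, not any SDK's API, and the 20-character flush threshold is lowered from the realistic 40–50 so the short example produces multiple chunks:

```python
import asyncio

async def llm_tokens():
    # Stand-in for a streaming LLM: yields tokens as they are generated.
    for token in ["Sure, ", "your ", "order ", "ships ", "tomorrow ", "morning."]:
        await asyncio.sleep(0.01)  # simulated inter-token delay
        yield token

async def speak(token_stream, min_chars: int = 20):
    # Stand-in for streaming TTS: starts synthesizing once it has ~20 chars
    # (40-50 in practice) instead of waiting for the full reply.
    chunks, buffer = [], ""
    async for token in token_stream:
        buffer += token
        if len(buffer) >= min_chars:
            chunks.append(buffer)  # would be handed to the TTS engine here
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks

chunks = asyncio.run(speak(llm_tokens()))
print(chunks)  # the first chunk plays while the rest is still generating
```

The key property: the user hears the first chunk while the LLM is still emitting the rest, which is exactly what a batch pipeline can never do.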
3. Turn-taking: the hardest part isn't the models, it's the silence
A human speaks, at some point stops, and the agent has to reply. The question: how does the agent know the human has FINISHED, not just paused mid-thought?
That's VAD (Voice Activity Detection). The simplest VAD measures volume: silence for N ms = end of turn. It works badly: people pause mid-thought ('I need to … [1.5 s] … figure out how many orders we have'). Aggressive VAD interrupts, shy VAD waits forever.
Modern voice agents use semantic turn detection: a small dedicated model looks at words and intonation to decide whether the thought is complete. 'I need to figure out' — definitely not the end. 'How many orders do we have?' — very likely the end.
Tune different thresholds for different contexts. 'Schedule a meeting' — long pauses are fine; 'buy a ticket' — 800 ms of silence already means the user is waiting. One universal threshold fails in both cases.
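A toy sketch of how silence and semantics combine, with per-context thresholds. Real systems use a trained end-of-turn model; the word list and threshold values here are illustrative assumptions:

```python
# Toy end-of-turn detector: silence alone is ambiguous, so weigh it
# against a crude completeness check on the transcript.
MID_THOUGHT_ENDINGS = (" to", " and", " but", " so", ",")

def looks_complete(transcript: str) -> bool:
    text = transcript.strip().lower()
    if text.endswith("?"):
        return True                 # questions usually close the turn
    return not text.endswith(MID_THOUGHT_ENDINGS)

def end_of_turn(transcript: str, silence_ms: int, context: str = "default") -> bool:
    # Per-context pause thresholds: scheduling tolerates long pauses,
    # a purchase flow should answer quickly.
    threshold = {"scheduling": 1500, "purchase": 800}.get(context, 1000)
    if silence_ms >= 2 * threshold:
        return True                 # very long silence wins regardless of wording
    return silence_ms >= threshold and looks_complete(transcript)

print(end_of_turn("I need to", 1200, "purchase"))                   # False: mid-thought
print(end_of_turn("how many orders do we have?", 900, "purchase"))  # True
```

Even this crude version captures the asymmetry from the text above: 'I need to' survives a long pause, while a finished question triggers a reply quickly.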
4. Interruptions aren't error handling — they're a user right
Picture a call-center line. The agent starts talking, you realize it's going the wrong direction, and cut in: 'hold on, wrong topic'. What does a voice agent without interruption handling do? Keeps monologuing over you until it finishes the script. Catastrophe: the user literally can't converse with it.
Interruption handling is three things working in sync. First: VAD on input runs continuously, even while the agent is speaking. The moment it hears the user — signal. Second: TTS must cut mid-word, instantly. Not 'finish the sentence', a hard stop. Third: the LLM needs to know not just that it was interrupted, but WHAT it managed to say. Otherwise on the next turn it continues from where the text ended, while the user only heard the first two sentences.
This is the most underrated part of voice UX. Demos look great because people take neat turns. In production, users interrupt constantly. If interruption doesn't work, the agent doesn't work.
Store in the dialogue history what the USER ACTUALLY HEARD, not what the LLM generated. These diverge during interruptions, and the next turn's grasp of prior context depends on it.
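One way to sketch that bookkeeping — the names are hypothetical, not from any SDK, and `chars_spoken` stands in for the playback position a real TTS client would report:

```python
# On interruption, commit to history what the user actually HEARD,
# not the full text the LLM generated.

def commit_turn(history: list, generated_text: str, chars_spoken: int) -> None:
    heard = generated_text[:chars_spoken].rstrip()
    if chars_spoken < len(generated_text):
        # Mark the cut so the next turn knows the rest was never delivered.
        heard += " [interrupted by user]"
    history.append({"role": "assistant", "content": heard})

history = []
reply = "Your order shipped yesterday. It should arrive Friday. Track it at..."
commit_turn(history, reply, chars_spoken=28)  # user cut in after one sentence
print(history[-1]["content"])
```

The explicit interruption marker is a design choice: it lets the LLM acknowledge the cut ('right, as I was saying about your order...') instead of silently resuming from text the user never heard.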
5. Three streaming-pipeline breakdowns nobody writes about
Streaming speeds everything up — but it breaks in three ways you should know upfront.
First: TTS reads '2026' sometimes as 'twenty twenty-six' and sometimes as 'two zero two six', because it gets tokens one at a time and never sees the full number. Fix — a normalizer between LLM and TTS.
Second: the LLM calls a tool mid-answer. You've already spoken 'let me check the order', the tool errors out, nothing to continue with. Fix — emit filler phrases first, build the real answer after the tool call.
Third: mic echo, where the agent hears its own TTS. Fix — echo cancellation in the SDK and ignoring VAD while TTS is active.
| Symptom | Cause | Fix |
|---|---|---|
| TTS reads '2026' inconsistently | Streaming sees tokens piecemeal | Number and date normalizer |
| LLM calls a tool mid-sentence | Streaming can't see the future | Filler first, real answer after |
| Agent replies to itself | TTS echo into the mic | Echo cancel + mute VAD during TTS |
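The first fix can be sketched in a few lines: hold back any trailing digits that might still be growing, and rewrite complete numbers before handing text to TTS. The spell-out rule here is deliberately tiny (4-digit years only) — a real normalizer also covers dates, currency, and phone numbers:

```python
import re

TRAILING_DIGITS = re.compile(r"\d+$")  # a number that may still be growing

def normalize(text: str) -> str:
    # Split complete 4-digit years ("2026" -> "20 26") so TTS reads them
    # as "twenty twenty-six" instead of digit-by-digit.
    return re.sub(r"\b(\d{2})(\d{2})\b", r"\1 \2", text)

def stream_to_tts(tokens):
    # Hold back a trailing partial number; flush it once later tokens
    # prove the number is complete.
    buffer = ""
    for token in tokens:
        buffer += token
        m = TRAILING_DIGITS.search(buffer)
        if m:
            safe, buffer = buffer[:m.start()], buffer[m.start():]
        else:
            safe, buffer = buffer, ""
        if safe:
            yield normalize(safe)
    if buffer:
        yield normalize(buffer)

print("".join(stream_to_tts(["delivery in 20", "26, not 20", "25"])))
```

Note that '2026' is split across two tokens in the example — exactly the failure mode from the table — yet the normalizer still sees it whole, because digits are never flushed while they sit at the end of the buffer.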
Record sessions and listen back — reading logs won't catch half the problems. Weird intonation, awkward pause, echo — these are caught only by ear. One morning of listening saves a day of debugging.
Result
A working voice agent with an under-800 ms budget from the end of user speech to the first word of reply. Streaming on all three stages, semantic turn detection, correct interruption handling — the feeling isn't 'I'm talking to a bot', it's 'I'm talking to a fast colleague'.