Voice AI Agent: Why 800 ms Is the Most Important Number in the Architecture
A voice agent isn't 'a text bot plus TTS'. It's an STT → LLM → TTS chain that has to respond before the human starts talking again. We break down the latency budget, turn-taking, interruption handling, and three typical streaming-pipeline breakdowns.
Advanced · Experimental · 40 min · LiveKit Agents, Deepgram, ElevenLabs, Claude Haiku
1. Voice is latency first, everything else second
In a chatbot a three-second delay is annoying but tolerable. In a voice agent those same three seconds are a catastrophe: the user starts talking again, thinking they weren't heard, and the conversation falls apart.
A voice agent lives inside a latency budget set by human hearing, not engineers. Between the last word and the start of the reply, less than 800 ms must pass — otherwise the listener senses something's off. Under 500 ms — a living conversation. Above 1.5 seconds — walkie-talkie territory.
This budget is the main architectural constraint. Bigger model? Doesn't fit. More context? Doesn't fit. Latency isn't a metric, it's the frame everything else lives inside.
💬 Text chat
- Latency 1–3 s — acceptable
- What matters — answer quality
- Free to choose model and context
🎙️ Voice agent
- Latency <800 ms — or UX breaks
- What matters — conversational rhythm
- Architecture dictated by the budget
Calculate latency first, not last. Before any code, write: STT = X ms, LLM first token = Y ms, TTS first audio = Z ms, total. If the sum exceeds one second, the architecture won't work as voice, and no later optimization will save it.
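That back-of-the-envelope check can literally be a ten-line script. A minimal sketch — the stage numbers below are illustrative assumptions, not measurements; replace them with figures from your own providers:

```python
# Back-of-the-envelope latency budget check for a voice pipeline.
# All numbers are illustrative assumptions -- measure your own stack.
BUDGET_MS = 800

stages = {
    "stt_final_transcript": 300,  # after end of speech, streaming STT
    "llm_first_token": 200,       # TTFT with a small model + prompt cache
    "tts_first_audio": 150,       # time to first audio chunk
    "network_overhead": 100,      # round trips between services
}

total = sum(stages.values())
print(f"total: {total} ms (budget: {BUDGET_MS} ms)")
for name, ms in stages.items():
    print(f"  {name}: {ms} ms")

if total > BUDGET_MS:
    raise SystemExit("over budget: rethink the architecture before writing code")
```

If the script exits with the over-budget message, no amount of later tuning will rescue the design — that's the point of running it before writing any pipeline code.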
2. Three stages, three sources of delay
A voice agent is three models in a chain: STT turns voice into text, the LLM thinks, TTS turns the answer back into voice. Each has its own budget, and the sum decides whether the user feels a living conversation.
STT: usually 200–500 ms after speech ends. Whisper or Deepgram both work; streaming mode cuts perceived latency to ~100 ms via intermediate hypotheses.
LLM: what matters is time-to-first-token (TTFT). This is why model choice is critical: Haiku delivers the first token in ~200 ms, larger models in 600+ ms. Prompt size matters too, so prompt caching here isn't optional, it's required.
TTS: the first words must play while the reply is still generating. ElevenLabs and Cartesia emit audio after the first 40–50 characters from the LLM.
STT — voice to text
budget ~200–500 ms, streaming down to 100 ms
LLM — thinking
TTFT 200–600 ms, depends on model + cache
TTS — text to voice
streams after 40–50 chars of reply
Total: <800 ms to first sound
that is the target budget
Streaming on all three stages isn't an optimization — it's the only way to fit the budget. If even one stage runs in batch mode, the architecture has already lost. Pick only streaming-capable models and providers.
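The shape of a streaming chain is easy to show with async generators. A toy sketch — `llm_tokens` and `speak` are stand-ins for a real streaming LLM and TTS client, not any SDK's API, and the 20-character flush threshold is lowered from the realistic 40–50 so the short example produces multiple chunks:

```python
import asyncio

async def llm_tokens():
    # Stand-in for a streaming LLM: yields tokens as they are generated.
    for token in ["Sure, ", "your ", "order ", "ships ", "tomorrow ", "morning."]:
        await asyncio.sleep(0.01)  # simulated inter-token delay
        yield token

async def speak(token_stream, min_chars: int = 20):
    # Stand-in for streaming TTS: starts synthesizing once it has ~20 chars
    # (40-50 in practice) instead of waiting for the full reply.
    chunks, buffer = [], ""
    async for token in token_stream:
        buffer += token
        if len(buffer) >= min_chars:
            chunks.append(buffer)  # would be handed to the TTS engine here
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks

chunks = asyncio.run(speak(llm_tokens()))
print(chunks)  # the first chunk plays while the rest is still generating
```

The key property: the user hears the first chunk while the LLM is still emitting the rest, which is exactly what a batch pipeline can never do.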
3. Turn-taking: the hardest part isn't the models, it's the silence
A human speaks, at some point stops, and the agent has to reply. The question: how does the agent know the human has FINISHED, not just paused mid-thought?
That's VAD (Voice Activity Detection). The simplest VAD measures volume: silence for N ms = end of turn. It works badly: people pause mid-thought ('I need to … [1.5 s] … figure out how many orders we have'). Aggressive VAD interrupts, shy VAD waits forever.
Modern voice agents use semantic turn detection: a small dedicated model looks at words and intonation to decide whether the thought is complete. 'I need to figure out' — definitely not the end. 'How many orders do we have?' — very likely the end.
Tune different thresholds for different contexts. 'Schedule a meeting' — long pauses are fine; 'buy a ticket' — 800 ms of silence already means the user is waiting. One universal threshold fails in both cases.
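A toy sketch of how silence and semantics combine, with per-context thresholds. Real systems use a trained end-of-turn model; the word list and threshold values here are illustrative assumptions:

```python
# Toy end-of-turn detector: silence alone is ambiguous, so weigh it
# against a crude completeness check on the transcript.
MID_THOUGHT_ENDINGS = (" to", " and", " but", " so", ",")

def looks_complete(transcript: str) -> bool:
    text = transcript.strip().lower()
    if text.endswith("?"):
        return True                 # questions usually close the turn
    return not text.endswith(MID_THOUGHT_ENDINGS)

def end_of_turn(transcript: str, silence_ms: int, context: str = "default") -> bool:
    # Per-context pause thresholds: scheduling tolerates long pauses,
    # a purchase flow should answer quickly.
    threshold = {"scheduling": 1500, "purchase": 800}.get(context, 1000)
    if silence_ms >= 2 * threshold:
        return True                 # very long silence wins regardless of wording
    return silence_ms >= threshold and looks_complete(transcript)

print(end_of_turn("I need to", 1200, "purchase"))                   # False: mid-thought
print(end_of_turn("how many orders do we have?", 900, "purchase"))  # True
```

Even this crude version captures the asymmetry from the text above: 'I need to' survives a long pause, while a finished question triggers a reply quickly.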
4. Interruptions aren't error handling — they're a user right
Picture a call-center line. The agent starts talking, you realize it's going the wrong direction, and cut in: 'hold on, wrong topic'. What does a voice agent without interruption handling do? Keeps monologuing over you until it finishes the script. Catastrophe: the user literally can't converse with it.
Interruption handling is three things working in sync. First: VAD on input runs continuously, even while the agent is speaking. The moment it hears the user — signal. Second: TTS must cut mid-word, instantly. Not 'finish the sentence', a hard stop. Third: the LLM needs to know not just that it was interrupted, but WHAT it managed to say. Otherwise on the next turn it continues from where the text ended, while the user only heard the first two sentences.
This is the most underrated part of voice UX. Demos look great because people take neat turns. In production, users interrupt constantly. If interruption doesn't work, the agent doesn't work.
Store in the dialogue history what the USER ACTUALLY HEARD, not what the LLM generated. These diverge during interruptions, and the next turn's grasp of prior context depends on it.
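One way to sketch that bookkeeping — the names are hypothetical, not from any SDK, and `chars_spoken` stands in for the playback position a real TTS client would report:

```python
# On interruption, commit to history what the user actually HEARD,
# not the full text the LLM generated.

def commit_turn(history: list, generated_text: str, chars_spoken: int) -> None:
    heard = generated_text[:chars_spoken].rstrip()
    if chars_spoken < len(generated_text):
        # Mark the cut so the next turn knows the rest was never delivered.
        heard += " [interrupted by user]"
    history.append({"role": "assistant", "content": heard})

history = []
reply = "Your order shipped yesterday. It should arrive Friday. Track it at..."
commit_turn(history, reply, chars_spoken=28)  # user cut in after one sentence
print(history[-1]["content"])
```

The explicit interruption marker is a design choice: it lets the LLM acknowledge the cut ('right, as I was saying about your order...') instead of silently resuming from text the user never heard.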
5. Three streaming-pipeline breakdowns nobody writes about
Streaming speeds everything up — but it breaks in three ways you should know upfront.
First: TTS reads '2026' sometimes as 'twenty twenty-six' and sometimes as 'two zero two six', because it gets tokens one at a time and never sees the full number. Fix — a normalizer between LLM and TTS.
Second: the LLM calls a tool mid-answer. You've already spoken 'let me check the order', the tool errors out, nothing to continue with. Fix — emit filler phrases first, build the real answer after the tool call.
Third: mic echo, where the agent hears its own TTS. Fix — echo cancellation in the SDK and ignoring VAD while TTS is active.
| Symptom | Cause | Fix |
|---|---|---|
| TTS reads '2026' inconsistently | Streaming sees tokens piecemeal | Number and date normalizer |
| LLM calls a tool mid-sentence | Streaming can't see the future | Filler first, real answer after |
| Agent replies to itself | TTS echo into the mic | Echo cancel + mute VAD during TTS |
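The first fix can be sketched in a few lines: hold back any trailing digits that might still be growing, and rewrite complete numbers before handing text to TTS. The spell-out rule here is deliberately tiny (4-digit years only) — a real normalizer also covers dates, currency, and phone numbers:

```python
import re

TRAILING_DIGITS = re.compile(r"\d+$")  # a number that may still be growing

def normalize(text: str) -> str:
    # Split complete 4-digit years ("2026" -> "20 26") so TTS reads them
    # as "twenty twenty-six" instead of digit-by-digit.
    return re.sub(r"\b(\d{2})(\d{2})\b", r"\1 \2", text)

def stream_to_tts(tokens):
    # Hold back a trailing partial number; flush it once later tokens
    # prove the number is complete.
    buffer = ""
    for token in tokens:
        buffer += token
        m = TRAILING_DIGITS.search(buffer)
        if m:
            safe, buffer = buffer[:m.start()], buffer[m.start():]
        else:
            safe, buffer = buffer, ""
        if safe:
            yield normalize(safe)
    if buffer:
        yield normalize(buffer)

print("".join(stream_to_tts(["delivery in 20", "26, not 20", "25"])))
```

Note that '2026' is split across two tokens in the example — exactly the failure mode from the table — yet the normalizer still sees it whole, because digits are never flushed while they sit at the end of the buffer.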
Record sessions and listen back — reading logs won't catch half the problems. Weird intonation, awkward pause, echo — these are caught only by ear. One morning of listening saves a day of debugging.
Result
A working voice agent with an under-800 ms budget from the end of user speech to the first word of reply. Streaming on all three stages, semantic turn detection, correct interruption handling — the feeling isn't 'I'm talking to a bot', it's 'I'm talking to a fast colleague'.