Voice Agents
Whisper + TTS + LLM
The Problem: Typing is slow, and sometimes you just want to talk. How can AI understand spoken language and respond with natural speech?
The Solution: Your Voice Assistant
A voice agent is an AI system you talk to instead of type to. Under the hood it chains three capabilities together: speech recognition (listening), language understanding (thinking), and speech synthesis (speaking). You speak, the system transcribes your words, an LLM decides what to do or say, and a synthesizer turns that answer back into audio. It's exactly what happens when you talk to Siri, Alexa, or an automated customer-service line — only modern agents can also call functions to actually do things (book a table, check an order) and use streaming so they start replying before you finish speaking.
How the pipeline works
The classic design is a three-stage pipeline: Speech-to-Text (STT) → Large Language Model (LLM) → Text-to-Speech (TTS). STT (for example OpenAI Whisper) turns your audio into text but throws away tone, emotion, and timing. The LLM reasons over that text — often with the previous turns kept in its context window so it remembers the conversation — and produces a reply. TTS then voices that reply. Each hop adds delay, and the sum is what users feel as latency: roughly 200ms + 500ms + 300ms ≈ one second per turn. That matters because in human conversation a pause longer than ~400ms already feels awkward, which is why newer end-to-end audio models (like GPT-4o's audio mode) take audio in and emit audio out directly, skipping the STT and TTS stages, preserving emotion, and cutting the delay.
Tradeoffs and a worked example
Use the pipelined STT→LLM→TTS design when you need flexibility (swap any component, read the transcript, add business logic between stages); use end-to-end audio when latency and emotional nuance dominate. The main pitfalls are transcription errors that cascade into wrong answers, and hallucinations where the LLM confidently invents facts the caller then trusts. Worked example: a caller says "I want to cancel my flight tomorrow." STT transcribes it; the LLM calls a lookupBooking function, sees there's a same-day change fee, and replies. If STT confidence is low (noisy line), a robust agent asks "Sorry, could you repeat that?" instead of guessing — a simple confidence threshold that prevents an expensive wrong action.
Think of it like a phone assistant:
- 1. Listen — STT (~200ms): Speech-to-Text converts audio to text. Loses tone, emotion, pauses
- 2. Think — LLM (~500ms): The model reasons about the text and generates a response
- 3. Speak — TTS (~300ms): Text-to-Speech converts the response back to audio
- 4. End-to-end alternative: Models like GPT-4o audio skip STT/TTS entirely — audio in, audio out. Preserves emotion, cuts latency to ~300ms
Pipeline total: ~1 second. But natural conversation tolerance is ~400ms. This gap is why end-to-end models (audio-native) are the future — they cut 2 of 3 stages.
Where Is This Used?
- Smart Speakers: Alexa, Google Home, Siri
- Call Centers: Automated customer service
- Accessibility: Helping visually impaired users
- In-Car Systems: Hands-free navigation and controls
Fun Fact: Modern voice agents can now understand tone, emotion, and even sarcasm! They can also generate speech in different voices, accents, and emotional styles. The uncanny valley is closing rapidly.
Try It Yourself!
See how voice agents process speech and generate natural-sounding responses.
Want to go beyond the traditional STT→LLM→TTS pipeline? See Real-Time Multimodal for end-to-end voice models that cut latency by 70%.
Error Recovery Patterns
Background Noise
Problem: STT produces garbled text
Solution: Noise gate + confidence threshold. If STT confidence < 0.7, ask user to repeat
Heavy Accent
Problem: Word error rate spikes above 20%
Solution: Use Whisper large-v3 for accent robustness. Add domain-specific vocabulary hints
Connection Drop
Problem: WebSocket disconnects mid-conversation
Solution: Maintain conversation state server-side. On reconnect, resume with 'We were discussing...' context
Frequently asked questions
What is a voice agent in simple terms?
A voice agent is an AI system you talk to instead of type to. It recognizes your speech, converts it to text, a large language model (LLM) decides what to say or do, and a synthesizer voices the answer. Familiar examples are Siri, Alexa, and automated customer-service phone lines.
How does a voice agent pipeline work?
The classic design is a three-stage pipeline: Speech-to-Text (STT) → LLM → Text-to-Speech (TTS). STT, for example OpenAI Whisper, turns your audio into text; the LLM reasons over it while keeping previous turns in its context window; TTS voices the reply. Each stage adds delay.
Why do voice agents have latency and how can you reduce it?
In an STT→LLM→TTS pipeline each stage adds time — roughly 200ms + 500ms + 300ms ≈ one second per turn, while a pause longer than ~400ms already feels awkward. Streaming (the agent starts replying before you finish) and end-to-end audio models like GPT-4o's audio mode, which take audio in and emit audio out while skipping STT and TTS, cut that delay.
How is an end-to-end audio model different from an STT→LLM→TTS pipeline?
The STT→LLM→TTS pipeline is flexible: you can swap any component, read the transcript, and add business logic between stages. An end-to-end model works on audio directly, preserving tone and emotion and cutting latency, but giving less control over intermediate steps. Choose the pipeline for flexibility and end-to-end when latency and emotional nuance matter most.
Try it yourself
Interactive demo of this technique
Properly process voice transcription: formatting and punctuation
so basically i went to the doctor today he said everything is fine tests are normal but i need to take vitamins especially vitamin d and also he recommended walking more outdoors oh and also schedule an eye doctor appointment in three months
Went to the doctor today. He said everything is fine — tests are normal.
Recommendations:
- Take vitamins, especially vitamin D.
- Walk more outdoors.
- Schedule an eye doctor appointment in three months.
Summary:
- Test results are normal
- Vitamin D — essential
- Outdoor walks
- Eye doctor in 3 months
Voice transcription without processing is a mess. A prompt with instructions (punctuation + structure + cleanup + summary) transforms speech flow into structured text with action items.
Create a free account to solve challenges
5 AI-verified challenges for this lesson
This lesson is part of a structured LLM course.
My Learning Path