Lesson 7

Voice Agents

Whisper + TTS + LLM

The Problem: Typing is slow, and sometimes you just want to talk. How can AI understand spoken language and respond with natural speech?

The Solution: Your Voice Assistant

A voice agent is an AI system you talk to instead of type to. Under the hood it chains three capabilities together: speech recognition (listening), language understanding (thinking), and speech synthesis (speaking). You speak, the system transcribes your words, an LLM decides what to do or say, and a synthesizer turns that answer back into audio. It's exactly what happens when you talk to Siri, Alexa, or an automated customer-service line — only modern agents can also call functions to actually do things (book a table, check an order) and use streaming so they start replying before you finish speaking.

How the pipeline works

The classic design is a three-stage pipeline: Speech-to-Text (STT) → Large Language Model (LLM) → Text-to-Speech (TTS). STT (for example OpenAI Whisper) turns your audio into text but throws away tone, emotion, and timing. The LLM reasons over that text — often with the previous turns kept in its context window so it remembers the conversation — and produces a reply. TTS then voices that reply. Each hop adds delay, and the sum is what users feel as latency: roughly 200ms + 500ms + 300ms ≈ one second per turn. That matters because in human conversation a pause longer than ~400ms already feels awkward, which is why newer end-to-end audio models (like GPT-4o's audio mode) take audio in and emit audio out directly, skipping the STT and TTS stages, preserving emotion, and cutting the delay.

Tradeoffs and a worked example

Use the pipelined STT→LLM→TTS design when you need flexibility (swap any component, read the transcript, add business logic between stages); use end-to-end audio when latency and emotional nuance dominate. The main pitfalls are transcription errors that cascade into wrong answers, and hallucinations where the LLM confidently invents facts the caller then trusts. Worked example: a caller says "I want to cancel my flight tomorrow." STT transcribes it; the LLM calls a lookupBooking function, sees there's a same-day change fee, and replies. If STT confidence is low (noisy line), a robust agent asks "Sorry, could you repeat that?" instead of guessing — a simple confidence threshold that prevents an expensive wrong action.

Think of it like a phone assistant:

1. Listen — STT (~200ms): Speech-to-Text converts audio to text. Loses tone, emotion, pauses
2. Think — LLM (~500ms): The model reasons about the text and generates a response
3. Speak — TTS (~300ms): Text-to-Speech converts the response back to audio
4. End-to-end alternative: Models like GPT-4o audio skip STT/TTS entirely — audio in, audio out. Preserves emotion, cuts latency to ~300ms

Pipeline total: ~1 second. But natural conversation tolerance is ~400ms. This gap is why end-to-end models (audio-native) are the future — they cut 2 of 3 stages.

Where Is This Used?

Smart Speakers: Alexa, Google Home, Siri
Call Centers: Automated customer service
Accessibility: Helping visually impaired users
In-Car Systems: Hands-free navigation and controls

Fun Fact: Modern voice agents can now understand tone, emotion, and even sarcasm! They can also generate speech in different voices, accents, and emotional styles. The uncanny valley is closing rapidly.

Try It Yourself!

See how voice agents process speech and generate natural-sounding responses.

Want to go beyond the traditional STT→LLM→TTS pipeline? See Real-Time Multimodal for end-to-end voice models that cut latency by 70%.

Error Recovery Patterns

Background Noise

Problem: STT produces garbled text

Solution: Noise gate + confidence threshold. If STT confidence < 0.7, ask user to repeat

Heavy Accent

Problem: Word error rate spikes above 20%

Solution: Use Whisper large-v3 for accent robustness. Add domain-specific vocabulary hints

Connection Drop

Problem: WebSocket disconnects mid-conversation

Solution: Maintain conversation state server-side. On reconnect, resume with 'We were discussing...' context

Frequently asked questions

What is a voice agent in simple terms?

A voice agent is an AI system you talk to instead of type to. It recognizes your speech, converts it to text, a large language model (LLM) decides what to say or do, and a synthesizer voices the answer. Familiar examples are Siri, Alexa, and automated customer-service phone lines.

How does a voice agent pipeline work?

The classic design is a three-stage pipeline: Speech-to-Text (STT) → LLM → Text-to-Speech (TTS). STT, for example OpenAI Whisper, turns your audio into text; the LLM reasons over it while keeping previous turns in its context window; TTS voices the reply. Each stage adds delay.

Why do voice agents have latency and how can you reduce it?

In an STT→LLM→TTS pipeline each stage adds time — roughly 200ms + 500ms + 300ms ≈ one second per turn, while a pause longer than ~400ms already feels awkward. Streaming (the agent starts replying before you finish) and end-to-end audio models like GPT-4o's audio mode, which take audio in and emit audio out while skipping STT and TTS, cut that delay.

How is an end-to-end audio model different from an STT→LLM→TTS pipeline?

The STT→LLM→TTS pipeline is flexible: you can swap any component, read the transcript, and add business logic between stages. An end-to-end model works on audio directly, preserving tone and emotion and cutting latency, but giving less control over intermediate steps. Choose the pipeline for flexibility and end-to-end when latency and emotional nuance matter most.

Try it yourself

Interactive demo of this technique

Technique Comparison

Demo Mode

Pre-recorded responses

TaskBeginnerWriting

Properly process voice transcription: formatting and punctuation

Without technique

Prompt

Here is a voice message transcription: "so basically i went to the doctor today he said everything is fine tests are normal but i need to take vitamins especially vitamin d and also he recommended walking more outdoors oh and also schedule an eye doctor appointment in three months"

Response

so basically i went to the doctor today he said everything is fine tests are normal but i need to take vitamins especially vitamin d and also he recommended walking more outdoors oh and also schedule an eye doctor appointment in three months

Tokens:65/60

Time:380ms

Quality:

With Voice

With technique

Prompt

Process this voice message transcription: 1. Add punctuation 2. Break into sentences and paragraphs by meaning 3. Remove filler words (so, basically, like) but keep the meaning 4. At the end — brief summary: key points as a list Transcription: "so basically i went to the doctor today he said everything is fine tests are normal but i need to take vitamins especially vitamin d and also he recommended walking more outdoors oh and also schedule an eye doctor appointment in three months"

Response

Went to the doctor today. He said everything is fine — tests are normal.

Recommendations:

Take vitamins, especially vitamin D.
Walk more outdoors.
Schedule an eye doctor appointment in three months.

Summary:

Test results are normal
Vitamin D — essential
Outdoor walks
Eye doctor in 3 months

👁️Raw transcription without punctuation is unreadable and unprocessable

🧠Post-processing: punctuation + structure + filler removal → readable text

✅Summary as a list extracts action items from stream-of-consciousness speech

Tokens:110/95

Time:650ms

Quality:

Why this works

Voice transcription without processing is a mess. A prompt with instructions (punctuation + structure + cleanup + summary) transforms speech flow into structured text with action items.

1 / 2

Practice Challenges

Create a free account to solve challenges

5 AI-verified challenges for this lesson

Related lessons:Function Calling Vision Basics

This lesson is part of a structured LLM course.

My Learning Path

Lesson 7

Voice Agents

Whisper + TTS + LLM

The Problem: Typing is slow, and sometimes you just want to talk. How can AI understand spoken language and respond with natural speech?

The Solution: Your Voice Assistant

How the pipeline works

Tradeoffs and a worked example

Think of it like a phone assistant:

1. Listen — STT (~200ms): Speech-to-Text converts audio to text. Loses tone, emotion, pauses
2. Think — LLM (~500ms): The model reasons about the text and generates a response
3. Speak — TTS (~300ms): Text-to-Speech converts the response back to audio
4. End-to-end alternative: Models like GPT-4o audio skip STT/TTS entirely — audio in, audio out. Preserves emotion, cuts latency to ~300ms

Pipeline total: ~1 second. But natural conversation tolerance is ~400ms. This gap is why end-to-end models (audio-native) are the future — they cut 2 of 3 stages.

Where Is This Used?

Smart Speakers: Alexa, Google Home, Siri
Call Centers: Automated customer service
Accessibility: Helping visually impaired users
In-Car Systems: Hands-free navigation and controls

Try It Yourself!

See how voice agents process speech and generate natural-sounding responses.

Want to go beyond the traditional STT→LLM→TTS pipeline? See Real-Time Multimodal for end-to-end voice models that cut latency by 70%.

Error Recovery Patterns

Background Noise

Problem: STT produces garbled text

Solution: Noise gate + confidence threshold. If STT confidence < 0.7, ask user to repeat

Heavy Accent

Problem: Word error rate spikes above 20%

Solution: Use Whisper large-v3 for accent robustness. Add domain-specific vocabulary hints

Connection Drop

Problem: WebSocket disconnects mid-conversation

Solution: Maintain conversation state server-side. On reconnect, resume with 'We were discussing...' context

Frequently asked questions

What is a voice agent in simple terms?

How does a voice agent pipeline work?

Why do voice agents have latency and how can you reduce it?

How is an end-to-end audio model different from an STT→LLM→TTS pipeline?

Try it yourself

Interactive demo of this technique

Technique Comparison

Demo Mode

Pre-recorded responses

TaskBeginnerWriting

Properly process voice transcription: formatting and punctuation

Without technique

Prompt

Response

Tokens:65/60

Time:380ms

Quality:

With Voice

With technique

Prompt

Response

Went to the doctor today. He said everything is fine — tests are normal.

Recommendations:

Take vitamins, especially vitamin D.
Walk more outdoors.
Schedule an eye doctor appointment in three months.

Summary:

Test results are normal
Vitamin D — essential
Outdoor walks
Eye doctor in 3 months

👁️Raw transcription without punctuation is unreadable and unprocessable

🧠Post-processing: punctuation + structure + filler removal → readable text

✅Summary as a list extracts action items from stream-of-consciousness speech

Tokens:110/95

Time:650ms

Quality:

Why this works

Voice transcription without processing is a mess. A prompt with instructions (punctuation + structure + cleanup + summary) transforms speech flow into structured text with action items.

1 / 2

Practice Challenges

Create a free account to solve challenges

5 AI-verified challenges for this lesson

Related lessons:Function Calling Vision Basics

This lesson is part of a structured LLM course.

My Learning Path