Lesson 5

Inference & KV-Cache

How LLMs generate text

The Problem: You type a question, and the AI gives you a complete answer. But how does it actually "think" and produce text? Does it have the whole answer ready in advance?

The Solution: A Typewriter with a Notebook

Imagine an old typewriter that can only type one letter at a time. But next to it lies a notebook with notes about "what usually comes next." Settings like temperature control how adventurous each choice is. This running process — feeding text into a trained model and reading out what it predicts — is called inference (the model "infers" the next token). It is a separate stage from training: the model's weights are already frozen, and inference simply uses them.
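As a rough illustration of what temperature does, the sketch below rescales a set of candidate scores before turning them into probabilities. The candidate words and their scores are invented for the example, and `softmax_with_temperature` is a hypothetical helper name, not part of any particular library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
candidates = ["Paris", "Lyon", "pizza"]
logits = [4.0, 2.0, 0.5]

cautious = softmax_with_temperature(logits, temperature=0.5)
adventurous = softmax_with_temperature(logits, temperature=2.0)

for name, p_low, p_high in zip(candidates, cautious, adventurous):
    print(f"{name:6s} T=0.5: {p_low:.3f}   T=2.0: {p_high:.3f}")
```

With a low temperature almost all of the probability goes to the top candidate; with a high temperature the unlikely candidates get a larger share, which is the "adventurous" behavior described above.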

LLMs don't generate entire sentences at once. The Transformer produces text token by token (words or parts of words), and each new token depends on all the previous ones. That generation happens in two phases. First comes prefill: the whole prompt is read in one parallel pass, and the model builds an internal summary of every token. Then comes decode: tokens are emitted one at a time, each pass producing exactly one new token. Prefill is fast and parallel; decode is the slow, step-by-step part you watch on screen as the answer streams out.

Why the KV-cache matters

To avoid re-reading the entire history on every step, models keep a KV-cache — the stored Keys and Values that attention computed for tokens already seen. Without it, generating token 1000 would force the model to re-process all 999 earlier tokens; with it, each new step only processes the single newest token and looks up the rest. The cache is what makes long answers affordable, but it also grows with the context (context window): more tokens means more cache, which eats GPU memory and is usually the real ceiling on how long a conversation can get.
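The effect of the cache can be sketched with a toy single-head attention layer in plain Python. Everything here (the width `D`, the random weight matrices, the `KVCache` class) is an illustrative assumption rather than how any specific model is implemented. The point is only that the cached and uncached paths give the same output while the cached path projects one token per step.

```python
import math
import random

random.seed(0)
D = 4  # toy vector width (hypothetical)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def project(vec, W):
    """Multiply a row vector by a matrix (vec @ W)."""
    return [sum(vec[i] * W[i][j] for i in range(len(vec))) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over the stored keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(D)]

W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def step_without_cache(history):
    """Re-project every token in the history, then attend from the newest one."""
    keys = [project(t, W_k) for t in history]
    values = [project(t, W_v) for t in history]
    return attend(project(history[-1], W_q), keys, values), len(history)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        """Project only the newest token, append it to the cache, attend over the cache."""
        self.keys.append(project(token, W_k))
        self.values.append(project(token, W_v))
        return attend(project(token, W_q), self.keys, self.values), 1

tokens = rand_matrix(6, D)  # six toy token embeddings
cache = KVCache()
work_cached = work_uncached = 0  # number of tokens projected to K/V
for t in range(len(tokens)):
    out_c, n_c = cache.step(tokens[t])
    out_u, n_u = step_without_cache(tokens[: t + 1])
    assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(out_c, out_u))
    work_cached += n_c
    work_uncached += n_u

print(work_cached, work_uncached)
```

The uncached path's projection work grows with the square of the sequence length, while the cached path's grows linearly. The price is the `keys` and `values` lists, which keep growing with every token.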

Latency, throughput, and cost

Two numbers describe inference speed. Latency is how long you wait; it is driven by prompt length (longer prompts mean a heavier prefill) and by output length (more tokens to decode). Throughput is how many tokens the whole GPU produces per second across all users. They pull in opposite directions: servers batch many requests together to raise throughput and cut cost per token, but a fuller batch can add a little latency for any single user. A concrete example: a 50-token question with a 200-token answer starts streaming almost immediately, while pasting a 100,000-token document forces a huge prefill, fills the cache, and makes the very first word arrive noticeably later. After that first word, tokens stream at a similar rate, though usually somewhat slower, because every decode step has to read a much larger cache.
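A back-of-the-envelope model makes the two contributions to latency concrete. The function name and both speed figures below are hypothetical placeholders, not measurements of any real system, and the model simplifies by treating decode speed as constant regardless of context length.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tokens_per_s=5000.0, decode_tokens_per_s=50.0):
    """Return (time_to_first_token, total_time) in seconds for one request.

    Both rates are hypothetical placeholders, not measurements.
    """
    time_to_first_token = prompt_tokens / prefill_tokens_per_s
    total_time = time_to_first_token + output_tokens / decode_tokens_per_s
    return time_to_first_token, total_time

# The two cases from the text: a short question and a pasted long document.
short_ttft, short_total = estimate_latency(50, 200)
long_ttft, long_total = estimate_latency(100_000, 200)

print(f"short prompt: first token after {short_ttft:.2f}s, done after {short_total:.2f}s")
print(f"long prompt:  first token after {long_ttft:.2f}s, done after {long_total:.2f}s")
```

Under these assumed rates the long prompt changes only the wait before the first token; the decode portion is identical in both cases.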

Think of it like a typewriter with a notebook:

1. You type: "The capital of France is"
2. Looks back: The typewriter looks at ALL previous text
3. Checks the notebook: After such a phrase, the word "Paris" often comes
4. Types: "Paris"
5. Now looks at: "The capital of France is Paris"
6. Decides what comes next: Maybe a period, or ", which is known for..."

This process is called inference — the model "infers" what word should come next.
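The steps above can be sketched as a small loop. The "notebook" here is a hand-written lookup table with invented counts, standing in for a trained model; `NOTEBOOK`, `next_token`, and `generate` are hypothetical names used only for this illustration.

```python
# A toy "notebook": for a given text so far, how often each next token followed it.
# The counts are invented for illustration.
NOTEBOOK = {
    ("The", "capital", "of", "France", "is"): {"Paris": 9, "Lyon": 1},
    ("The", "capital", "of", "France", "is", "Paris"): {".": 6, ",": 4},
}

def next_token(context):
    """Greedy choice: look at ALL previous text and pick the most frequent follower."""
    followers = NOTEBOOK.get(tuple(context))
    if not followers:
        return None  # nothing in the notebook for this text: stop
    return max(followers, key=followers.get)

def generate(prompt, max_new_tokens=5):
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token is None:
            break
        context.append(token)  # the new token becomes part of the next input
    return context

result = generate(["The", "capital", "of", "France", "is"])
print(" ".join(result))
```

Each pass through the loop produces exactly one token and feeds it back in, which is the autoregressive pattern the typewriter analogy describes. A real model replaces the table lookup with a neural network and usually samples rather than always taking the top choice.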

Where Is This Used?

ChatGPT/Claude: each answer is generated token by token (that's why you see text appearing gradually)
Code completion: GitHub Copilot predicts the next line
Translation: the translated text is generated one token at a time
Text summarization: summary is built piece by piece

Fun Fact: GPT-4 generates about 50-100 tokens per second. Each token requires billions of calculations! That's why powerful GPUs are needed — they perform thousands of operations in parallel.


KV-Cache in Simple Terms

Imagine reading a book and taking notes:

✗ Without cache: Re-read the ENTIRE book from the start each time
✓ With cache: Check your notes, fast and efficient

The KV-Cache is like notes in a notebook: it stores "summaries" of previous tokens so the model doesn't re-read everything.

What are K and V?

K (Key): "What is this token about?" (summary)
V (Value): "What's important?" (useful info)

Why is the first token slow?

Before generating the first word, the model must read and "understand" your entire prompt, all at once. This is called prefill, and for long prompts it accounts for most of the wait before the first token appears. After that, each new word is fast: the model just looks at its notes (KV-Cache) instead of re-reading everything.

Key Insight

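KV-Cache trades memory for speed. Each token's K/V vectors stay in memory; for a large model this can be on the order of 1 MB per token, depending on layer count, head layout, and numeric precision. This limits context length: at that rate, a 128K context needs roughly 128 GB of KV-Cache!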
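The per-token figure can be worked out from a model's shape. The sketch below uses the usual accounting (one Key and one Value vector per layer per KV head); the specific layer count, head count, and head size are hypothetical values picked so the result lands at 1 MiB per token, not the configuration of any named model.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V stored for one token across all layers (2 bytes = 16-bit values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # leading 2 = K and V

# Hypothetical model shape, chosen for round numbers.
per_token = kv_cache_bytes_per_token(num_layers=64, num_kv_heads=32, head_dim=128)
context_tokens = 128 * 1024  # a 128K-token context
total_gib = per_token * context_tokens / 1024**3

print(f"{per_token / 1024**2:.1f} MiB per token, {total_gib:.0f} GiB for the full context")
```

Models that share KV heads across query heads or store the cache at lower precision shrink `num_kv_heads` or `bytes_per_value`, and the total shrinks in proportion.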

Frequently asked questions

What is inference in an LLM?

Inference is running an already-trained model: you feed in text and the model predicts the next tokens. The weights stay frozen, unlike during training. This is the stage where the model generates its answer token by token.

What is the KV-cache and why does it matter?

The KV-cache stores the Keys and Values that attention already computed for past tokens. Without it, every new token would force the model to reprocess the whole history. With it, each step only handles the single newest token, which keeps long answers fast and affordable.

Why does a long context slow inference down?

A long prompt requires a heavy prefill pass: the model must read every token and fill the KV-cache before producing the first word. The cache grows with context length and consumes GPU memory. That is why a 100,000-token document adds a noticeable delay before the first word, even though tokens after that stream at a broadly similar rate.

What is the difference between prefill and decode?

Prefill is the first parallel pass over the whole prompt, building a representation of all input tokens at once. Decode is the step-by-step generation where each pass emits exactly one new token. Prefill is fast and parallel, while decode is the slow part you see as the streaming text on screen.

Technique Comparison

Task (Beginner, Analysis): Get a yes/no answer to a simple question

Without technique

Prompt

Is Python an interpreted programming language? (max_tokens: 4096)

Response

Yes, Python is an interpreted programming language. This means Python code is executed line by line by an interpreter, rather than being compiled entirely into machine code before execution. However, technically Python first compiles to bytecode (.pyc files), which is then executed by the Python virtual machine (CPython). There are also JIT compilers... [continues for 500 words]

Tokens: 18 input / 380 output

Time: 3200 ms

With technique

Prompt

Is Python an interpreted language? Answer in one word: yes or no. (max_tokens: 10)

Response

Yes.

Tokens: 22 input / 3 output

Time: 180 ms

👁️ max_tokens=4096 for a yes/no question: the limit is only a ceiling, and with no format instruction the model adds unnecessary detail (380 output tokens here)

🧠 Inference time grows roughly in proportion to the number of generated tokens (autoregressive, token by token)

✅ max_tokens=10 + explicit format instruction → 18x faster, 127x fewer output tokens

Why this works

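Inference is autoregressive: each token is generated sequentially, so response time grows with output length. Lowering max_tokens from 4096 to 10 caps the worst case, and the explicit one-word format instruction is what actually shortens the answer. Together they made the response about 18x faster without losing quality.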
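The ratios quoted in the comparison follow directly from the demo's own numbers. This small check uses only the figures shown above; the dictionary layout and variable names are just for illustration.

```python
# Figures taken from the pre-recorded comparison above.
baseline = {"output_tokens": 380, "time_ms": 3200}
constrained = {"output_tokens": 3, "time_ms": 180}

speedup = baseline["time_ms"] / constrained["time_ms"]
token_ratio = baseline["output_tokens"] / constrained["output_tokens"]

print(f"{speedup:.1f}x faster, {token_ratio:.1f}x fewer output tokens")
```

The speedup is much smaller than the reduction in output tokens. One plausible reading is that a fixed cost per request (prefill plus overhead) remains no matter how short the answer is, so trimming output helps most when the output was the dominant cost.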


Related lessons: Transformers Decoding

This lesson is part of a structured LLM course.
