Inference & KV-Cache
How LLMs generate text
The Problem: You type a question, and the AI gives you a complete answer. But how does it actually "think" and produce text? Does it have the whole answer ready in advance?
The Solution: A Typewriter with a Notebook
Imagine an old typewriter that can only type one letter at a time. But next to it lies a notebook with notes about "what usually comes next." Settings like temperature control how adventurous each choice is. This running process — feeding text into a trained model and reading out what it predicts — is called inference (the model "infers" the next token). It is a separate stage from training: the model's weights are already frozen, and inference simply uses them.
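To make the temperature setting concrete, here is a minimal sketch of how it could reshape the model's choices. The candidate tokens and their scores are invented for illustration, and `softmax_with_temperature` is a hypothetical helper name, not part of any specific library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
candidates = ["Paris", "Lyon", "a"]
logits = [4.0, 2.0, 1.0]

cautious = softmax_with_temperature(logits, temperature=0.5)
adventurous = softmax_with_temperature(logits, temperature=2.0)

for name, c, a in zip(candidates, cautious, adventurous):
    print(f"{name:>6}: T=0.5 -> {c:.3f}   T=2.0 -> {a:.3f}")
```

At the low temperature almost all probability lands on the top candidate, while the high temperature spreads it out so less likely tokens get picked more often.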
LLMs don't generate entire sentences at once. The Transformer produces text token by token (words or parts of words), and each new token depends on all the previous ones. That generation happens in two phases. First comes prefill: the whole prompt is read in one parallel pass, and the model builds an internal summary of every token. Then comes decode: tokens are emitted one at a time, each pass producing exactly one new token. Prefill is fast and parallel; decode is the slow, step-by-step part you watch on screen as the answer streams out.
Why the KV-cache matters
To avoid re-reading the entire history on every step, models keep a KV-cache: the stored Keys and Values that attention computed for tokens already seen. Without it, generating token 1000 would force the model to re-process all 999 earlier tokens; with it, each new step only processes the single newest token and looks up the rest. The cache is what makes long answers affordable, but it also grows with the context window: more tokens means more cache, which eats GPU memory and is usually the real ceiling on how long a conversation can get.
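The saving can be sketched as a simple count. This is a deliberately simplified model: it only counts how many tokens are run through the model for each position, and it ignores that the cached lookup itself still has a cost. The function names are hypothetical.

```python
def tokens_reprocessed(position, use_cache):
    """Tokens run through the model to produce the token at `position` (1-based)."""
    earlier = position - 1
    if use_cache:
        return min(earlier, 1)  # only the newest token; the rest is looked up
    return earlier              # the whole history is re-read

def total_work(sequence_len, use_cache):
    """Sum the per-position work over an entire sequence."""
    return sum(tokens_reprocessed(p, use_cache) for p in range(1, sequence_len + 1))

print(tokens_reprocessed(1000, use_cache=False))  # 999
print(tokens_reprocessed(1000, use_cache=True))   # 1
print(total_work(1000, use_cache=False))          # 499500
print(total_work(1000, use_cache=True))           # 999
```

Under these assumptions the total work grows quadratically with sequence length without a cache and only linearly with one.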
Latency, throughput, and cost
Two numbers describe inference speed. Latency is how long you wait; it is driven by prompt length (longer prompts mean a heavier prefill) and by output length (more tokens to decode). Throughput is how many tokens the whole GPU produces per second across all users. They pull in opposite directions: servers batch many requests together to raise throughput and cut cost per token, but a fuller batch can add a little latency for any single user. A concrete example: a 50-token question with a 200-token answer feels instant, while pasting a 100,000-token document forces a huge prefill, fills the cache, and makes the very first word arrive noticeably later. After that, tokens stream at a similar pace, though each step is somewhat slower because attention has to read a much larger cache.
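A rough back-of-the-envelope model of latency might look like the sketch below. The prefill and decode rates are hypothetical numbers chosen only for illustration, and the model assumes both phases scale linearly, which real servers only approximate.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (time_to_first_token, total_time) in seconds under a linear model."""
    time_to_first_token = prompt_tokens / prefill_tps
    decode_time = output_tokens / decode_tps
    return time_to_first_token, time_to_first_token + decode_time

# Hypothetical rates, not measurements of any real system.
PREFILL_TPS = 5_000  # prompt tokens read per second during prefill
DECODE_TPS = 50      # tokens emitted per second during decode

short = estimate_latency(50, 200, PREFILL_TPS, DECODE_TPS)
long_doc = estimate_latency(100_000, 200, PREFILL_TPS, DECODE_TPS)

print(f"short prompt:  first token after {short[0]:.2f}s, done after {short[1]:.2f}s")
print(f"long document: first token after {long_doc[0]:.2f}s, done after {long_doc[1]:.2f}s")
```

With these assumed rates the decode time is identical in both cases; the entire difference comes from the wait before the first token.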
Think of it like a typewriter with a notebook:
- 1. You type: "The capital of France is"
- 2. Reads the context: The typewriter looks at ALL previous text
- 3. Checks the notebook: After such a phrase, the word "Paris" often comes
- 4. Types: "Paris"
- 5. Now looks at: "The capital of France is Paris"
- 6. Decides what comes next: Maybe a period, or ", which is known for..."
This process is called inference — the model "infers" what word should come next.
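The steps above can be sketched as a small loop. The `NOTEBOOK` dictionary is a hypothetical stand-in for a trained model: a real model computes probabilities for every possible token rather than looking up a fixed answer, but the shape of the loop is the same.

```python
# Hypothetical "notebook": maps the full text so far to the likeliest next token.
NOTEBOOK = {
    "The capital of France is": " Paris",
    "The capital of France is Paris": ".",
}

def generate(prompt, max_new_tokens=10):
    """Toy loop: look at ALL text so far, append one token, repeat."""
    text = prompt
    emitted = []
    for _ in range(max_new_tokens):
        token = NOTEBOOK.get(text)  # the lookup sees the whole text, not just the last word
        if token is None:           # nothing in the notebook: stop typing
            break
        emitted.append(token)
        text += token
    return text, emitted

final_text, tokens = generate("The capital of France is")
print(final_text)  # The capital of France is Paris.
print(tokens)      # [' Paris', '.']
```

Each pass through the loop adds exactly one token, and the next pass sees the longer text, which is why the answer appears piece by piece.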
Where Is This Used?
- ChatGPT/Claude: each answer is generated token by token (that's why you see text appearing gradually)
- Code completion: GitHub Copilot predicts the next line
- Translation: models produce the translation one token at a time
- Text summarization: summary is built piece by piece
Fun Fact: GPT-4 generates roughly 50-100 tokens per second, though the exact rate varies with server load and hardware. Each token requires billions of calculations! That's why powerful GPUs are needed: they perform thousands of operations in parallel.
Imagine reading a book and taking notes:
- ✗ Without cache: Re-read the ENTIRE book from the start each time
- ✓ With cache: Check your notes, which is fast and efficient
The KV-cache works like notes in a notebook: it stores "summaries" of previous tokens so the model doesn't re-read everything.
What are K and V?
- K (Key): "What is this token about?" (summary)
- V (Value): "What's important?" (useful info)
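To see how Keys and Values are used, here is a minimal sketch of one attention lookup over a cache. It assumes tiny hand-written vectors; in a real model the keys, values, and query are produced by learned projections, and `KVCache` and `attend` are hypothetical names.

```python
import math

class KVCache:
    """Stores one key vector and one value vector per token already seen."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def attend(query, cache):
    """Scaled dot-product attention for a single query over the cached tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cache.keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(cache.values[0])
    # Blend the cached values, weighted by how well each key matched the query.
    return [sum(w * v[i] for w, v in zip(weights, cache.values)) for i in range(dim)]

cache = KVCache()
cache.append(key=[1.0, 0.0], value=[10.0, 0.0])  # first token seen
cache.append(key=[0.0, 1.0], value=[0.0, 10.0])  # second token seen

# The newest token's query matches the first key, so it mostly reads the first value.
output = attend([5.0, 0.0], cache)
print([round(x, 3) for x in output])
```

The keys decide which earlier tokens are relevant to the newest one, and the values are what actually gets read out. Because earlier keys and values never change, they can be stored once and reused on every step.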
Before generating the first word, the model must read and "understand" your entire prompt, all at once. This is called prefill, and for a long prompt it accounts for most of the wait before the first word appears. After that, each new word is fast: the model just looks at its notes (KV-Cache) instead of re-reading everything.
KV-Cache trades memory for speed. Each token's K/V vectors (roughly 1 MB per token for a large model, depending on its architecture and numeric precision) stay in memory. This limits context length: at that rate, a 128K-token context needs about 128 GB of KV-Cache!
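The arithmetic behind those figures can be sketched as follows. The layer count, head count, and head size are a hypothetical configuration picked so the result lands at 1 MiB per token; they do not describe any particular model, and models with fewer KV heads or a lower-precision cache need less.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold K and V for every layer, head, and token."""
    # The leading 2 accounts for storing both a key and a value.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

# Hypothetical model shape, with 16-bit (2-byte) cache entries.
config = dict(num_layers=64, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(1, **config)
full_context = kv_cache_bytes(128 * 1024, **config)

print(per_token / 2**20)     # 1.0   (MiB per token)
print(full_context / 2**30)  # 128.0 (GiB for a 128K-token context)
```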
Frequently asked questions
What is inference in an LLM?
Inference is running an already-trained model: you feed in text and the model predicts the next tokens. The weights stay frozen, unlike during training. This is the stage where the model generates its answer token by token.
What is the KV-cache and why does it matter?
The KV-cache stores the Keys and Values that attention already computed for past tokens. Without it, every new token would force the model to reprocess the whole history. With it, each step only handles the single newest token, which keeps long answers fast and affordable.
Why does a long context slow inference down?
A long prompt requires a heavy prefill pass: the model must read every token and fill the KV-cache before producing the first word. The cache grows with context length and consumes GPU memory. That is why a 100,000-token document adds a noticeable delay before the first word. Tokens after that stream at a similar pace, though each decode step is somewhat slower because attention reads a larger cache.
What is the difference between prefill and decode?
Prefill is the first parallel pass over the whole prompt, building a representation of all input tokens at once. Decode is the step-by-step generation where each pass emits exactly one new token. Prefill is fast and parallel, while decode is the slow part you see as the streaming text on screen.
Example: limiting output length
Task: get a yes/no answer to a simple question.
Unconstrained response: Yes, Python is an interpreted programming language. This means Python code is executed line by line by an interpreter, rather than being compiled entirely into machine code before execution. However, technically Python first compiles to bytecode (.pyc files), which is then executed by the Python virtual machine (CPython). There are also JIT compilers... [continues for 500 words]
Constrained response: Yes.
Why it helps: inference is autoregressive, so each token is generated sequentially and every extra output token costs another decode step. In this example, reducing max_tokens from 4096 to 10 and specifying the answer format sped up the response 18x without losing quality.