Inference & KV-Cache
How LLMs generate text
The Problem: You type a question, and the AI gives you a complete answer. But how does it actually "think" and produce text? Does it have the whole answer ready in advance?
The Solution: A Typewriter with a Notebook
Imagine an old typewriter that can only type one letter at a time. But next to it lies a notebook with notes about "what usually comes next." Settings like temperature control how adventurous each choice is.
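The "adventurousness" knob can be made concrete. Below is a minimal sketch of temperature sampling: divide the model's raw scores (logits) by the temperature before the softmax, then draw from the resulting distribution. The logit values are made up for illustration.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Scale logits by 1/temperature, softmax, then sample one index.

    Low temperature sharpens the distribution (conservative choices);
    high temperature flattens it (adventurous choices).
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i, probs
    return len(probs) - 1, probs

# At temperature 0.1 the top-scoring token wins almost every time
idx, probs = sample_with_temperature([2.0, 1.0, 0.1], temperature=0.1)
```

At temperature 2.0 the same logits give a much flatter distribution, so the runner-up tokens get picked far more often.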
LLMs don't generate entire sentences at once. The Transformer produces text token by token (words or parts of words). Each new token depends on all the previous ones. To avoid re-computing past tokens, models use a KV-cache.
Think of it like a typewriter with a notebook:
1. You type: "The capital of France is"
2. Looks at the context: the typewriter looks at ALL the previous text
3. Checks the notebook: after such a phrase, the word "Paris" often comes
4. Types: "Paris"
5. Now looks at: "The capital of France is Paris"
6. Decides what comes next: maybe a period, or ", which is known for..."
This process is called inference — the model "infers" what word should come next.
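The typewriter-and-notebook loop above can be sketched in a few lines. Here the "notebook" is just a toy lookup table keyed on the last token; a real LLM conditions on all previous tokens through attention, but the generate-append-repeat loop is the same.

```python
# Toy "typewriter with a notebook". The table below is a stand-in for the
# model: it only remembers what usually follows the LAST token, while a real
# LLM scores the next token against the ENTIRE context.
NOTEBOOK = {
    "is": "Paris",
    "Paris": ".",
}

def generate(prompt_tokens, max_new_tokens=5, stop="."):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NOTEBOOK.get(tokens[-1])  # "what usually comes next?"
        if next_token is None:                 # notebook has no entry: stop
            break
        tokens.append(next_token)              # type it, then reconsider
        if next_token == stop:
            break
    return tokens

print(generate(["The", "capital", "of", "France", "is"]))
# → ['The', 'capital', 'of', 'France', 'is', 'Paris', '.']
```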
Where Is This Used?
- ChatGPT/Claude: each answer is generated word by word (that's why you see text appearing gradually)
- Code completion: GitHub Copilot predicts the next line
- Translation: models generate the translation token by token
- Text summarization: summary is built piece by piece
Fun Fact: GPT-4 generates about 50-100 tokens per second. Each token requires billions of calculations! That's why powerful GPUs are needed — they perform thousands of operations in parallel.
Try It Yourself!
Below is an interactive visualization. Watch how the model processes your prompt and generates text token by token, considering all previous context. The KV-Cache is what keeps this generation efficient!
Imagine reading a book and taking notes:
- ✗ Without cache: re-read the ENTIRE book from the start each time
- ✓ With cache: check your notes — fast and efficient
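A back-of-envelope sketch makes the difference concrete. Counting how many (query, key) token pairs get scored while generating n tokens one at a time: without a cache the cost grows cubically overall, with a cache it stays quadratic.

```python
def attention_ops(n_tokens, use_cache):
    """Count (query, key) token pairs scored while generating n_tokens
    one at a time."""
    ops = 0
    for step in range(1, n_tokens + 1):
        if use_cache:
            ops += step          # only the new token attends to all tokens so far
        else:
            # no cache: re-process every prefix token too — each of the
            # `step` tokens attends to everything up to and including itself
            ops += sum(range(1, step + 1))
    return ops

print(attention_ops(100, use_cache=False))  # 171700
print(attention_ops(100, use_cache=True))   # 5050
```

For 100 generated tokens that is already a ~34x gap, and it widens as the sequence grows.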
Like notes in a notebook — stores "summaries" of previous tokens so the model doesn't re-read everything
What are K and V?
- K — Key: "What is this token about?" (a summary)
- V — Value: "What's important here?" (the useful information)
Before generating the first word, the model must read and "understand" your entire prompt — all at once. This is called prefill, and it takes most of the time. After that, each new word is fast: the model just looks at its notes (KV-Cache) instead of re-reading everything.
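The prefill/decode split can be sketched as follows. The `fake_kv` helper is a hypothetical stand-in for the real projections (K = x·W_K, V = x·W_V); the point is which tokens get processed when.

```python
def fake_kv(token):
    # Hypothetical stand-in for the real K/V projections of a token
    return (f"K({token})", f"V({token})")

def prefill(prompt_tokens):
    """Prefill: process the WHOLE prompt at once and fill the cache."""
    return [fake_kv(t) for t in prompt_tokens]

def decode_step(cache, new_token):
    """Decode: compute K/V only for the new token, reuse the cached
    "notes" for everything that came before."""
    cache.append(fake_kv(new_token))
    # Real attention would now score the new token's query against
    # every cached key — but nothing old is recomputed.
    return cache

cache = prefill(["The", "capital", "of", "France", "is"])
decode_step(cache, "Paris")
print(len(cache))  # 6 — one K/V entry per token seen so far
```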
KV-Cache trades memory for speed. Each token's K/V vectors (~1MB per token for large models) stay in memory. This limits context length — 128K context = ~128GB of KV-Cache!
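The memory math above can be checked directly. The model configuration below is hypothetical, chosen so the numbers land on the "~1 MB per token" figure; real models vary (grouped-query attention, for example, shrinks the key/value head count substantially).

```python
# Back-of-envelope KV-cache sizing (hypothetical large-model config)
n_layers      = 64     # transformer layers
n_kv_heads    = 32     # key/value heads (fewer with GQA/MQA)
head_dim      = 128    # dimension per head
bytes_per_val = 2      # fp16/bf16

# 2 × because we store both K and V at every layer
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
print(bytes_per_token)                        # 1048576 bytes = 1 MiB per token

context_len = 128 * 1024
print(bytes_per_token * context_len / 2**30)  # 128.0 GiB for a 128K context
```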
Prompt: get a yes/no answer to a simple question.
Without limits: "Yes, Python is an interpreted programming language. This means Python code is executed line by line by an interpreter, rather than being compiled entirely into machine code before execution. However, technically Python first compiles to bytecode (.pyc files), which is then executed by the Python virtual machine (CPython). There are also JIT compilers..." [continues for 500 words]
With a token limit and an explicit format: "Yes."
Inference is autoregressive: each token is generated sequentially. Reducing max_tokens from 4096 to 10 and specifying format sped up the response 18x without losing quality.
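The tweak can be sketched as two request payloads. The field names mirror common LLM APIs but are illustrative, not a call to any real SDK; the point is that capping output tokens caps the number of sequential decode steps, which is where the latency goes.

```python
# Hypothetical request payloads illustrating the change described above
slow_request = {
    "prompt": "Is Python an interpreted language?",
    "max_tokens": 4096,   # model is free to ramble for 500 words
}

fast_request = {
    "prompt": "Is Python an interpreted language? Answer with Yes or No only.",
    "max_tokens": 10,     # hard cap: generation stops after ~10 tokens
}

# Each output token costs one sequential decode step, so latency scales
# roughly with answer length — fewer tokens, faster response.
```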
This lesson is part of a structured LLM course.