Decoding Strategies
Token selection methods
The Problem: The model knows that after "The weather today is" several words could fit: "sunny", "rainy", "cold", "nice"... How does it pick just one?
The Solution: Autocomplete on Your Phone
Remember how your phone suggests words when you type a message? It shows 3 options, and you tap one. LLMs work similarly, but they choose automatically! Decoding is the final step of inference where a probability distribution becomes an actual token.
Why Does This Matter?
The decoding strategy determines the "personality" of the model's output:
- Customer support bot: greedy decoding for consistent, predictable answers
- Creative writing: sampling with high temperature for unexpected ideas
- Code generation: beam search to find the most correct solution
- Translation: beam search to maintain meaning and grammar
Think of it like phone autocomplete with different selection strategies:
1. Greedy: always pick the most likely word. Fast but bland; long texts become repetitive.
2. Random (sampling): draw a word at random, weighted by probability, like rolling a loaded die. More creative, but can wander off-topic.
3. Top-K: keep only the K most likely words, then sample among them. A balance between control and creativity.
4. Top-P (nucleus): keep the most likely words until their cumulative probability reaches the threshold P. The cutoff adapts to each context.
5. Beam Search: explore several candidate sequences in parallel and keep the best overall. Used for translation and other tasks where accuracy matters.
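Strategies 1–4 can be sketched in a few lines of Python. The toy distribution below (the words and probabilities for the prefix "The weather today is") is invented for illustration; a real model produces a distribution over tens of thousands of vocabulary tokens. Beam search needs to track whole sequences rather than single tokens, so it is left out of this sketch.

```python
import math
import random

# Toy next-token distribution for "The weather today is".
# Words and probabilities are made up for illustration.
probs = {"sunny": 0.40, "rainy": 0.25, "cold": 0.20, "nice": 0.10, "purple": 0.05}

def greedy(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)

def sample_with_temperature(probs, temperature=1.0):
    """Divide log-probabilities by the temperature, renormalize, then sample.
    Low temperature sharpens the distribution toward greedy; high flattens it."""
    logits = {t: math.log(p) / temperature for t, p in probs.items()}
    z = sum(math.exp(l) for l in logits.values())
    rescaled = {t: math.exp(l) / z for t, l in logits.items()}
    return random.choices(list(rescaled), weights=list(rescaled.values()))[0]

def top_k(probs, k=3):
    """Keep only the k most likely tokens, then sample among them."""
    kept = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:k])
    return random.choices(list(kept), weights=list(kept.values()))[0]

def top_p(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return random.choices(list(kept), weights=list(kept.values()))[0]

print(greedy(probs))                        # always "sunny"
print(sample_with_temperature(probs, 0.7))
print(top_k(probs, k=3))                    # never "nice" or "purple"
print(top_p(probs, p=0.9))                  # cutoff adapts to the distribution
```

Note how Top-P's cutoff is data-dependent: with this distribution and p=0.9 it keeps four tokens, but on a sharper distribution it might keep only one.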
Where Is This Used?
Fun Fact: ChatGPT's API defaults to Top-P sampling with P=1, meaning the model samples from essentially the full distribution, so more likely tokens win more often but unlikely ones still get a chance. That's why the same question can get a different answer each time!
Try It Yourself!
See how different decoding strategies affect the generated text. Try greedy vs random — notice the difference!
How does the model choose the next token? Different strategies produce different outputs from the same probability distribution. Greedy decoding, for example, always picks the single most likely token: deterministic, but prone to repetition.
| Strategy | Deterministic | Diversity | Use Case |
|---|---|---|---|
| Greedy | ✓ | Low | Facts, code |
| Beam Search | ✓ | Medium | Translation, summarization |
| Temperature | ✗ | Adjustable | Everything |
| Top-k | ✗ | Medium | Creative |
| Top-p | ✗ | Adaptive | Chat, creative |
Greedy is best for factual queries. Top-p (nucleus) sampling is the most popular for creative tasks — it adapts to each distribution rather than using a fixed cutoff like Top-k.
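The beam search row in the table deserves a concrete sketch: instead of committing to one token at a time, keep the `beam_width` highest-scoring partial sequences and extend them all. The toy "language model" below is just a hand-written table of next-word probabilities, invented for illustration:

```python
import math

# Toy "language model": fixed next-word probabilities per current word.
# All words and numbers are invented for illustration.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "end": 0.2},
    "a":   {"dog": 0.7, "end": 0.3},
    "cat": {"sat": 0.6, "end": 0.4},
    "dog": {"ran": 0.8, "end": 0.2},
    "sat": {"end": 1.0},
    "ran": {"end": 1.0},
}

def beam_search(beam_width=2, max_len=4):
    # Each beam is (sequence, cumulative log-probability).
    beams = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "end":          # finished sequences carry over
                candidates.append((seq, score))
                continue
            for tok, p in NEXT[seq[-1]].items():
                candidates.append((seq + [tok], score + math.log(p)))
        # Keep only the beam_width highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_width]
    return beams

for seq, score in beam_search():
    print(" ".join(seq[1:]), round(math.exp(score), 3))
```

On this toy model, greedy decoding would lock onto "the" (p=0.6) and end up with "the cat sat end" (total p=0.18), while beam search keeps the initially weaker "a" branch alive and finds "a dog ran end" (total p=0.224). That is exactly the failure mode beam search exists to avoid: a locally best token leading to a globally worse sequence.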
Prompt: Write the opening of a sci-fi story

Greedy output: The spaceship "Hope" flew through the vast expanse of space. Captain Johnson sat in his chair and looked at the stars. A long journey lay ahead. The crew was preparing to land on a new planet.

Nucleus sampling output: The beacon on Enceladus went silent on Thursday. Not that anyone noticed — automated stations go quiet and reboot all the time. But this beacon had been transmitting the same phrase for 340 years straight. And now it started transmitting a different one. In a language that did not exist in humanity's databases.
Greedy decoding (temperature=0) produces predictable text. Nucleus sampling (temperature=0.9, top_p=0.95) lets the model "take risks" — ideal for creative tasks where cliches are unacceptable.
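The temperature knob behind this difference can be sketched directly. Dividing logits by T (equivalently, raising probabilities to the power 1/T and renormalizing) sharpens the distribution when T < 1 and flattens it when T > 1. The numbers below are illustrative:

```python
def apply_temperature(probs, T):
    """Rescale a distribution: low T sharpens it, high T flattens it.
    p ** (1/T) with renormalization is equivalent to dividing logits by T."""
    scaled = [p ** (1.0 / T) for p in probs]
    z = sum(scaled)
    return [s / z for s in scaled]

probs = [0.5, 0.3, 0.15, 0.05]
print(apply_temperature(probs, 0.2))  # nearly one-hot: close to greedy
print(apply_temperature(probs, 1.5))  # flatter: rare tokens gain probability
```

At T=0.2 the top token's share climbs above 0.9, which is why low temperature feels "predictable"; at T=1.5 it drops below 0.5, which is where the surprising word choices come from.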