LLM Settings
Temperature, Top-p & more
The Problem: You ask ChatGPT the same question twice, but get different answers. Sometimes creative, sometimes dry. What's going on?
The Solution: Control Knobs on a Mixing Console
Imagine a DJ mixing console with sliders. Each slider affects the sound differently: bass, treble, volume. LLMs have similar "sliders" that control how text is generated during inference. At each step the model produces a probability distribution over the next possible token, and these settings decide how a single token is picked from that distribution. They don't change what the model knows — only how boldly or cautiously it samples from what it already believes.
Temperature: the creativity dial
Temperature rescales the probability distribution before sampling: the model's raw scores (logits) are divided by the temperature before they are turned into probabilities. Many APIs accept values from 0 to 2. Low temperature (0–0.3) sharpens the distribution so the most likely token almost always wins, and answers become focused, repeatable, and conservative. High temperature (0.8–1.5) flattens the distribution, giving rarer tokens a real chance and producing more varied, surprising text. Use temperature 0 for data extraction, classification, or math, where you want the same answer on every run, and around 0.8 for brainstorming or creative writing, where variety is the point.
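The rescaling described above can be sketched in a few lines. This is a minimal illustration, assuming the common formulation in which raw scores (logits) are divided by the temperature before a softmax; the function name and the three example scores are hypothetical, and real inference engines may differ in detail.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Dividing by zero is undefined; temperature 0 is usually handled
        # as "always pick the top-scoring token" (greedy decoding).
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens
logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.2)     # sharp: top token dominates
neutral = softmax_with_temperature(logits, 1.0)  # unchanged distribution
hot = softmax_with_temperature(logits, 1.5)      # flatter: rare tokens gain

for name, probs in [("T=0.2", cold), ("T=1.0", neutral), ("T=1.5", hot)]:
    print(name, [round(p, 3) for p in probs])
```

The top token's share shrinks as the temperature rises, which is the "sharpen versus flatten" behavior the paragraph describes.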
Top-P, Top-K, and the penalties
Top-P (nucleus sampling) takes a different approach: instead of rescaling, it keeps only the smallest set of tokens whose probabilities add up to at least P (e.g. 0.9 = the top 90% of probability mass) and samples from that set. It adapts automatically: a confident model considers few options, an uncertain one considers many. Top-K is the simpler cousin: it always keeps exactly the K most likely tokens (e.g. K=40), regardless of how confident the model is. A common recipe is to lower temperature OR tighten top-p, not both at once.
The frequency penalty reduces the score of a token in proportion to how often it has already appeared (curbing word-for-word repetition), while the presence penalty applies a flat reduction once a token has appeared at all (nudging the model toward new topics). Both usually range from 0 to 2.
The max tokens setting is a hard cap on output length. Set it generously for essays and tightly for one-line classifications to save cost and avoid runaway answers.
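To make the contrast between top-k and top-p concrete, here is a small sketch that filters a toy distribution both ways. The function names, token strings, and probabilities are made up for illustration; production samplers work on logits over the full vocabulary, but the selection logic is the same idea.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# A confident distribution and an uncertain one (hypothetical numbers)
confident = {"the": 0.92, "a": 0.05, "an": 0.03}
uncertain = {"cat": 0.30, "dog": 0.25, "bird": 0.20,
             "fish": 0.17, "frog": 0.05, "newt": 0.03}

print(len(top_p_filter(confident, 0.9)))  # nucleus is small when the model is sure
print(len(top_p_filter(uncertain, 0.9)))  # nucleus grows when the model is unsure
print(len(top_k_filter(confident, 2)))    # top-k keeps 2 either way
print(len(top_k_filter(uncertain, 2)))
```

With top-p at 0.9 the confident distribution keeps a single token while the uncertain one keeps four; top-k with K=2 keeps two tokens in both cases.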
Back at the mixing console, here is what each slider does:
1. Temperature (0-2): "creativity knob". Low = predictable and focused. High = wild and creative.
2. Top-P (0-1): "vocabulary width". Low = only the safest words. High = considers rare options too.
3. Max Tokens: "response length limit". The maximum number of tokens (word pieces, not whole words) the model can generate.
4. Frequency Penalty: "repetition punisher". Makes the model avoid repeating the same words.
5. Presence Penalty: "new topic encourager". Pushes the model to bring up new topics.
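Fun Fact: Temperature = 0 doesn't guarantee identical answers! Other sources of randomness remain in the system, such as floating-point differences in how the hardware processes requests. A fixed "seed" value, where the API offers one, makes outputs more reproducible, though providers generally treat this as best effort rather than a guarantee.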
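To see what a seed does in isolation, here is a minimal local sketch using Python's standard `random` module. The `sample_tokens` helper and the toy distribution are hypothetical; the example only shows that a seeded random generator replays the same draws, and it says nothing about the other sources of variation in a hosted model.

```python
import random

def sample_tokens(probs, n, seed=None):
    """Draw n tokens from a fixed distribution with a local random generator."""
    rng = random.Random(seed)  # same seed -> same sequence of random draws
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)

# Hypothetical next-token distribution
probs = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

first = sample_tokens(probs, 10, seed=42)
second = sample_tokens(probs, 10, seed=42)
print(first == second)  # identical draws because the seed is the same

unseeded = sample_tokens(probs, 10)  # no seed: draws can differ between runs
print(unseeded)
```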
Try It Yourself!
Move the sliders and watch how the model's behavior changes. Try temperature 0 vs 2 — the difference is dramatic!
{
  "temperature": 0.7,
  "top_p": 0.90,
  "max_tokens": 10
}
Temperature and Top-P work together: low temperature (0.1-0.3) for facts and code, medium (0.5-0.7) for most tasks, high (0.8-1.2) for creative work. Avoid temperature above 1.5, where responses tend to become nonsensical.
Frequently asked questions
What is temperature in an LLM?
Temperature is a sampling parameter that rescales the probability distribution before the next token is chosen. Low values (0–0.3) make answers focused and predictable, while high values (0.8–1.5) make them more varied and creative. Most APIs accept values from 0 to 2.
How does top-p differ from top-k?
Top-p (nucleus sampling) keeps the smallest set of tokens whose probabilities sum to P (e.g. 0.9), and that set adapts to how confident the model is. Top-k always keeps exactly the K most likely tokens (e.g. 40) regardless of confidence. Top-p is adaptive, while top-k is fixed by count.
Which temperature should I use for different tasks?
Use temperature 0 for data extraction, classification, and math, where you want the same correct answer every time. Code generation works well around 0.2–0.4, while brainstorming and creative writing benefit from roughly 0.8 or higher, where variety is the goal. As a rule, adjust either temperature or top-p, not both at once.
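One way to keep these recommendations in one place is a small lookup table. This is only a sketch: the task names, the `pick_settings` helper, and the fallback of 0.7 for unlisted tasks are assumptions made for the example, and the values are the starting points quoted in this lesson rather than anything a particular API requires.

```python
# Starting points taken from the guidance above; tune them for your own model.
TEMPERATURE_PRESETS = {
    "extraction": 0.0,
    "classification": 0.0,
    "math": 0.0,
    "code": 0.3,
    "brainstorming": 0.8,
    "creative_writing": 0.8,
}

def pick_settings(task, default_temperature=0.7):
    """Return sampling settings for a task, changing temperature only.

    top_p is deliberately left out so the provider's default applies,
    following the rule of adjusting one knob at a time.
    """
    temperature = TEMPERATURE_PRESETS.get(task, default_temperature)
    return {"temperature": temperature}

print(pick_settings("classification"))
print(pick_settings("code"))
print(pick_settings("summarization"))  # unlisted task falls back to the default
```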
What is the difference between frequency penalty and presence penalty?
The frequency penalty lowers a token's score the more often it has already appeared, curbing word-for-word repetition. The presence penalty applies a flat reduction as soon as a token appears at all, nudging the model toward new topics. Both penalties typically range from 0 to 2.
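The difference between the two penalties is easiest to see on numbers. The sketch below assumes a simple subtractive scheme, in which the frequency penalty is multiplied by a token's count so far and the presence penalty is subtracted once if the token has appeared at all. Some API providers document a formula of this shape, but the exact arithmetic varies, and the function name and scores here are hypothetical.

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the scores of tokens that already appear in the generated text."""
    counts = Counter(generated)
    adjusted = {}
    for token, score in logits.items():
        count = counts.get(token, 0)
        score -= count * frequency_penalty  # grows with every repeat
        if count > 0:
            score -= presence_penalty       # flat, applied once
        adjusted[token] = score
    return adjusted

# Three candidates start with equal scores (hypothetical values)
logits = {"cat": 2.0, "dog": 2.0, "bird": 2.0}
generated = ["cat", "cat", "cat", "dog"]  # text produced so far

adjusted = apply_penalties(logits, generated,
                           frequency_penalty=0.5, presence_penalty=0.4)
for token, score in adjusted.items():
    print(token, round(score, 2))
```

Here "cat" (used three times) loses the most, "dog" (used once) loses a little, and "bird" (never used) keeps its original score, so the model is nudged toward the unused word.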
Try it yourself
Prompt: Write a function to calculate the factorial of a number
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n-1)

def factorial(n: int) -> int:
    """Calculate factorial of n.

    Args:
        n: Non-negative integer

    Returns:
        Factorial of n

    Raises:
        ValueError: If n is negative
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)
For code, a low temperature (0.1) tends to produce more consistent, documented, and safe results.