Lesson 5

LLM Settings

Temperature, Top-p & more

The Problem: You ask ChatGPT the same question twice, but get different answers. Sometimes creative, sometimes dry. What's going on?

The Solution: Control Knobs on a Mixing Console

Imagine a DJ mixing console with sliders. Each slider affects the sound differently: bass, treble, volume. LLMs have similar "sliders" that control how text is generated during inference. At each step the model produces a probability distribution over the next possible token, and these settings decide how a single token is picked from that distribution. They don't change what the model knows — only how boldly or cautiously it samples from what it already believes.

Temperature: the creativity dial

Temperature divides the model's raw scores (logits) before they are converted to probabilities, which rescales the distribution that gets sampled. Many APIs accept values from 0 to 2, though some cap the range at 1. Low temperature (0–0.3) sharpens the distribution so the most likely token almost always wins: answers become focused, repeatable, and conservative. High temperature (0.8–1.5) flattens the distribution, giving rarer tokens a real chance and producing more varied, surprising text. Use temperature 0 for data extraction, classification, or math, where you want the most likely answer as consistently as possible, and around 0.8 for brainstorming or creative writing, where variety is the point.
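To make the rescaling concrete, here is a minimal sketch of a temperature-scaled softmax. The function name `softmax_with_temperature` and the three logits are made up for illustration, and treating temperature 0 as "always pick the top token" is an assumption about how the edge case is handled, not a statement about any particular API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, rescaled by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding (all mass on the top token).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as the temperature rises, which is the "sharpen versus flatten" effect described above.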

Top-P, Top-K, and the penalties

Top-P (nucleus sampling) takes a different approach: instead of rescaling, it keeps only the smallest set of tokens whose probabilities add up to at least P (e.g. 0.9 = the top 90% of probability mass) and samples from that set. It adapts automatically: a confident model considers few options, an uncertain one considers many. Top-K is the simpler cousin: it always keeps exactly the K most likely tokens (e.g. K=40), regardless of how confident the model is. A common recipe is to lower temperature OR tighten top-p, not both at once.

The frequency penalty reduces the score of a token in proportion to how often it has already appeared (curbing word-for-word repetition), while the presence penalty applies a flat, one-time reduction once a token has appeared at all (nudging the model toward new topics). Both are usually set between 0 and 2; some APIs also accept negative values, which encourage repetition instead.

The max tokens setting is a hard cap on output length, counted in tokens rather than words. Set it generously for essays and tightly for one-line classifications to save cost and avoid runaway answers.
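The difference between top-k and top-p is easiest to see side by side. The sketch below is an illustration only: `top_k_filter`, `top_p_filter`, and both toy distributions are invented for this example, and real implementations work on full vocabularies of logits rather than small dictionaries.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}

# Hypothetical distributions: one confident, one uncertain.
confident = {"Paris": 0.95, "Lyon": 0.03, "Nice": 0.02}
uncertain = {"red": 0.30, "blue": 0.28, "green": 0.22, "pink": 0.20}

# Top-p pool size adapts to confidence; top-k pool size does not.
print(len(top_p_filter(confident, 0.9)), len(top_p_filter(uncertain, 0.9)))
print(len(top_k_filter(confident, 2)), len(top_k_filter(uncertain, 2)))
```

With P = 0.9 the confident distribution keeps a single token while the uncertain one keeps all four, whereas K = 2 keeps two tokens in both cases.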

Think of it like a DJ mixing console with control sliders:

1. Temperature (0-2): "creativity knob". Low = predictable and focused. High = wild and creative
2. Top-P (0-1): "vocabulary width". Low = only the safest words. High = considers rare options too
3. Max Tokens: "response length limit". The maximum number of tokens (word pieces, not whole words) the model can generate
4. Frequency Penalty: "repetition punisher". Makes the model avoid repeating the same words
5. Presence Penalty: "new topic encourager". Pushes the model to bring up new topics
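The two penalties in the list above can be sketched as a direct adjustment to token scores. This follows one common formulation (subtract a per-occurrence amount plus a one-time amount); the function `apply_penalties`, the scores, and the history are hypothetical, and individual APIs may differ in the details.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with repetition penalties subtracted."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, score in logits.items():
        count = counts.get(token, 0)
        penalty = count * frequency_penalty      # grows with every repeat
        if count > 0:
            penalty += presence_penalty          # flat, applied once
        adjusted[token] = score - penalty
    return adjusted

# Hypothetical scores and generation history.
logits = {"cat": 2.0, "dog": 2.0, "bird": 2.0}
history = ["cat", "cat", "cat", "dog"]
print(apply_penalties(logits, history,
                      frequency_penalty=0.5, presence_penalty=0.3))
```

Here "cat" (seen three times) is pushed down the most, "dog" (seen once) a little, and "bird" (never seen) is left untouched.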

Where Is This Used?

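Fun Fact: Temperature = 0 doesn't guarantee identical answers! There are other sources of randomness in the system, for example floating-point differences across hardware. For more reproducible output, also set a fixed "seed" value where the API offers one; even then, some providers describe determinism as best effort only.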
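What a seed does to the sampling step can be shown locally with Python's standard `random` module. This is only a sketch of the idea: `sample_tokens` and the toy distribution are invented for illustration, and it does not model the other sources of nondeterminism in a hosted system.

```python
import random

def sample_tokens(probs, n, seed=None):
    """Draw n tokens from a fixed distribution using an optionally seeded RNG."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=n)

# Hypothetical next-token distribution.
probs = {"Paris": 0.6, "the": 0.3, "beautiful": 0.1}

print(sample_tokens(probs, 5, seed=42))
print(sample_tokens(probs, 5, seed=42))  # same seed, same draws
print(sample_tokens(probs, 5))           # unseeded, may differ between runs
```

The first two lines always match each other because the generator starts from the same state; the third is free to vary.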

Try It Yourself!

Move the sliders and watch how the model's behavior changes. Try temperature 0 vs 2 — the difference is dramatic!

Demo settings:

Temperature: 0.7 (scale: Focused to Creative)
Top-P (Nucleus): 0.90 (scale: Narrow to Wide)
Max Tokens: slider from Short to Long
Stop Sequences: optional, selected by clicking a chip

Next token selection (softmax)

Top-P = 0.90 → 2 of 8 tokens in pool

Paris: 82.0%
the: 11.1%
(Top-P cutoff)
beautiful: 4.1%
capital: 1.7%
France: 0.7%
city: 0.3%
magnificent: <0.1%
banana: <0.1%

More creativity: less likely words get a chance
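The Top-P cutoff in the demo can be checked by hand: 82.0% + 11.1% = 93.1%, which already reaches the 90% threshold, so only two tokens stay in the pool. The short sketch below repeats that arithmetic. The helper `nucleus` is hypothetical, and the two "<0.1%" entries are assumed to be 0.05% each so that the total comes to 100%.

```python
# Probabilities in percent, as displayed in the demo. The two "<0.1%" entries
# are assumed to be 0.05 each (an assumption, not a value from the demo).
demo = [("Paris", 82.0), ("the", 11.1), ("beautiful", 4.1), ("capital", 1.7),
        ("France", 0.7), ("city", 0.3), ("magnificent", 0.05), ("banana", 0.05)]

def nucleus(tokens, top_p_percent):
    """Return the shortest ranked prefix whose cumulative mass reaches top_p."""
    pool, mass = [], 0.0
    for token, pct in sorted(tokens, key=lambda t: t[1], reverse=True):
        pool.append(token)
        mass += pct
        if mass >= top_p_percent:
            break
    return pool, mass

pool, mass = nucleus(demo, 90.0)
print(pool, round(mass, 1))
```

Raising the threshold to 95 would pull "beautiful" into the pool as well, since 93.1% falls short of it.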

{
  "temperature": 0.7,
  "top_p": 0.90,
  "max_tokens": 10
}

Prompt: "What is the capital of France?"

Generated response (tokens): The | capital | France | Paris | beautiful | city | known | for | its | art | and | culture | END

10 / 20 tokens
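Max tokens and stop sequences both end generation early, just for different reasons. The toy loop below sketches that behavior over a pre-written token list instead of a real model. The function `generate`, the token list, and the reason labels "length", "stop", and "end" are all assumptions made for illustration.

```python
def generate(token_stream, max_tokens, stop_sequences=()):
    """Collect tokens until the cap is hit or a stop sequence appears."""
    text, count = "", 0
    for token in token_stream:
        if count >= max_tokens:
            return text, "length"          # hard cap reached
        text += token
        count += 1
        for stop in stop_sequences:
            if stop and stop in text:
                # Cut the output just before the stop sequence.
                return text[: text.index(stop)], "stop"
    return text, "end"                     # stream finished on its own

# Hypothetical tokens a model might emit one by one.
tokens = ["Paris", " is", " the", " capital", ".", "\n", "It", " has"]

print(generate(tokens, max_tokens=3))
print(generate(tokens, max_tokens=20, stop_sequences=["\n"]))
print(generate(tokens, max_tokens=20))
```

A tight cap truncates mid-sentence, while a stop sequence such as a newline lets the answer finish its first line and then halts.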

Key Insight

Temperature and Top-P both shape how adventurous sampling is, so tune one at a time. For temperature: low (0.1-0.3) for facts and code, medium (0.5-0.7) for most tasks, high (0.8-1.2) for creative work. Avoid temperature > 1.5, where responses tend to become incoherent.

Frequently asked questions

What is temperature in an LLM?

Temperature is a sampling parameter that rescales the probability distribution before the next token is chosen. Low values (0–0.3) make answers focused and predictable, while high values (0.8–1.5) make them more varied and creative. Many APIs accept values from 0 to 2, though some cap the range at 1.

How does top-p differ from top-k?

Top-p (nucleus sampling) keeps the smallest set of tokens whose probabilities sum to at least P (e.g. 0.9), and that set adapts to how confident the model is. Top-k always keeps exactly the K most likely tokens (e.g. 40) regardless of confidence. Top-p is adaptive, while top-k is fixed by count.

Which temperature should I use for different tasks?

Use temperature 0 for data extraction, classification, and math, where you want the most likely answer as consistently as possible. Code generation works well around 0.2–0.4, while brainstorming and creative writing benefit from roughly 0.8 or higher, where variety is the goal. As a rule, adjust either temperature or top-p, not both at once.
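These recommendations could be captured as a small lookup table. The sketch below is one possible arrangement, not a standard: the task names, the helper `settings_for`, the choice of 0.3 for code (the middle of the 0.2–0.4 range), and the default `max_tokens` of 256 are all assumptions. Top-p is deliberately left unset, in line with the advice to adjust only one of the two.

```python
# Hypothetical presets based on the temperature ranges discussed above.
PRESETS = {
    "extraction":     {"temperature": 0.0},
    "classification": {"temperature": 0.0},
    "math":           {"temperature": 0.0},
    "code":           {"temperature": 0.3},  # middle of the 0.2-0.4 range
    "brainstorming":  {"temperature": 0.8},
    "creative":       {"temperature": 0.8},
}

def settings_for(task, max_tokens=256):
    """Return a settings dict for a known task type."""
    if task not in PRESETS:
        raise ValueError(f"unknown task type: {task!r}")
    # top_p is intentionally omitted: only one sampling knob is adjusted.
    return {**PRESETS[task], "max_tokens": max_tokens}

print(settings_for("code"))
print(settings_for("classification", max_tokens=5))
```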

What is the difference between frequency penalty and presence penalty?

The frequency penalty lowers a token's score the more often it has already appeared, curbing word-for-word repetition. The presence penalty applies a flat reduction as soon as a token appears at all, nudging the model toward new topics. Both penalties typically range from 0 to 2.

Try it yourself

Technique Comparison

Task (Beginner, Coding)

Write a function to calculate the factorial of a number

Without technique

Prompt

Write a factorial function in Python.

Response

def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n-1)

Tokens: 15/35

Time: 280 ms

With LLM Settings

With technique

Prompt

Write a factorial function in Python. [Settings: temperature=0.1, top_p=0.95]

Response

def factorial(n: int) -> int:
    """Calculate factorial of n.
    
    Args:
        n: Non-negative integer
    Returns:
        Factorial of n
    Raises:
        ValueError: If n is negative
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

👁️ Low temperature gives more predictable code

👁️ Model added type hints and docstring

✅ Error handling included

Tokens: 25/95

Time: 420 ms

Why this works

Low temperature (0.1) keeps the model close to its most likely output, so code results are more consistent from run to run. It does not by itself guarantee documented or safe code.

Related lessons: Decoding Inference

This lesson is part of a structured LLM course.
