Lesson 23

Test-Time Compute Scaling

Make the model think longer instead of making it bigger

The Problem: Your model gets a hard math or coding problem wrong on the first try. The obvious fix — train a bigger model — is slow and expensive, and you already shipped these weights. But the right answer is often somewhere in the model: it just needs more attempts and a way to tell the good attempt from the bad ones. How do you extract better answers from the same weights?

The Solution: Test-Time Compute — Think Longer, Not Bigger

There are two ways to make a model better at a hard question. One is to make the model bigger: more parameters, more training. The other is to let it think longer at inference, spending more compute per query without changing the weights. Instead of committing to the first answer, the model samples many reasoning paths, optionally searches over steps, and self-verifies. It then aggregates the candidates by majority vote (self-consistency) or by picking the one a verifier scores highest (best-of-N). Accuracy rises with the thinking budget but with diminishing returns, so the curve eventually flattens. Inference-time scaling is the axis behind the 2024-2025 wave of reasoning models such as OpenAI o1, o3, and DeepSeek-R1.
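To see where the diminishing returns come from, consider an idealized best-of-N setup. This is an assumption for illustration, not a claim about any particular model: each sample is independently correct with probability p, and a perfect verifier always picks a correct sample when one exists. Then:

```latex
\[
\mathrm{Acc}(N) = 1 - (1 - p)^{N},
\qquad
\mathrm{Acc}(N+1) - \mathrm{Acc}(N) = p\,(1 - p)^{N}.
\]
```

With p = 0.3, accuracy goes 0.30, 0.51, 0.657 for N = 1, 2, 3, so each extra sample adds less than the one before. Real samples are correlated and real verifiers make mistakes, so measured curves typically flatten sooner than this bound suggests.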

Think of it like an exam where you can either be a genius with a bigger brain or simply get more time and scratch paper. For hard problems, more time plus a way to check your work often beats raw talent. The recipe has four steps:

1. Generate multiple reasoning paths: Sample the same question N times with non-zero temperature so each run explores a different chain of thought. This is parallel scaling — N independent attempts at the same problem
2. Score or verify each path: For best-of-N, run a verifier or reward model (or unit tests for code) to rate each candidate. For self-consistency, just extract the final answer from each path — no verifier needed
3. Aggregate the candidates: Majority-vote the extracted answers (self-consistency) or pick the single best one the verifier approves (best-of-N). Aggregation is where scattered errors cancel out and the consensus answer wins
4. For harder problems, search deeper: When a problem is hard enough that flat sampling fails, switch to search over reasoning steps: beam or tree search that scores partial chains and expands the promising branches with more budget. Unlike parallel sampling, each step here depends on the ones before it (sequential scaling). Stop once the accuracy curve flattens, because extra compute past that point is wasted
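Step 4 can be sketched with a toy beam search. Everything here is hypothetical and chosen only to keep the example runnable: the "reasoning steps" are arithmetic moves (+3 or ×2), and the `score` function stands in for a step-level verifier or process reward model.

```python
from typing import Callable, List, Optional, Tuple

Path = Tuple[int, ...]  # a partial "reasoning path": the values visited so far


def beam_search(
    start: int,
    expand: Callable[[int], List[int]],
    score: Callable[[int], float],
    is_goal: Callable[[int], bool],
    beam_width: int = 3,
    max_steps: int = 8,
) -> Optional[Path]:
    """Keep only the `beam_width` most promising partial paths at each step."""
    beam: List[Path] = [(start,)]
    for _ in range(max_steps):
        candidates: List[Path] = []
        for path in beam:
            for nxt in expand(path[-1]):
                new_path = path + (nxt,)
                if is_goal(nxt):
                    return new_path
                candidates.append(new_path)
        if not candidates:
            return None
        # Higher score means more promising; everything else is pruned.
        candidates.sort(key=lambda p: score(p[-1]), reverse=True)
        beam = candidates[:beam_width]
    return None  # budget exhausted without reaching the goal


TARGET = 22  # toy goal: reach 22 starting from 1
path = beam_search(
    start=1,
    expand=lambda v: [v + 3, v * 2],       # hypothetical "next reasoning steps"
    score=lambda v: -abs(TARGET - v),      # hypothetical step scorer
    is_goal=lambda v: v == TARGET,
)
print(path)
```

The beam width and step limit are the budget knobs: a wider beam explores more branches per step at proportionally higher cost.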

Best-of-N needs a verifier or reward model to score candidates; self-consistency needs none and just takes the majority answer. Both cost roughly N times the generation compute of a single answer, and best-of-N adds the cost of N verifier calls, so reserve that spend for the queries that actually need it.
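The two aggregation strategies can be sketched in a few lines. This is a minimal illustration, not a production implementation: the sampled chains are hard-coded stand-ins for model outputs, the `Answer: <value>` convention is an assumption, and the lambda verifier is a placeholder for a reward model or a unit-test run.

```python
import re
from collections import Counter
from typing import Callable, List, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Pull the final answer from a chain that ends with 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(\S+)\s*$", chain)
    return match.group(1) if match else None


def self_consistency(chains: List[str]) -> str:
    """Majority vote over extracted answers; no verifier involved."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        raise ValueError("no chain produced a parseable answer")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(chains: List[str], verifier: Callable[[str], float]) -> str:
    """Return the single chain the verifier scores highest."""
    return max(chains, key=verifier)


# Hypothetical samples for the question "What is 17 * 3 - 9?"
chains = [
    "17 * 3 = 51, and 51 - 9 = 42. Answer: 42",
    "17 * 3 = 41, and 41 - 9 = 32. Answer: 32",   # arithmetic slip
    "3 * 17 = 51; subtract 9 to get 42. Answer: 42",
    "17 + 3 = 20, and 20 - 9 = 11. Answer: 11",   # misread the operator
    "51 - 9 = 42. Answer: 42",
]

voted = self_consistency(chains)
picked = best_of_n(
    chains,
    # Placeholder verifier: checks the answer against a known ground truth.
    verifier=lambda c: 1.0 if extract_answer(c) == str(17 * 3 - 9) else 0.0,
)
print(voted)
print(picked)
```

Self-consistency only needs the answers to be comparable after extraction, while best-of-N is only as good as the verifier it is given.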

Where Test-Time Compute Shines

Math & Coding Under a Quality Bar: Competition math, algorithm design, and unit-tested code have checkable answers. Sample N solutions, run the verifier or tests, and keep the one that passes — accuracy jumps without touching the weights
Agentic Planning with Verification: An agent can draft several plans, simulate or critique each, and execute the one that survives review. Extra inference compute buys reliability on multi-step tasks where a single greedy plan often fails
Hard Reasoning Tasks: For genuinely hard problems, a single sampled chain rarely lands on the right line of reasoning. Sampling many paths and majority-voting (self-consistency) recovers the correct answer when good paths agree and errors scatter
Budget-Controlled Quality: Test-time compute is a dial, not a fixed cost. Spend extra samples and search only on the queries that need it — easy questions get one shot, hard ones get a larger thinking budget — keeping average cost low
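One way to treat the budget as a dial is to stop sampling as soon as the vote is decisive. The sketch below is one possible policy under stated assumptions: the thresholds are arbitrary, and the `sample` callable is a hypothetical stand-in for "draw one chain and extract its answer".

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_vote(
    sample: Callable[[], str],
    min_samples: int = 3,
    max_samples: int = 16,
    agree: float = 0.75,
) -> Tuple[str, int]:
    """Sample until the leading answer holds an `agree` share of the votes,
    or until the budget runs out. Returns (answer, samples_used)."""
    votes: Counter = Counter()
    for n in range(1, max_samples + 1):
        votes[sample()] += 1
        leader, count = votes.most_common(1)[0]
        if n >= min_samples and count / n >= agree:
            return leader, n
    return votes.most_common(1)[0][0], max_samples


# Hypothetical answer streams: an easy query where every chain agrees,
# and a harder one where early samples disagree.
easy_stream = iter(["A"] * 16)
hard_stream = iter(["A", "B", "A", "C", "A", "A", "A", "A"] + ["A"] * 8)

easy = adaptive_vote(lambda: next(easy_stream))
hard = adaptive_vote(lambda: next(hard_stream))
print(easy, hard)
```

The easy query stops at the minimum number of samples, while the contested one keeps drawing until the leader clears the threshold, which is what keeps the average cost low.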

Fun Fact: Compute-optimal studies have found that, on many problems, a smaller model given more test-time compute can match or beat a model over 10x larger answering once, often at comparable or lower total compute. The catch: it works best when the task has a checkable answer, so a verifier or majority vote can reliably pick the winner among the samples.

Try It Yourself!

[Interactive widget: "Thinking Budget: More Compute, Higher Accuracy". A dial sets the thinking budget (N samples) from single shot to deep search and reports accuracy, relative cost, and marginal gain. Accuracy climbs as more reasoning paths are sampled, then flattens (diminishing returns). At the default setting, N = 1, the widget shows 57% accuracy at 1× relative cost.]

Frequently asked questions

What is test-time compute scaling?

Test-time (inference-time) compute scaling means spending more compute per query at inference instead of training a model with more parameters. Rather than generating one answer, the model samples many candidate reasoning paths, searches over steps, and self-verifies, then aggregates the results into a better final answer. The model's weights are unchanged; you simply give it more 'thinking budget' per question. This is the axis behind the 2024-2025 wave of reasoning models such as OpenAI o1/o3 and DeepSeek-R1, whose accuracy keeps climbing, up to a point, as you allow more inference compute.

What is the difference between best-of-N and self-consistency?

Both generate N candidate answers, but they select differently. Best-of-N uses a separate scorer or verifier (a reward model or a checker) to rate each candidate and picks the single highest-scoring one — useful when you have a reliable verifier. Self-consistency skips the verifier: it samples N independent chains-of-thought, extracts the final answer from each, and takes the majority vote. Self-consistency works well when correct reasoning paths agree on the answer while errors are scattered, so the right answer wins the vote. Best-of-N can beat it when a strong verifier exists; self-consistency is simpler and needs no extra model.

When does a small model with more compute beat a bigger model?

Compute-optimal allocation studies show that for many hard but verifiable problems, a smaller model given more test-time compute (more samples plus search and verification) can match or beat a larger model answering once — at lower total cost. This holds best when the task has a checkable answer (math, code, logic) so a verifier or majority vote can reliably pick the winner, and when the small model's single-shot accuracy is high enough that some of its N samples land on the correct answer. For very hard problems where the small model almost never finds the right path, scaling the model is still better. The practical rule: spend extra inference compute only on the queries that need it, and cap it once the accuracy curve flattens.
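A rough way to reason about this trade-off is to compare generation cost directly. The sketch below uses the common rule of thumb of about 2 FLOPs per parameter per generated token for a dense model; the model sizes, token count, and sample count are hypothetical, and verifier and prompt-processing costs are ignored.

```python
def inference_flops(params: float, tokens: int, samples: int = 1) -> float:
    """Rule-of-thumb generation cost: ~2 FLOPs per parameter per token."""
    return 2.0 * params * tokens * samples


SMALL_PARAMS = 3e9    # hypothetical 3B-parameter model
BIG_PARAMS = 70e9     # hypothetical 70B-parameter model
TOKENS = 500          # assumed tokens generated per answer

small_cost = inference_flops(SMALL_PARAMS, TOKENS, samples=8)
big_cost = inference_flops(BIG_PARAMS, TOKENS, samples=1)

ratio = small_cost / big_cost                     # small-model cost as a fraction of big-model cost
max_samples = int(BIG_PARAMS // SMALL_PARAMS)     # samples the small model can afford at equal cost

print(f"8 samples from the small model cost {ratio:.0%} of one big-model answer")
print(f"break-even budget: {max_samples} samples")
```

Under these assumptions the small model can draw about 23 samples before it costs as much as one answer from the large one. Whether those samples are worth it depends on the small model's single-shot accuracy and the quality of the verifier.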

Technique Comparison (demo with pre-recorded responses)

Task (Intermediate, Math)

Solve a tricky problem where a single greedy answer is often wrong

Without technique

Prompt

Solve: in a room of 23 people, what is the probability that at least two share a birthday? Give the answer.

Response

The probability is about 23/365 ≈ 6.3%. Since there are 23 people and 365 days, the chance of a match is small.

Tokens: 38/34

Time: 380 ms

With technique

Prompt

Solve the birthday-paradox problem for 23 people using test-time compute: generate 5 independent reasoning chains, extract the final answer from each, and take the majority vote (self-consistency). Show the spread.

Response

5 reasoning paths:

Path 1: P = 1 − 365!/(342!·365²³) ≈ 50.7%
Path 2: complement, 1 − ∏(1 − k/365) for k = 1 to 22 ≈ 50.7%
Path 3: wrongly uses 23/365 ≈ 6.3% (naive)
Path 4: 1 − (364/365)^(23·22/2) ≈ 50.0%
Path 5: exact calculation ≈ 50.7%

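Majority vote: 3 of 5 paths give 50.7% exactly, and a fourth (the pairwise approximation in Path 4) lands at 50.0%, so 4 of 5 agree on roughly 50%. The lone naive answer (6.3%) loses.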

Answer: ≈ 50.7% — the chance of a shared birthday already exceeds 50% at 23 people, because ALL pairs are counted (253 pairs), not a comparison against one person.

👁️ A single greedy answer fell into the classic trap: it compared 23 to 365 and gave 6.3%

🧠 Sample 5 independent paths: 3 reach 50.7%, a fourth approximates it at 50.0%, and only the naive path is far off

✅ Majority vote selects the consensus answer, 50.7%, and discards the outlier

Tokens: 58/196

Time: 2100 ms

Why this works

Test-time compute does not change the weights; it samples several paths and votes. The correct answer wins because the valid chains agree, while the naive error gets only one vote.
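The numbers in this demo can be checked directly. The short script below recomputes the exact probability, the pairwise approximation used in Path 4, and the naive 23/365 figure, then runs an exact-match vote on answers rounded to one decimal place. The rounding rule is an assumption made for this sketch.

```python
from collections import Counter


def exact_shared_birthday(n: int, days: int = 365) -> float:
    """1 minus the probability that all n birthdays are distinct."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (days - k) / days
    return 1.0 - p_distinct


def pairwise_approx(n: int, days: int = 365) -> float:
    """Treats all n*(n-1)/2 pairs as independent (an approximation)."""
    pairs = n * (n - 1) // 2
    return 1.0 - ((days - 1) / days) ** pairs


exact = exact_shared_birthday(23)
approx = pairwise_approx(23)
naive = 23 / 365

# The five paths from the demo, in order: exact, exact, naive, approx, exact.
path_answers = [exact, exact, naive, approx, exact]
votes = Counter(f"{100 * a:.1f}%" for a in path_answers)

print(votes)
print("winner:", votes.most_common(1)[0][0])
```

With exact-match voting, 50.7% gets three votes, and the approximation (50.0%) counts as a separate answer. This is why numeric answers are often normalized or bucketed before voting.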


Related lessons:Reasoning Models Inference Self Consistency

This lesson is part of a structured LLM course.
