Test-Time Compute Scaling
Make the model think longer instead of making it bigger
The Problem: Your model gets a hard math or coding problem wrong on the first try. The obvious fix — train a bigger model — is slow and expensive, and you already shipped these weights. But the right answer is often somewhere in the model: it just needs more attempts and a way to tell the good attempt from the bad ones. How do you extract better answers from the same weights?
The Solution: Test-Time Compute — Think Longer, Not Bigger
There are two ways to make a model better at a hard question. One is to make the model bigger: more parameters, more training. The other is to let it think longer at inference, spending more compute per query without changing the weights. Instead of committing to the first answer, the model samples many reasoning paths, optionally searches over steps, and self-verifies. It then aggregates the candidates, either by majority vote (self-consistency) or by picking the one a verifier scores highest (best-of-N). Accuracy rises with the thinking budget, but with diminishing returns: the curve eventually flattens. This inference-time scaling is the axis behind 2024-2025 reasoning models like o1, o3, and DeepSeek-R1.
Think of it like an exam where you can either be a genius with a bigger brain, or simply get more time and scratch paper. For hard problems, more time and a way to check your work often beat raw talent:
- 1. Generate multiple reasoning paths: Sample the same question N times with non-zero temperature so each run explores a different chain of thought. This is parallel scaling — N independent attempts at the same problem
- 2. Score or verify each path: For best-of-N, run a verifier or reward model (or unit tests for code) to rate each candidate. For self-consistency, just extract the final answer from each path — no verifier needed
- 3. Aggregate the candidates: Majority-vote the extracted answers (self-consistency) or pick the single best one the verifier approves (best-of-N). Aggregation is where scattered errors cancel out and the consensus answer wins
- 4. For harder problems, search deeper: When a problem is hard enough that flat sampling fails, switch to beam or tree search over reasoning steps, expanding promising branches with more budget. This is a more sequential form of scaling, since each expansion depends on the steps before it. Stop once the accuracy curve flattens; extra compute past that point is wasted
Best-of-N needs a verifier or reward model to score candidates; self-consistency needs none and just takes the majority answer. Both cost roughly N times the compute of a single answer, so reserve that spend for the queries that actually need it.
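The sample, score, and aggregate steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `toy_sampler` and `toy_verifier` are hypothetical stand-ins for a real model call and a real checker, and the 60% success rate is an assumed number chosen only to make the vote visible.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Majority vote over extracted final answers.

    Ties go to the answer that appeared first (Counter keeps insertion order).
    """
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the verifier scores highest."""
    return max(candidates, key=score)


def toy_sampler(rng: random.Random) -> str:
    """Hypothetical stand-in for the final answer of one sampled reasoning path.

    Assumption: correct ("42") 60% of the time, otherwise one of several
    different wrong answers, so errors scatter instead of piling up.
    """
    if rng.random() < 0.6:
        return "42"
    return rng.choice(["41", "40", "7"])


def toy_verifier(answer: str) -> float:
    """Hypothetical checker: 1.0 if the answer passes the check, else 0.0."""
    return 1.0 if answer == "42" else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [toy_sampler(rng) for _ in range(15)]  # step 1: N = 15 paths
    print("samples:", samples)
    # Steps 2-3 without a verifier: vote on the extracted answers.
    print("self-consistency:", self_consistency(samples))
    # Steps 2-3 with a verifier: keep the highest-scoring candidate.
    print("best-of-N:", best_of_n(samples, toy_verifier))
```

In a real system the sampler would call the model with non-zero temperature and the verifier would be a reward model or a test suite; the aggregation functions would stay the same.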
Where Test-Time Compute Shines
- Math & Coding Under a Quality Bar: Competition math, algorithm design, and unit-tested code have checkable answers. Sample N solutions, run the verifier or tests, and keep the one that passes — accuracy jumps without touching the weights
- Agentic Planning with Verification: An agent can draft several plans, simulate or critique each, and execute the one that survives review. Extra inference compute buys reliability on multi-step tasks where a single greedy plan often fails
- Hard Reasoning Tasks: For genuinely hard problems, a single sampled attempt rarely lands on the right chain of reasoning. Sampling many paths and majority-voting (self-consistency) recovers the correct answer when good paths agree and errors scatter
- Budget-Controlled Quality: Test-time compute is a dial, not a fixed cost. Spend extra samples and search only on the queries that need it — easy questions get one shot, hard ones get a larger thinking budget — keeping average cost low
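One way to treat the budget as a dial is to stop sampling as soon as the vote is already decided. The sketch below is an illustrative stopping rule, not something the lesson prescribes: `adaptive_vote` and its `sample_once` argument are hypothetical names, and the rule simply stops when the leading answer can no longer be overtaken by the samples that remain.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_vote(sample_once: Callable[[], str], max_n: int) -> Tuple[str, int]:
    """Sample up to max_n answers, stopping once the leader cannot be overtaken.

    Returns (winning answer, number of samples actually used).
    """
    if max_n < 1:
        raise ValueError("max_n must be at least 1")
    votes: Counter = Counter()
    for used in range(1, max_n + 1):
        votes[sample_once()] += 1
        ranked = votes.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        lead = ranked[0][1] - runner_up
        # If the lead exceeds the samples still available, the vote is settled.
        if lead > max_n - used:
            return ranked[0][0], used
    # Budget exhausted with a tie: fall back to the first-seen top answer.
    return votes.most_common(1)[0][0], max_n


if __name__ == "__main__":
    # An "easy" query whose samples always agree stops after 5 of 9 samples.
    print(adaptive_vote(lambda: "42", max_n=9))
```

Queries whose samples agree finish early, while contested ones use the full budget, which keeps the average cost below N samples per query.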
Fun Fact: Compute-optimal studies found that for many problems a smaller model given more test-time compute can match or beat a model over 10x larger answering once — at lower total cost. The catch: it works best when the task has a checkable answer, so a verifier or majority vote can reliably pick the winner among the samples.
Frequently asked questions
What is test-time compute scaling?
Test-time (inference-time) compute scaling means spending more compute per query at inference instead of training a model with more parameters. Rather than generating one answer, the model samples many candidate reasoning paths, searches over steps, and self-verifies, then aggregates the results into a better final answer. The model's weights are unchanged; you simply give it more 'thinking budget' per question. This is the axis behind 2024-2025 reasoning models like OpenAI o1/o3 and DeepSeek-R1, whose accuracy climbs as you allow more inference compute, with diminishing returns.
What is the difference between best-of-N and self-consistency?
Both generate N candidate answers, but they select differently. Best-of-N uses a separate scorer or verifier (a reward model or a checker) to rate each candidate and picks the single highest-scoring one — useful when you have a reliable verifier. Self-consistency skips the verifier: it samples N independent chains-of-thought, extracts the final answer from each, and takes the majority vote. Self-consistency works well when correct reasoning paths agree on the answer while errors are scattered, so the right answer wins the vote. Best-of-N can beat it when a strong verifier exists; self-consistency is simpler and needs no extra model.
When does a small model with more compute beat a bigger model?
Compute-optimal allocation studies show that for many hard but verifiable problems, a smaller model given more test-time compute (more samples plus search and verification) can match or beat a larger model answering once — at lower total cost. This holds best when the task has a checkable answer (math, code, logic) so a verifier or majority vote can reliably pick the winner, and when the small model's single-shot accuracy is high enough that some of its N samples land on the correct answer. For very hard problems where the small model almost never finds the right path, scaling the model is still better. The practical rule: spend extra inference compute only on the queries that need it, and cap it once the accuracy curve flattens.
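A back-of-the-envelope model shows why the small model's single-shot accuracy matters. Assuming the N samples are independent and a perfect verifier picks out any correct one (both idealizations, not claims from the studies), the chance that at least one sample is correct is 1 − (1 − p)^N. The values of p below are illustrative only.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct,
    given single-shot accuracy p. Assumes a perfect verifier selects it."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Illustrative single-shot accuracies: a workable one and a very low one.
    for p in (0.30, 0.01):
        previous = 0.0
        for n in (1, 4, 16, 64):
            c = coverage(p, n)
            # The gain column shrinks as n grows: diminishing returns.
            print(f"p={p:.2f} N={n:>2} coverage={c:.3f} gain={c - previous:+.3f}")
            previous = c
```

Under these assumptions a model with p = 0.30 is almost always covered by 16 samples, while one with p = 0.01 still misses most of the time, which matches the point that extra samples cannot rescue a model that almost never finds the right path.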
Try it yourself
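Solve a tricky problem where a single greedy answer is often wrong: in a room of 23 people, what is the probability that at least two share a birthday?
Single greedy answer (wrong): The probability is about 23/365 ≈ 6.3%. Since there are 23 people and 365 days, the chance of a match is small.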
5 reasoning paths:
- Path 1: P = 1 − 365!/(342!·365²³) ≈ 50.7%
- Path 2: complement, 1 − ∏(1 − k/365) over k = 1..22, ≈ 50.7%
- Path 3: wrongly uses 23/365 ≈ 6.3% (naive)
- Path 4: 1 − (364/365)^(23·22/2) ≈ 50.0%
- Path 5: exact calculation ≈ 50.7%
Majority vote: 3 of 5 paths give ≈ 50.7%, and Path 4's pairwise approximation lands close by at ≈ 50.0%. The naive outlier (6.3%) loses.
Answer: ≈ 50.7%. The chance of a shared birthday already exceeds 50% at 23 people because every pair of people can match (253 pairs), not only the pairs that include one particular person.
Test-time compute does not change the weights; it samples several paths and votes. The correct answer wins because the valid chains agree while the naive error stands alone and is outvoted.
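The numbers in this example can be checked directly. The sketch below recomputes the exact value, the pairwise approximation used by Path 4, and the naive guess, then votes on the answers rounded to one decimal place. The function names are illustrative, and the calculation assumes 365 equally likely, independent birthdays.

```python
import math
from collections import Counter


def exact_shared_birthday(n: int, days: int = 365) -> float:
    """P(at least two of n people share a birthday): 1 - P(all distinct)."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (days - k) / days
    return 1.0 - p_distinct


def pairwise_approx(n: int, days: int = 365) -> float:
    """Path 4's shortcut: treat each of the C(n, 2) pairs as independent."""
    return 1.0 - ((days - 1) / days) ** math.comb(n, 2)


def naive_guess(n: int, days: int = 365) -> float:
    """The wrong path: compares everyone against a single fixed day."""
    return n / days


if __name__ == "__main__":
    n = 23
    exact = exact_shared_birthday(n)
    # Five paths: three exact derivations, one approximation, one naive guess.
    paths = [exact, exact, naive_guess(n), pairwise_approx(n), exact]
    answers = [round(100 * p, 1) for p in paths]  # percentages, one decimal
    winner, votes = Counter(answers).most_common(1)[0]
    print("answers (%):", answers)
    print(f"majority: {winner}% with {votes} of {len(answers)} votes")
```

Voting on answers rounded to one decimal counts the approximation separately from the exact paths; a looser tolerance would group it with them. Either way the naive answer is outvoted.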