Self-Consistency — Improving Accuracy via Voting
Vote for the best answer
The Problem: AI can be inconsistent — ask the same question twice and you might get different answers. How can we increase confidence in the result?
The Solution: Ask Multiple Experts
Self-Consistency means generating multiple reasoning paths for the same prompt and picking the most common final answer. Instead of trusting one response, you ask the model to solve the problem several times and then take a "vote" on the answer that shows up most often. It builds directly on Chain-of-Thought: where standard CoT asks for a single step-by-step solution, Self-Consistency samples many such chains and keeps the answer they agree on.
How it works
The mechanism relies on temperature, the parameter that controls how random the model's sampling is. With temperature near 0 the model is nearly deterministic and would just repeat itself, so Self-Consistency uses a higher setting (typically 0.5–1.0) to make each run explore a slightly different line of reasoning. You generate, say, 5–40 completions, extract the final answer from each one, group identical answers together, and select the answer that appears in the most chains. The intuition is that a question usually has one correct answer reachable by several valid routes, but each wrong answer tends to come from a different mistake — so the correct answer accumulates votes while errors scatter.
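The "votes accumulate, errors scatter" intuition can be checked with a small simulation. This is only a sketch under stated assumptions: no model is called, each simulated chain is independently correct with probability `p_correct`, and every mistake lands on one of `n_wrong_answers` distinct wrong answers. All names and default values here are illustrative, not taken from the technique's original description.

```python
import random
from collections import Counter

def simulate_vote(n_samples, p_correct, n_wrong_answers, rng):
    """Sample n_samples simulated answers; return True if the most common one is correct."""
    answers = []
    for _ in range(n_samples):
        if rng.random() < p_correct:
            answers.append("correct")
        else:
            # Assumption: each mistake picks one of several different wrong answers.
            answers.append(f"wrong_{rng.randrange(n_wrong_answers)}")
    winner, _ = Counter(answers).most_common(1)[0]
    return winner == "correct"

def vote_accuracy(n_samples, p_correct=0.5, n_wrong_answers=4, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is the correct one."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    wins = sum(
        simulate_vote(n_samples, p_correct, n_wrong_answers, rng)
        for _ in range(trials)
    )
    return wins / trials

for n in (1, 5, 10, 40):
    print(n, round(vote_accuracy(n), 3))
```

With these assumed numbers a single chain is right about half the time, while the voted answer is right far more often as the sample count grows. Real model errors are rarely this independent, so treat the output as an illustration of the mechanism rather than a prediction.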
When to use it, and the tradeoffs
Self-Consistency shines on tasks with a single verifiable answer: arithmetic, logic puzzles, commonsense reasoning, and coding problems. It is a poor fit for open-ended work like creative writing or summarization, where there is no "correct" answer to vote on. The main cost is obvious: running the prompt N times multiplies your token spend by roughly N (and your latency too, if the calls run one after another), with diminishing returns past ~10 samples. It also cannot rescue a model that is confidently wrong in the same way every time: if a hallucination is systematic rather than random, the majority will simply vote for the wrong answer.
Worked example: ask "A shirt costs $40 after a 20% discount. What was the original price?" five times. Three chains correctly compute 40 ÷ 0.8 = $50, while two slip up and answer $48 (adding 20% to $40 instead of dividing by 0.8). Majority voting returns $50, the right answer, even though two of the five individual attempts were wrong.
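The shirt example can be written out as a vote in a few lines. This is a minimal sketch: the five answers are typed in by hand to match the example, and `majority_vote` is an illustrative helper name, not part of any library.

```python
from collections import Counter

def majority_vote(answers):
    """Return (winning answer, share of samples that agree with it)."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

# Final answers from the five chains in the shirt example (hand-entered).
shirt_answers = [50, 48, 50, 50, 48]
print(majority_vote(shirt_answers))  # (50, 0.6)

# Systematic error: every chain makes the same mistake, so voting cannot help.
print(majority_vote([48, 48, 48, 48, 48]))  # (48, 1.0)
```

The agreement share is worth keeping alongside the winner: a 3-of-5 win is weaker evidence than a 5-of-5 win, although the second call shows that even full agreement proves nothing when the error is systematic.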
Think of it like consulting multiple experts:
- Expert 1: "I think the answer is 42, because..."
- Expert 2: "I calculated 42 using a different method..."
- Expert 3: "My approach gives 38, here's why..."
- Consensus: Two out of three say 42, so that's our answer!
Where Is This Used?
- Math Problems: Complex calculations where mistakes are likely
- Medical Diagnosis: Getting second and third opinions
- Code Review: Multiple analyses of potential bugs
- High-Stakes Decisions: Any task where accuracy is critical
Fun Fact: Self-consistency often adds several percentage points of accuracy on reasoning tasks, though the size of the gain depends on the model and the benchmark. The key is to use "temperature" (randomness) so each attempt takes a slightly different path. Usually 5-10 samples are enough.
Instead of relying on a single answer, generate multiple reasoning paths (5-40 samples) at high temperature, then pick the most common final answer via majority voting.
Each sample produces a reasoning chain → final answer. Answers are grouped by value. The answer appearing in the most samples wins. Ties are broken by confidence or the first occurrence.
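One way the tie-breaking rule could look in code is sketched below. It assumes each sample may come with a numeric confidence score; how such a score is obtained is not specified here, so `confidences` is a hypothetical input and `vote_with_tiebreak` an illustrative name.

```python
from collections import Counter

def vote_with_tiebreak(answers, confidences=None):
    """Pick the most frequent answer.

    Ties are broken by summed confidence (if given), then by earliest occurrence.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    first_seen = {}
    conf_sum = {}
    for i, answer in enumerate(answers):
        first_seen.setdefault(answer, i)  # remember where each answer first appeared
        score = confidences[i] if confidences is not None else 0.0
        conf_sum[answer] = conf_sum.get(answer, 0.0) + score
    # Compare by vote count, then total confidence, then earliest position.
    return max(counts, key=lambda a: (counts[a], conf_sum[a], -first_seen[a]))

print(vote_with_tiebreak(["6", "7", "7", "6"]))                        # 6
print(vote_with_tiebreak(["6", "7", "7", "6"], [0.2, 0.9, 0.8, 0.3]))  # 7
```

Using an odd number of samples makes two-way ties less likely, but with more than two distinct answers a tie can still occur, so an explicit rule is useful.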
5 samples: ~5x cost, moderate improvement. 10 samples: sweet spot for most tasks. 40 samples: marginal gains. Set temperature 0.7-1.0 for diverse paths.
Best for: math, logic puzzles, commonsense reasoning, coding challenges. Not worth it for: creative writing, open-ended questions, tasks where there's no single correct answer.
🗳️ Self-Consistency — an improvement over Chain of Thought! Generate multiple different reasoning paths and choose the most frequent answer through voting. This helps avoid random errors!
A store had 12 apples and 8 oranges. They sold 5 fruits. If 3 of the sold fruits were apples, how many oranges are left?
Total fruits: 12 + 8 = 20. Sold 5, left 20 - 5 = 15. There were 8 oranges, so... about 6?
6 oranges
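⚠️ Error in reasoning! The chain never works out how many oranges were sold (5 - 3 = 2, so 8 - 2 = 6 remain); it arrives at 6 only by guessing.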
Create 5+ different reasoning paths with temperature > 0
Extract final answer from each path
Choose the most frequent answer (majority vote)
- Multi-step math problems
- Logic reasoning tasks
- When high accuracy is critical (medicine, finance)
- Questions where one error changes the whole answer
Self-Consistency works because even if one reasoning path contains an error, the correct answer appears more often in other paths. It's "wisdom of crowds" for LLMs! Downside: it requires more tokens (about 5× for five calls), and the size of the accuracy gain on complex problems depends on the model and the task.
How to implement Self-Consistency
Self-Consistency is NOT a special prompt! It's a method for aggregating multiple responses:
- Run the same prompt multiple times
- Use temperature > 0 for diversity
- Collect answers and choose the most common
Step 1: Base prompt with CoT
Solve the task step by step:
{task}
Show your reasoning and give the answer.
Regular Chain-of-Thought prompt. Nothing special yet.
Step 2: Generate multiple responses
Call the LLM several times with the same prompt (5 or more is typical), but with temperature > 0 (e.g., 0.7).
Each time you'll get different reasoning and possibly different answers. This is normal!
Step 3: Aggregation (code)
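from collections import Counter
# call_llm and extract_answer are placeholders for your own model client and answer parser
responses = [call_llm(prompt, temp=0.7) for _ in range(5)]  # sample 5 reasoning chains
answers = [extract_answer(r) for r in responses]  # keep only each chain's final answer
final_answer = Counter(answers).most_common(1)[0][0]  # majority vote
Use Counter to count votes. The most frequent answer wins!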
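The snippet above leaves `extract_answer` undefined. One possible version for numeric tasks is sketched here; it is a heuristic, not a standard function. It assumes the final answer is the last number in the response, and it normalizes formatting so that "408" and "408.0" count as the same vote.

```python
import re
from collections import Counter

def extract_answer(response):
    """Return the last number in the response as a normalized string, or None."""
    # Heuristic: digits with optional thousands separators and a decimal part.
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", response)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    # Normalize so that 408 and 408.0 are grouped together.
    return str(int(value)) if value.is_integer() else str(value)

# Hand-written stand-ins for model responses.
responses = [
    "17 × 20 = 340, 17 × 4 = 68, total 408.",
    "Answer: 408.0",
    "I get 398",
    "I am not sure.",
]
answers = [a for a in map(extract_answer, responses) if a is not None]
print(Counter(answers).most_common(1)[0][0])  # 408
```

Responses with no parsable answer are dropped before voting here. A stricter setup might instead ask the model to end with a fixed marker such as "Answer:" and parse only what follows it.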
Concrete prompt example
Task: What is 17 × 24?
Solve step by step:
1. Break down into simple operations
2. Calculate each one
3. Give the final answer
Answer:
Run this prompt 5 times with temp=0.7. Collect answers. Choose the most frequent.
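Putting the prompt and the voting loop together might look like the sketch below. No real model is called: `fake_llm` is a stand-in that replays hand-written completions, and the `call_llm(prompt, temperature)` signature is an assumption about your client, not a real API. The sketch also assumes each completion ends with the final answer on its own line.

```python
from collections import Counter
from itertools import cycle

PROMPT_TEMPLATE = (
    "Task: {task}\n"
    "Solve step by step:\n"
    "1. Break down into simple operations\n"
    "2. Calculate each one\n"
    "3. Give the final answer\n"
    "Answer:"
)

def self_consistency(task, call_llm, n_samples=5, temperature=0.7):
    """Sample n_samples completions and return (winning answer, vote counts)."""
    prompt = PROMPT_TEMPLATE.format(task=task)
    finals = []
    for _ in range(n_samples):
        completion = call_llm(prompt, temperature)
        # Assumption: the final answer is the last non-empty line.
        lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
        if lines:
            finals.append(lines[-1])
    if not finals:
        raise ValueError("no completion contained an answer")
    counts = Counter(finals)
    return counts.most_common(1)[0][0], counts

# Stand-in for a real model client: replays canned completions in order.
_canned = cycle([
    "17 × 24 = 17 × 20 + 17 × 4 = 340 + 68\n408",
    "24 × 17 = 24 × 10 + 24 × 7 = 240 + 168\n408",
    "17 × 24 = 17 × 25 - 17 = 425 - 17\n408",
    "17 × 20 = 340, 17 × 4 = 58\n398",  # deliberate slip: 17 × 4 is 68
    "20 × 24 - 3 × 24 = 480 - 72\n408",
])

def fake_llm(prompt, temperature):
    return next(_canned)

winner, counts = self_consistency("What is 17 × 24?", fake_llm)
print(winner, dict(counts))  # 408 {'408': 4, '398': 1}
```

Passing the client in as a function keeps the voting logic independent of any particular provider, so the same loop can be tested with canned data and then pointed at a real model.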
Summary:
- Self-Consistency = regular CoT prompt + multiple calls + voting
- The prompt does NOT change! Only the number of calls and aggregation changes
- Temperature > 0 is needed for answer diversity
- Final answer = most popular through simple counting
The technique is described in the paper "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., 2023). Compared to single-path CoT, the paper reports gains of +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA, +6.4% on StrategyQA, and +3.9% on ARC-challenge.
Frequently asked questions
What is Self-Consistency in prompting?
Self-Consistency is a technique where you generate multiple reasoning paths (typically 5–40) for the same prompt at a higher temperature, then pick the most common final answer by majority vote. It extends Chain-of-Thought: instead of one reasoning chain, the model tries many, and the answer they agree on wins.
How is Self-Consistency different from Chain-of-Thought?
Chain-of-Thought asks the model for a single step-by-step solution. Self-Consistency goes further by sampling many such chains at non-zero temperature and keeping the answer that appears most often. So Self-Consistency is a layer on top of CoT that boosts reliability through voting, at the cost of roughly N times more tokens.
How many samples does Self-Consistency need?
In practice 5–10 samples are enough to capture most of the accuracy gain. You can go up to 40, but returns diminish quickly past ~10 while token cost and latency grow linearly. Temperature is usually set between 0.5 and 1.0 so each run explores a slightly different reasoning path.
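A back-of-the-envelope calculation shows why returns diminish. The sketch below uses a deliberately simplified model: only two possible answers, each sample independently correct with probability `p`, and an odd number of samples so there are no ties. Real samples are correlated, so the numbers are illustrative only. It requires Python 3.8+ for `math.comb`.

```python
from math import comb

def majority_correct_prob(p, n):
    """Probability that more than half of n independent samples are correct."""
    need = n // 2 + 1  # smallest number of correct samples that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

# Cost grows linearly with n, while accuracy under this model rises ever more slowly.
for n in (1, 5, 11, 41):
    print(f"n={n:>2}  cost={n}x  P(majority correct)={majority_correct_prob(0.6, n):.3f}")
```

Under these assumptions accuracy keeps rising with more samples, but each additional sample buys less than the one before while costing just as much.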
When does Self-Consistency fail?
It is useless on open-ended tasks like creative writing or summarization, where there is no single correct answer to vote on. It also cannot fix a systematic error: if the model is confidently wrong the same way every time, the majority simply votes for the wrong answer. It works best on math, logic, and coding problems with verifiable answers.
Try it yourself
There are 23 apples in a basket. If you divide them equally among 5 children, how many apples will be left?
23 ÷ 5 = 4.6, so 3 apples will be left.
Method 1 (division with remainder): 23 ÷ 5 = 4 remainder 3 Left over: 3 apples
Method 2 (subtraction): 5×4 = 20 apples distributed 23 - 20 = 3 apples left
Method 3 (verification): 5×4 + 3 = 23 ✓ Left over: 3 apples
Final answer: 3 apples (all three methods gave the same result)
Self-Consistency reaches the answer along several independent reasoning paths. When all paths give the same result, we can be more confident it's correct.