LLM Benchmarks
MMLU, HumanEval & more
The Problem: Every AI company claims their model is the best. Marketing highlights cherry-picked benchmarks. How can you objectively compare models and know which actually performs better for your use case?
The Solution: Evaluate Models Systematically
A benchmark is a standardized test that measures model performance on a fixed set of tasks using a fixed scoring rule. The key word is standardized: every model sees the same questions and is graded the same way, so a score becomes comparable across labs and across time. Think of crash tests for cars — an independent body runs the same impact on every vehicle, so a five-star rating means something no matter who built the car. Benchmarks come in families: knowledge tests like MMLU (multiple-choice across 57 subjects), reasoning tests like GPQA Diamond and AIME (hard science and competition math), agentic/coding tests like SWE-bench Verified (resolve real GitHub issues), and human-preference rankings like Chatbot Arena, where people vote on anonymous responses and the votes are turned into an Elo rating. Each family measures something different — a single number never captures a model.
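To make the "votes are turned into an Elo rating" step concrete, here is a minimal sketch of the classic Elo update applied to a list of pairwise votes. The starting rating of 1000, the K-factor of 32, and the model names in the usage note are illustrative assumptions. A live leaderboard may fit its ratings with a different statistical procedure.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from the loser to the winner after one vote."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)  # an upset (low e_w) moves more points
    ratings[winner] += delta
    ratings[loser] -= delta


def rate(votes, start: float = 1000.0, k: float = 32.0) -> dict:
    """votes is an iterable of (winner, loser) pairs, processed in order."""
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        update_elo(ratings, winner, loser, k)
    return dict(ratings)
```

Calling `rate([("model_a", "model_b")] * 3 + [("model_b", "model_a")])` ranks the hypothetical `model_a` above `model_b`, and the two ratings still sum to 2000 because every update only transfers points.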
How scoring actually works
Most knowledge and math benchmarks use exact-match accuracy: the model's answer is parsed and compared to a gold answer, and you report the percent correct. Coding benchmarks like SWE-bench run the model's patch against the project's real test suite — the task "passes" only if the tests go green, which is much harder to fake than picking a letter. Two details quietly change results: the prompt format (zero-shot vs few-shot, with or without chain-of-thought) and whether a model is allowed to "think" before answering. That is why reasoning models such as o3 jump on AIME but barely move on simple recall — they spend extra tokens deliberating. When you read a leaderboard, always check the eval settings, not just the headline percentage. These scores feed directly into model selection and can reveal where targeted fine-tuning would help.
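Exact-match scoring can be sketched in a few lines. The normalization rule below (lowercase, trim whitespace, strip surrounding punctuation) is an assumption made for illustration. Real benchmarks define their own answer-parsing rules, and those rules are part of the eval settings worth checking.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, and strip surrounding punctuation so 'B.' matches 'b'."""
    return re.sub(r"^[\W_]+|[\W_]+$", "", answer.strip().lower())


def exact_match_accuracy(predictions, gold) -> float:
    """Fraction of predictions that equal the gold answer after normalization."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one gold answer")
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return correct / len(gold)
```

For example, `exact_match_accuracy(["B.", " 42", "c"], ["b", "42", "D"])` counts the first two answers as correct and the third as wrong, giving two out of three.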
The big pitfall: contamination
The number-one trap is benchmark contamination: the test questions (or close paraphrases) leaked into the model's training data, so it "knows the answers" instead of reasoning them out. A model can post a stellar MMLU score and still fail your actual task. Worked example: suppose you're choosing a model for a medical-support chatbot. Model A tops the public MMLU leaderboard at 90%; Model B sits at 86%. Instead of trusting the leaderboard, you build 80 real patient questions with doctor-verified answers and run both. On your private set Model B answers 65 of 80 correctly (about 81%) and Model A only 54 of 80 (about 68%). In this scenario, A had memorized public exam questions that don't resemble your messy real ones. The lesson: public benchmarks are a filter to shortlist candidates, but the decision must rest on a private, domain-specific eval that no model could have trained on.
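One rough way to spot verbatim leakage is to measure word n-gram overlap between a test question and a body of text you suspect it appears in, such as a public exam dump. The sketch below is a heuristic under stated assumptions: the window size of 8 words is arbitrary, whitespace tokenization is crude, and the check cannot detect paraphrases or inspect a closed model's actual training data.

```python
def ngrams(text: str, n: int) -> set:
    """All word n-grams in text, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question: str, corpus_text: str, n: int = 8) -> float:
    """Share of the question's n-grams that also occur in corpus_text."""
    q = ngrams(question, n)
    if not q:  # question shorter than n words
        return 0.0
    return len(q & ngrams(corpus_text, n)) / len(q)
```

A ratio near 1.0 means the question appears almost word for word in the suspect text. A ratio of 0.0 only rules out verbatim copies, not reworded ones.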
Think of the evaluation as a test drive you run yourself:
- 1. Define task categories from YOUR use case: What does the model need to do? Summarize reports? Answer medical questions? Write code? Debug existing code?
- 2. Check public benchmarks (beware contamination): Use GPQA, AIME, SWE-bench as filters, but remember that models may have trained on test data. AIME and SWE-bench tend to be harder to game: AIME publishes new problems every year, and SWE-bench only counts a patch if the project's real tests pass
- 3. Collect 50-100 gold-standard pairs: Build input/output pairs with expert-verified correct answers — your ground truth dataset. This is the only evaluation that cannot be contaminated
- 4. Score models on YOUR data: Run each candidate on your dataset. Use LLM-as-judge (a stronger model grades weaker ones) for scalable evaluation. Measure accuracy, latency, and cost
- 5. Compare cost per quality point: A cheaper model at 90% accuracy may beat an expensive one at 95% — calculate the ROI. Reasoning models cost 3-5x more but may be worth it for math/logic tasks
- 6. Re-evaluate quarterly: Models improve fast — the best choice today may not be best in 3 months. Keep your evaluation pipeline automated
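Steps 4 and 5 above can be sketched as a small harness that runs each candidate over the gold pairs and reports accuracy, latency, and cost per accuracy point. Everything named here is hypothetical: the `model` callables stand in for real API clients, the flat per-call price is a simplification (real pricing is usually per token), and case-insensitive exact match stands in for whatever scoring rule or judge fits your task.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalResult:
    name: str
    accuracy: float        # fraction of gold pairs answered correctly
    avg_latency_s: float   # mean wall-clock seconds per call
    total_cost_usd: float  # assumed flat price times number of calls

    @property
    def cost_per_point(self) -> float:
        """Dollars spent per accuracy percentage point (lower is better)."""
        points = self.accuracy * 100
        return self.total_cost_usd / points if points else float("inf")


def evaluate(name: str,
             model: Callable[[str], str],
             gold_pairs: List[Tuple[str, str]],
             cost_per_call_usd: float) -> EvalResult:
    """Run one candidate over every (question, expected_answer) pair."""
    if not gold_pairs:
        raise ValueError("gold_pairs must not be empty")
    correct = 0
    total_latency = 0.0
    for question, expected in gold_pairs:
        start = time.perf_counter()
        answer = model(question)
        total_latency += time.perf_counter() - start
        correct += int(answer.strip().lower() == expected.strip().lower())
    n = len(gold_pairs)
    return EvalResult(name, correct / n, total_latency / n, cost_per_call_usd * n)


# Hypothetical stand-ins for real model clients.
GOLD = [("2+2", "4"), ("Capital of France?", "Paris")]
cheap_model = lambda q: {"2+2": "4"}.get(q, "unknown")
strong_model = lambda q: {"2+2": "4", "Capital of France?": "Paris"}.get(q, "unknown")

results = [
    evaluate("cheap", cheap_model, GOLD, cost_per_call_usd=0.001),
    evaluate("strong", strong_model, GOLD, cost_per_call_usd=0.01),
]
```

Sorting `results` by `cost_per_point` gives one view of ROI. Whether the cheaper candidate is acceptable still depends on the minimum accuracy your use case can tolerate.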
Key Benchmarks to Know
- MMLU: Multiple-choice knowledge across 57 subjects. Saturated — most frontier models score >88%, making it less useful for differentiation
- GPQA Diamond: PhD-level science questions. Harder than MMLU and still differentiates frontier models well. GPT-5 and Claude Opus 4.5 lead at ~87%
- AIME 2024: Real math competition problems. Reasoning models (o3: 91.6%, DeepSeek R1: 86.7%) dominate, while non-reasoning models score 50-74%. This is the biggest gap between reasoning and non-reasoning models
- SWE-bench Verified: Resolve real GitHub issues in real codebases. The most practical coding benchmark. Claude Opus 4.5 leads at 80.9% — far ahead of open-source models (~40-50%)
- Chatbot Arena (Elo): Human preference ranking. Users compare anonymous model responses and vote, and more than 6M votes are aggregated into Elo ratings. The most reliable "vibes" benchmark: it measures what people actually prefer
- Custom Benchmarks: Public benchmarks show general capability. Custom benchmarks show if the model works for YOUR task. The model topping MMLU may not be best for your medical chatbot. Always build domain-specific evaluation
Fun Fact: On AIME 2024, reasoning model o3 scores 91.6% — outperforming 99% of human competitors. But on SWE-bench (real code), it scores only 61.2% while Claude Opus 4.5 reaches 80.9%. There is no universal "best" model — only the best model for YOUR task.
How to Read Benchmarks
- Higher is better: But differences under 2 percentage points are usually not significant in practice
- Context matters: MMLU tests knowledge, SWE-bench tests real coding. Choose based on your task
- No model wins all: o3 leads math, Claude leads coding, GPT-5 leads knowledge. Pick for YOUR task
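Whether a gap between two scores is meaningful depends on how many questions the test has. The sketch below uses the normal approximation to the binomial to put a 95% margin of error on an accuracy score. The approximation and the interval-overlap check are simplifying assumptions, and the overlap check is deliberately conservative.

```python
import math


def accuracy_margin(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of the approximate 95% interval for an accuracy on n items."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


def gap_is_clear(acc_a: float, acc_b: float, n: int, z: float = 1.96) -> bool:
    """True only if the two intervals do not overlap (a conservative check)."""
    margins = accuracy_margin(acc_a, n, z) + accuracy_margin(acc_b, n, z)
    return abs(acc_a - acc_b) > margins
```

On an 80-question private set, a score of 81% carries a margin of about 8.6 percentage points, so small private evals can only separate models that differ by a wide gap. A 2-point difference on a 1,000-question benchmark does not pass `gap_is_clear` either.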
Benchmark Contamination
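Models may have trained on benchmark data, inflating scores. AIME and SWE-bench tend to be harder to contaminate: AIME publishes new problems every year, and SWE-bench only counts a patch if the project's real tests pass. Always test on YOUR data. A private set that no model could have trained on is the only benchmark that cannot be gamed.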
Frequently asked questions
What is an LLM benchmark in simple terms?
A benchmark is a standardized test that measures a model's quality on a fixed set of tasks using a fixed scoring rule. Every model gets the same questions and is graded the same way, so scores become comparable across labs and over time — much like crash tests for cars.
What's the difference between MMLU, GPQA, AIME and SWE-bench?
MMLU is a multiple-choice knowledge test across 57 subjects, now nearly saturated (>88% for frontier models). GPQA Diamond asks PhD-level science questions and still differentiates leaders well. AIME is competition math dominated by reasoning models. SWE-bench Verified is the most practical coding benchmark: the model must resolve real GitHub issues, and the patch only counts if the project's real tests pass.
Why can a model with a high benchmark score still fail on my task?
The main reason is contamination: the test questions (or paraphrases) leaked into training data, so the model recalls answers instead of reasoning. Public benchmarks also measure general ability, not your domain. Build 50–100 gold-standard input/output pairs for your task and evaluate candidates on that private set — it cannot be contaminated.
How do I compare several LLMs for my own project?
Define task categories from your use case, use public benchmarks (GPQA, AIME, SWE-bench) as a first filter, collect 50–100 expert-verified pairs, and run each model on that dataset. Use LLM-as-judge for scale. Measure accuracy, latency and cost, compute ROI (cost per quality point), and re-evaluate roughly every quarter since models change fast.
Try it yourself
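Goal: evaluate model quality systematically instead of relying on subjective assessment. The example task is article summarization.
Subjective approach (avoid): "Try a few articles and see if you like the results. If the summaries are good, the model fits."
Systematic approach: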
1. Test dataset: 100 articles
- 5 categories × 20 articles: politics, tech, sports, economics, science
- Golden summaries: written by experts (ground truth)
2. Metrics:
- Auto: ROUGE-L ≥ 0.35, BERTScore ≥ 0.85, length 40-80 tokens
- Manual (sample 20%): factual accuracy (1-5), completeness (1-5), readability (1-5)
3. Pass/fail criteria:
- ROUGE-L < 0.30 → FAIL
- Factual errors > 10% → FAIL
- BERTScore < 0.80 → WARNING
4. Process:
- CI: auto-metrics on every PR changing prompts
- Weekly: manual evaluation of 20 random production requests
- Monthly: full run of 100 articles when switching models
Summary: a systematic benchmark combines a dataset with golden answers, auto-metrics (ROUGE, BERTScore), manual evaluation, and pass/fail thresholds. Without this, changing the model or the prompt is a gamble.
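The pass/fail criteria above can be wired into a small gate suitable for CI. This is a sketch under stated assumptions: `rouge_l_f1` is a simplified ROUGE-L (longest common subsequence over whitespace tokens, F1 form) rather than a reference implementation, and BERTScore and the factual error rate are assumed to be computed elsewhere and passed in as numbers.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def gate(rouge_l: float, bert_score: float, factual_error_rate: float) -> str:
    """Apply the thresholds from the criteria list: FAIL, WARNING, or PASS."""
    if rouge_l < 0.30 or factual_error_rate > 0.10:
        return "FAIL"
    if bert_score < 0.80:
        return "WARNING"
    return "PASS"
```

In a CI job, averaging `rouge_l_f1` over the 100 golden summaries and passing the result to `gate` would block a prompt change that drops ROUGE-L below 0.30, while a low BERTScore alone only raises a warning.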