Lesson 2

LLM Benchmarks

MMLU, HumanEval & more

The Problem: Every AI company claims their model is the best. Marketing highlights cherry-picked benchmarks. How can you objectively compare models and know which actually performs better for your use case?

The Solution: Evaluate Models Systematically

A benchmark is a standardized test that measures model performance on a fixed set of tasks using a fixed scoring rule. The key word is standardized: every model sees the same questions and is graded the same way, so a score becomes comparable across labs and across time. Think of crash tests for cars — an independent body runs the same impact on every vehicle, so a five-star rating means something no matter who built the car. Benchmarks come in families: knowledge tests like MMLU (multiple-choice across 57 subjects), reasoning tests like GPQA Diamond and AIME (hard science and competition math), agentic/coding tests like SWE-bench Verified (resolve real GitHub issues), and human-preference rankings like Chatbot Arena, where people vote on anonymous responses and the votes are turned into an Elo rating. Each family measures something different — a single number never captures a model.

How scoring actually works

Most knowledge and math benchmarks use exact-match accuracy: the model's answer is parsed and compared to a gold answer, and you report the percent correct. Coding benchmarks like SWE-bench run the model's patch against the project's real test suite — the task "passes" only if the tests go green, which is much harder to fake than picking a letter. Two details quietly change results: the prompt format (zero-shot vs few-shot, with or without chain-of-thought) and whether a model is allowed to "think" before answering. That is why reasoning models such as o3 jump on AIME but barely move on simple recall — they spend extra tokens deliberating. When you read a leaderboard, always check the eval settings, not just the headline percentage. These scores feed directly into model selection and can reveal where targeted fine-tuning would help.
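To make exact-match scoring concrete, here is a minimal sketch in Python. The `normalize` helper and the sample answers are illustrative assumptions, not part of any official benchmark harness; real harnesses use benchmark-specific answer parsing (for example, keeping signs and decimal points in math answers).

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, and drop punctuation so 'C.' and ' c ' compare equal.

    Assumption: this is fine for letter choices, but too aggressive for
    numeric answers such as '-3' or '1.5'.
    """
    return re.sub(r"[^a-z0-9]+", " ", answer.lower()).strip()


def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Return the percent of predictions equal to the gold answer after normalization."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold set is empty")
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Hypothetical model outputs for four multiple-choice questions.
preds = ["B", " c.", "A", "D"]
gold = ["B", "C", "B", "D"]
print(exact_match_accuracy(preds, gold))  # 3 of 4 match
```

The parsing step is where eval settings quietly matter: a model that answers "The answer is C" needs a different extraction rule than one that answers "C", and the two rules can produce different scores from the same outputs.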

The big pitfall: contamination

The number-one trap is benchmark contamination: the test questions (or close paraphrases) leaked into the model's training data, so it "knows the answers" instead of reasoning them out. A model can post a stellar MMLU score and still fail your actual task. Worked example: suppose you're choosing a model for a medical-support chatbot. Model A tops the public MMLU leaderboard at 90%; Model B sits at 86%. Instead of trusting the leaderboard, you build 80 real patient questions with doctor-verified answers and run both. On your private set Model B answers 65 of 80 correctly (about 81%) and Model A only 54 of 80 (about 68%). The likely explanation is that A had memorized public exam questions that don't resemble your messy real ones. The lesson: public benchmarks are a filter to shortlist candidates, but the decision must rest on a private, domain-specific eval that no model could have trained on.

Think of it like a test drive evaluation:

1. Define task categories from YOUR use case: What does the model need to do? Summarize reports? Answer medical questions? Write code? Debug existing code?
2. Check public benchmarks (beware contamination): Use GPQA, AIME, SWE-bench as filters, but remember models may have trained on test data. AIME and SWE-bench are harder (though not impossible) to game because they use real competition problems and real GitHub issues
3. Collect 50-100 gold-standard pairs: Build input/output pairs with expert-verified correct answers; this is your ground-truth dataset. As long as it stays private, it is the one evaluation that public training data cannot have contaminated
4. Score models on YOUR data: Run each candidate on your dataset. Use LLM-as-judge (a stronger model grades weaker ones) for scalable evaluation. Measure accuracy, latency, and cost
5. Compare cost per quality point: A cheaper model at 90% accuracy may beat an expensive one at 95% if the extra 5 points don't justify the price, so calculate the ROI. Reasoning models can cost 3-5x more but may be worth it for math/logic tasks
6. Re-evaluate quarterly: Models improve fast — the best choice today may not be best in 3 months. Keep your evaluation pipeline automated
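Steps 3 to 5 can be sketched as a small harness. Everything named here is a hypothetical stand-in: `evaluate`, `EvalResult`, the stub "models" (plain functions backed by dictionaries), the four gold pairs, and the per-call prices. In practice the model callable would wrap a real API client, and the gold set would hold 50-100 expert-verified pairs.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    name: str
    accuracy_pct: float    # percent of gold answers matched
    mean_latency_s: float  # average wall-clock seconds per call
    cost_usd: float        # total cost of the run

    @property
    def cost_per_point(self) -> float:
        """Dollars spent per accuracy percentage point (lower is better)."""
        if self.accuracy_pct == 0:
            return float("inf")
        return self.cost_usd / self.accuracy_pct


def evaluate(name: str, model: Callable[[str], str],
             gold_pairs: list[tuple[str, str]], usd_per_call: float) -> EvalResult:
    """Run one candidate over the gold set and collect accuracy, latency, and cost."""
    if not gold_pairs:
        raise ValueError("gold_pairs is empty")
    correct = 0
    started = time.perf_counter()
    for question, expected in gold_pairs:
        if model(question).strip().lower() == expected.strip().lower():
            correct += 1
    elapsed = time.perf_counter() - started
    n = len(gold_pairs)
    return EvalResult(name, 100.0 * correct / n, elapsed / n, usd_per_call * n)


# Hypothetical gold set (a real one would hold 50-100 expert-verified pairs).
GOLD = [("2+2?", "4"), ("Capital of France?", "Paris"),
        ("Opposite of hot?", "cold"), ("H2O is?", "water")]

# Stub models: dictionaries standing in for real API calls.
cheap_answers = {"2+2?": "4", "Capital of France?": "paris",
                 "Opposite of hot?": "warm", "H2O is?": "water"}
pricey_answers = {"2+2?": "4", "Capital of France?": "Paris",
                  "Opposite of hot?": "cold", "H2O is?": "water"}

results = [
    evaluate("cheap", lambda q: cheap_answers.get(q, ""), GOLD, usd_per_call=0.001),
    evaluate("pricey", lambda q: pricey_answers.get(q, ""), GOLD, usd_per_call=0.005),
]
for r in sorted(results, key=lambda r: r.cost_per_point):
    print(f"{r.name}: {r.accuracy_pct:.0f}% accurate, ${r.cost_per_point:.6f} per point")
```

Cost per point is only one way to rank candidates. If the task has a hard accuracy floor (a medical chatbot, say), it makes more sense to filter out models below the floor first and compare cost only among the survivors.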

Key Benchmarks to Know

MMLU: Multiple-choice knowledge across 57 subjects. Saturated — most frontier models score >88%, making it less useful for differentiation
GPQA Diamond: PhD-level science questions. Harder than MMLU and still differentiates frontier models well. GPT-5 and Claude Opus 4.5 lead at ~87%
AIME 2024: Real math competition problems. Reasoning models (o3: 91.6%, DeepSeek R1: 86.7%) dominate. Regular models score 50-74%. The biggest gap between reasoning and non-reasoning models
SWE-bench Verified: Resolve real GitHub issues in real codebases. The most practical coding benchmark. Claude Opus 4.5 leads at 80.9% — far ahead of open-source models (~40-50%)
Chatbot Arena (Elo): Human preference ranking. Users compare anonymous model responses and vote, and Elo ratings are computed from 6M+ votes. The most reliable "vibes" benchmark: it measures what people actually prefer
Custom Benchmarks: Public benchmarks show general capability. Custom benchmarks show if the model works for YOUR task. The model topping MMLU may not be best for your medical chatbot. Always build domain-specific evaluation
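To show how pairwise votes can become a rating, here is a sketch of the classic Elo update. It is the textbook formula, assuming a starting rating of 1000 and a K-factor of 32; the production Chatbot Arena leaderboard may use a different statistical method, and the model names and votes below are made up.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def record_vote(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from loser to winner; upsets move more points."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and votes; each tuple is (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because every vote moves the same number of points in opposite directions, the total rating pool stays constant: a rating only means something relative to the other models in the pool.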

Fun Fact: On AIME 2024, reasoning model o3 scores 91.6% — outperforming 99% of human competitors. But on SWE-bench (real code), it scores only 61.2% while Claude Opus 4.5 reaches 80.9%. There is no universal "best" model — only the best model for YOUR task.


MMLU (Massive Multitask Language Understanding)

Tests knowledge across 57 subjects from STEM to humanities. Saturated: most frontier models score >88%

Measures: General knowledge (saturated)

o3 (leader): 91.2%
GPT-5: 90.2%
Claude Opus 4.5: 89.8%
Gemini 2.5 Pro: 89.5%
Claude Sonnet 4: 88.7%
DeepSeek V3: 88.5%
Qwen 3: 86.2%
DeepSeek R1: 85.5%

How to Read Benchmarks

Higher is better: But differences under 2 percentage points are usually not significant in practice
Context matters: MMLU tests knowledge, SWE-bench tests real coding. Choose based on your task
No model wins all: o3 leads math (AIME), Claude leads coding (SWE-bench), GPT-5 leads science knowledge (GPQA). Pick for YOUR task
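One way to see why small gaps are often noise is to put a rough confidence interval around each score. The sketch below uses the normal approximation for a proportion and assumes each question is an independent trial; the function name and the 1,000-question test size are hypothetical, chosen only for illustration.

```python
import math


def accuracy_margin(accuracy_pct: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error, in percentage points, for an accuracy score.

    Normal approximation to the binomial: margin = z * sqrt(p * (1 - p) / n).
    """
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_questions)


# Hypothetical benchmark with 1,000 questions.
for score in (91.2, 90.2):
    margin = accuracy_margin(score, 1000)
    print(f"{score}% -> between {score - margin:.1f}% and {score + margin:.1f}%")
```

On a test of that size, a model scoring around 90% carries a margin of nearly 2 points, so two scores one point apart have heavily overlapping intervals. The margin shrinks with the square root of the number of questions, which is also why a private set of 50-100 pairs can only separate models that differ by a wide gap.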

Benchmark Contamination

Models may have trained on benchmark data, inflating scores. AIME and SWE-bench are harder (though not impossible) to contaminate because they use real competition problems and real GitHub issues. Always test on YOUR data: a private set is the only benchmark that models cannot have seen in training.

Best for Code: Claude (SWE-bench)
Best for Knowledge: GPT-5 (GPQA)
Best for Math: o3 / DeepSeek R1
Best Open Source: DeepSeek V3 / Qwen 3

Frequently asked questions

What is an LLM benchmark in simple terms?

A benchmark is a standardized test that measures a model's quality on a fixed set of tasks using a fixed scoring rule. Every model gets the same questions and is graded the same way, so scores become comparable across labs and over time — much like crash tests for cars.

What's the difference between MMLU, GPQA, AIME and SWE-bench?

MMLU is a multiple-choice knowledge test across 57 subjects, now nearly saturated (>88% for frontier models). GPQA Diamond asks PhD-level science questions and still differentiates leaders well. AIME is competition math dominated by reasoning models. SWE-bench Verified is the most practical coding benchmark: the model must resolve real GitHub issues, and the patch only counts if the project's real tests pass.

Why can a model with a high benchmark score still fail on my task?

The main reason is contamination: the test questions (or paraphrases) leaked into training data, so the model recalls answers instead of reasoning. Public benchmarks also measure general ability, not your domain. Build 50–100 gold-standard input/output pairs for your task and evaluate candidates on that private set. As long as it stays private, it cannot be contaminated.

How do I compare several LLMs for my own project?

Define task categories from your use case, use public benchmarks (GPQA, AIME, SWE-bench) as a first filter, collect 50–100 expert-verified pairs, and run each model on that dataset. Use LLM-as-judge for scale. Measure accuracy, latency and cost, compute ROI (cost per quality point), and re-evaluate roughly every quarter since models change fast.

Try it yourself

Technique comparison (pre-recorded responses)

Task (Beginner, Analysis): Evaluate model quality systematically instead of relying on subjective assessment

Without technique

Prompt

We're testing a new model for article summarization. How to evaluate quality?

Response

Try a few articles and see if you like the results. If summaries are good — the model fits.

Tokens: 25/30

Time: 350 ms

With technique (benchmarks)

Prompt

You are an ML engineer. Design a benchmark for evaluating a summarization model.

Requirements:
1. Test dataset: how many examples, what categories
2. Metrics: automated + manual
3. Pass/fail criteria: threshold values
4. Process: how to run, how often

Context: news article summarization, 500-2000 words → 2-3 sentences.

Response

1. Test dataset: 100 articles

5 categories × 20 articles: politics, tech, sports, economics, science
Golden summaries: written by experts (ground truth)

2. Metrics:

Auto: ROUGE-L ≥ 0.35, BERTScore ≥ 0.85, length 40-80 tokens
Manual (sample 20%): factual accuracy (1-5), completeness (1-5), readability (1-5)

3. Pass/fail criteria:

ROUGE-L < 0.30 → FAIL
Factual errors > 10% → FAIL
BERTScore < 0.80 → WARNING

4. Process:

CI: auto-metrics on every PR changing prompts
Weekly: manual evaluation of 20 random production requests
Monthly and whenever switching models: full run of all 100 articles

👁️ "I like it" is not a metric. You need quantitative measures: ROUGE, BERTScore, manual evaluation

🧠 Auto-metrics catch regressions in CI; manual ones catch what automation misses (factual accuracy)

✅ Threshold values (ROUGE-L < 0.30 = FAIL) turn evaluation from subjective to objective

Tokens: 95/240

Time: 1600 ms

Why this works

Systematic benchmark: dataset with golden answers + auto-metrics (ROUGE, BERTScore) + manual evaluation + pass/fail thresholds. Without this, changing model or prompt is a gamble.
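As an illustration of an automated metric with a pass/fail threshold, here is a simplified ROUGE-L sketch. It is an approximation under stated assumptions: whitespace tokenization, no stemming, and a balanced F1, so its numbers will not exactly match an official ROUGE implementation. The 0.30 and 0.35 cut-offs come from the demo response above; the "BELOW TARGET" label for the band between them and the sample sentences are assumptions added here.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def verdict(score: float) -> str:
    """Apply the demo thresholds: below 0.30 fails, 0.35 and above passes."""
    if score < 0.30:
        return "FAIL"
    if score < 0.35:
        return "BELOW TARGET"  # assumption: the demo does not name this band
    return "PASS"


# Hypothetical golden summary and model summary.
reference = "the central bank raised interest rates by half a point"
candidate = "the bank raised rates by half a point"
score = rouge_l_f1(candidate, reference)
print(f"ROUGE-L F1 = {score:.3f} -> {verdict(score)}")
```

A check like this is cheap enough to run in CI on every prompt change. It measures word overlap only, so a fluent summary with a wrong number can still pass, which is why the demo pairs it with manual factual-accuracy review.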


Related lessons: Model Selection, Fine Tuning

This lesson is part of a structured LLM course.
