Reading Benchmarks Critically
How to interpret LLM benchmarks without being misled. Covers major benchmarks (MMLU, HumanEval, MATH, Arena Elo), what they actually test, benchmark contamination, and how to build a task-specific benchmark of your own that reflects your production workload.
Quick Reference
- →MMLU: 57 academic subjects, multiple choice -- tests breadth of knowledge, not depth
- →HumanEval: 164 Python coding problems -- tests basic code generation, not real-world software engineering
- →MATH: Competition math problems -- tests formal reasoning, far from typical AI engineering tasks
- →Chatbot Arena Elo: human preference rankings from blind comparisons -- most ecologically valid
- →Benchmark contamination (training on test data) inflates scores -- treat all numbers skeptically
- →Your own task-specific benchmark is more valuable than any public benchmark
Major Benchmarks Explained
| Benchmark | What it tests | Format | Top scores (early 2025) |
|---|---|---|---|
| MMLU | Academic knowledge across 57 subjects | Multiple choice (4 options) | GPT-5.4: ~88%, Claude Sonnet 4.6: ~89%, Gemini Ultra: ~90% |
| HumanEval | Python code generation (164 problems) | Write function from docstring | GPT-5.4: ~92%, Claude Sonnet 4.6: ~93%, DeepSeek V3.2: ~90% |
| MATH | Competition mathematics (12.5K problems) | Multi-step problems with worked solutions | o3: ~96%, Claude Sonnet 4.6: ~78%, GPT-5.4: ~76% |
| GSM8K | Grade-school math word problems | Arithmetic word problems | Most frontier models: >95% |
| GPQA | Graduate-level science questions | Multiple choice, expert-level | o3: ~80%, GPT-5.4: ~55% |
| Chatbot Arena Elo | Human preference via blind comparison | Users compare two model outputs | Most trusted ranking of overall quality |
LMSYS Chatbot Arena collects blind A/B comparisons from real users. Because the model identities are hidden, it avoids brand bias, and the resulting Elo ratings correlate well with real-world usefulness. As of early 2025, the top Elo models are o3, Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro -- all within a narrow band.
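Arena's published leaderboard is fit with a Bradley-Terry model over all battles at once, but the intuition is the classic sequential Elo update. A minimal, illustrative sketch (the K-factor of 32 is a conventional choice, not Arena's):

```python
# Illustrative Elo update for blind A/B comparisons.
# Real leaderboards fit all battles jointly; this is the
# sequential approximation, shown for intuition only.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one blind comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * (e_a - s_a)

# Two models start equal; A wins one battle.
ra, rb = elo_update(1200.0, 1200.0, a_won=True)
```

With equal starting ratings the winner gains exactly `k / 2` points, which is why a single battle moves the needle very little on a mature leaderboard.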
What Benchmarks Actually Test (and Don't Test)
Every benchmark is a lossy proxy for real-world capability. Understanding what they actually measure -- and what they miss -- is essential for interpreting them correctly.
| Benchmark | What it actually tests | What it misses |
|---|---|---|
| MMLU | Ability to recall academic facts in multiple-choice format | Reasoning, nuance, application of knowledge, open-ended tasks |
| HumanEval | Ability to write short, self-contained Python functions | Debugging, large codebase understanding, testing, architecture |
| MATH | Formal mathematical reasoning with known solutions | Real-world quantitative reasoning, estimation, ambiguity |
| GSM8K | Basic arithmetic in word-problem format | Discriminating power -- saturated, with most frontier models scoring >95% |
| Arena Elo | Human preference for conversational quality | Task-specific performance, consistency, safety |
- ▸Multiple-choice benchmarks reward test-taking strategies over genuine understanding
- ▸Coding benchmarks test isolated function generation, not real software engineering (architecture, debugging, testing)
- ▸Most benchmarks test single-turn performance -- they don't measure multi-turn conversation quality
- ▸No major benchmark measures instruction-following consistency across diverse, complex prompts
- ▸Latency, cost, and reliability are never measured in capability benchmarks
A 2% improvement on MMLU between Model A and Model B tells you almost nothing about which is better for your production use case. The margin of error on these benchmarks is often larger than the differences between top models. Never make model selection decisions based solely on benchmark deltas.
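You can sanity-check this yourself with the binomial standard error of an accuracy score. The sketch below assumes independent questions, which understates the true variance (prompt phrasing and sampling temperature add more noise), so the real uncertainty is even larger:

```python
import math

# Rough sketch: binomial standard error of a benchmark accuracy.
# Assumes independent questions; real variance is larger.

def accuracy_stderr(p: float, n: int) -> float:
    """Standard error of an observed accuracy p over n questions."""
    return math.sqrt(p * (1 - p) / n)

# Two models roughly 2 points apart on a 1,000-question subset:
se = accuracy_stderr(0.88, 1000)          # about 0.010 per model
# 95% interval on the *difference* of two independent scores:
diff_ci = 1.96 * math.sqrt(2) * se        # about 0.029 -- wider than the 2% gap
```

A 2-point gap on 1,000 questions sits inside the ~2.9-point confidence interval on the difference, so it cannot be distinguished from noise.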
Benchmark Contamination
Benchmark contamination occurs when test data (or data very similar to it) appears in the model's training set. This inflates scores without improving real-world capability. It is one of the most significant problems in LLM evaluation today.
- ▸MMLU questions have leaked onto the public internet and are likely in many training datasets
- ▸HumanEval solutions are published on GitHub -- models may have memorized them rather than learning to code
- ▸Some model providers have been caught training on benchmark data (intentionally or through broad web scraping)
- ▸Contamination is hard to detect: exact match is obvious, but paraphrased or structurally similar questions are not
- ▸The more popular a benchmark becomes, the more likely it is contaminated in future training runs
The most reliable benchmarks are ones you build yourself and never publish. Your test data cannot be contaminated if it has never been on the internet. This is why building a task-specific private benchmark is the single most valuable evaluation investment you can make.
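If you do need a first-pass contamination check on data you control, exact n-gram overlap against a candidate corpus is a cheap start. This is a crude sketch: it catches verbatim leaks only, and paraphrased or restructured questions slip through entirely.

```python
# Crude contamination check: fraction of a test item's word n-grams
# that appear verbatim in a corpus. Paraphrases will NOT be caught.

def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(test_item: str, corpus: str, n: int = 8) -> float:
    """1.0 means every n-gram of the test item appears in the corpus."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(corpus, n)) / len(test_grams)
```

An overlap fraction near 1.0 is a strong signal the item leaked; a low fraction proves nothing, which is exactly why private data is the only real guarantee.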
Why Your Use Case Benchmark Matters More
Public benchmarks measure general capabilities. Your application needs specific capabilities. A model that scores 90% on MMLU might score 60% on your specific extraction task, while a cheaper model might score 85%. The only way to know is to build your own benchmark.
- ▸Your benchmark should represent your actual production distribution -- not a curated set of interesting examples
- ▸Include easy cases (80% of production traffic), hard cases (15%), and adversarial edge cases (5%)
- ▸Define clear, measurable success criteria: exact match, semantic similarity, format compliance, etc.
- ▸50-100 labeled examples are enough to make confident model comparisons
- ▸Run evaluations after every prompt change, model update, or system modification
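For structured tasks, the success criteria above can be encoded as a small judge function. A sketch for a hypothetical extraction task (the field names `name` and `amount` are placeholders for your own schema):

```python
import json

# Illustrative judge for a structured-extraction task: checks
# format compliance (valid JSON with required keys) and exact match.
# REQUIRED_KEYS is a hypothetical schema -- substitute your own.

REQUIRED_KEYS = {"name", "amount"}

def judge(expected: dict, actual_raw: str) -> dict:
    """Score one model response against a labeled expected output."""
    try:
        actual = json.loads(actual_raw)
    except json.JSONDecodeError:
        return {"format_ok": False, "exact_match": False}
    if not isinstance(actual, dict):
        return {"format_ok": False, "exact_match": False}
    format_ok = REQUIRED_KEYS <= actual.keys()
    exact = format_ok and all(actual[k] == expected[k] for k in REQUIRED_KEYS)
    return {"format_ok": format_ok, "exact_match": exact}
```

Scoring format compliance separately from accuracy matters: a model that always emits valid JSON but gets fields wrong fails differently from one that rambles, and you want your benchmark to distinguish the two.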
How to Build a Task-Specific Benchmark
Building a good benchmark is an iterative process. Start simple, refine as you learn what matters.
- ▸Step 1: Collect 20 real production examples and manually label the correct output
- ▸Step 2: Define your judge function -- how do you determine if a response is correct?
- ▸Step 3: Run your current model and identify failure patterns
- ▸Step 4: Add 30-80 more examples, focusing on discovered failure modes
- ▸Step 5: Split into categories (easy/hard/edge) and tag with failure types
- ▸Step 6: Run against 3-5 candidate models and compare
- ▸Step 7: Store results with timestamps -- track quality over time as you iterate
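The steps above can be sketched as a minimal harness. Here `call_model`, the judge, and the example records are placeholders for your own client and data; the JSONL log implements step 7's append-only history:

```python
import json
import time

# Minimal eval-harness sketch. `call_model` and `judge` are
# placeholders: plug in your own API client and scoring function.

def run_eval(model_name, call_model, examples, judge,
             log_path="eval_log.jsonl"):
    """Score one model over labeled examples; append results to a log."""
    results = []
    for ex in examples:
        output = call_model(model_name, ex["input"])
        results.append({"id": ex["id"], "category": ex["category"],
                        "score": judge(ex["expected"], output)})
    record = {"model": model_name, "timestamp": time.time(),
              "accuracy": sum(r["score"] for r in results) / len(results),
              "results": results}
    # Append-only log so quality can be tracked over time (step 7).
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Tagging each example with a `category` (easy/hard/edge, step 5) means one run gives you per-category accuracy for free when you group the results.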
For open-ended tasks where exact match is not possible, use a strong LLM (GPT-5.4, Claude Sonnet 4.6) as a judge. Provide the input, expected output, and actual output, then ask the judge to score on specific criteria (accuracy, completeness, format). LLM judges correlate well with human judgment when given clear rubrics. But always validate your judge against human labels first.
If you evaluate GPT-5.4 outputs using GPT-5.4 as the judge, you get inflated scores because the model is biased toward its own style. Always use a different model (or human) as the judge.
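A judge setup along those lines might look like the sketch below. The rubric prompt and `call_judge_model` are illustrative, not any provider's API; the key ideas are a constrained output format you can parse reliably and a judge from a different model family than the one being graded:

```python
import re

# Hedged sketch of an LLM-as-judge rubric. The prompt wording is an
# example; call it via your own client, using a DIFFERENT model
# family than the one under evaluation to avoid self-preference bias.

JUDGE_PROMPT = """You are grading a model response against a reference.

Input: {input}
Expected output: {expected}
Actual output: {actual}

Score each criterion from 1 to 5: accuracy, completeness, format.
Respond with exactly one line: SCORES accuracy=<n> completeness=<n> format=<n>"""

def parse_scores(judge_reply: str) -> dict:
    """Extract the three criterion scores from the judge's reply."""
    pairs = re.findall(r"(accuracy|completeness|format)=(\d)", judge_reply)
    return {name: int(val) for name, val in pairs}
```

Constraining the judge to a single machine-parseable line avoids brittle free-text parsing, and per-criterion scores tell you *why* a response failed, not just that it did.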
Best Practices
Do
- ✓Build your own task-specific benchmark with 50-100 labeled examples from production data
- ✓Use Chatbot Arena Elo ratings as the most reliable general-purpose model ranking
- ✓Test every model candidate with your actual data before making a selection decision
- ✓Track benchmark results over time to detect regressions from prompt or model changes
- ✓Use LLM-as-judge for open-ended tasks, but validate against human labels first
Don’t
- ✗Don't select models based on public benchmark scores alone
- ✗Don't trust small differences (1-3%) between models on any benchmark -- they are within noise
- ✗Don't assume a model that scores well on coding benchmarks will be good at your specific coding task
- ✗Don't publish your private benchmark data -- it will eventually contaminate training sets
- ✗Don't forget to include edge cases and adversarial examples in your benchmark
Key Takeaways
- ✓Public benchmarks are lossy proxies: MMLU tests recall, HumanEval tests toy coding, neither tests your use case.
- ✓Chatbot Arena Elo is the most ecologically valid general ranking because it uses blind human comparisons.
- ✓Benchmark contamination inflates scores -- models may have memorized test data during training.
- ✓Your own task-specific benchmark with 50-100 production examples is more valuable than any public benchmark.
- ✓Always compare models on your data, across multiple categories (easy, hard, edge case), before making decisions.