
Reading Benchmarks Critically

How to interpret LLM benchmarks without being misled. Covers the major benchmarks (MMLU, HumanEval, MATH, Arena Elo), what they actually test, benchmark contamination, and how to build your own task-specific benchmark.

Quick Reference

  • MMLU: 57 academic subjects, multiple choice -- tests breadth of knowledge, not depth
  • HumanEval: 164 Python coding problems -- tests basic code generation, not real-world software engineering
  • MATH: Competition math problems -- tests formal reasoning, far from typical AI engineering tasks
  • Chatbot Arena Elo: human preference rankings from blind comparisons -- most ecologically valid
  • Benchmark contamination (training on test data) inflates scores -- treat all numbers skeptically
  • Your own task-specific benchmark is more valuable than any public benchmark

Major Benchmarks Explained

| Benchmark | What it tests | Format | Top scores (early 2026) |
| --- | --- | --- | --- |
| MMLU | Academic knowledge across 57 subjects | Multiple choice (4 options) | GPT-5.4: ~88%, Claude Sonnet 4.6: ~89%, Gemini Ultra: ~90% |
| HumanEval | Python code generation (164 problems) | Write function from docstring | GPT-5.4: ~92%, Claude Sonnet 4.6: ~93%, DeepSeek V3.2: ~90% |
| MATH | Competition mathematics (12.5K problems) | Multi-step math with proof | o3: ~96%, Claude Sonnet 4.6: ~78%, GPT-5.4: ~76% |
| GSM8K | Grade-school math word problems | Arithmetic word problems | Most frontier models: >95% |
| GPQA | Graduate-level science questions | Multiple choice, expert-level | o3: ~80%, GPT-5.4: ~55% |
| Chatbot Arena Elo | Human preference via blind comparison | Users compare two model outputs | Most trusted ranking of overall quality |

Chatbot Arena is the gold standard

LMSYS Chatbot Arena collects blind A/B comparisons from real users. Because the model identities are hidden, it avoids brand bias. The Elo ratings correlate well with real-world usefulness. As of early 2026, the top Elo models are o3, Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro -- all within a narrow band.
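The rating mechanics can be sketched with the classic Elo update from chess. This is a simplification: Arena's actual methodology differs in its details (it fits ratings statistically over all comparisons rather than updating sequentially), and the K-factor of 32 here is an illustrative choice.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Classic Elo update for one blind pairwise comparison.

    score_a is 1.0 if model A's output was preferred, 0.0 if B's was,
    and 0.5 for a tie. Returns the two updated ratings.
    """
    # Expected score for A given the current rating gap (logistic curve).
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models; A wins the comparison.
a, b = elo_update(1000.0, 1000.0, score_a=1.0)  # -> (1016.0, 984.0)
```

The key property this illustrates: an upset (a low-rated model beating a high-rated one) moves ratings much more than an expected win, which is why Arena rankings stabilize as comparisons accumulate.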

What Benchmarks Actually Test (and Don't Test)

Every benchmark is a lossy proxy for real-world capability. Understanding what they actually measure -- and what they miss -- is essential for interpreting them correctly.

| Benchmark | What it actually tests | What it misses |
| --- | --- | --- |
| MMLU | Ability to recall academic facts in multiple-choice format | Reasoning, nuance, application of knowledge, open-ended tasks |
| HumanEval | Ability to write short, self-contained Python functions | Debugging, large codebase understanding, testing, architecture |
| MATH | Formal mathematical reasoning with known solutions | Real-world quantitative reasoning, estimation, ambiguity |
| GSM8K | Basic arithmetic in word problem format | Saturated (most models score >95%), so it no longer discriminates |
| Arena Elo | Human preference for conversational quality | Task-specific performance, consistency, safety |

  • Multiple-choice benchmarks reward test-taking strategies over genuine understanding
  • Coding benchmarks test isolated function generation, not real software engineering (architecture, debugging, testing)
  • Most benchmarks test single-turn performance -- they don't measure multi-turn conversation quality
  • No major benchmark measures instruction-following consistency across diverse, complex prompts
  • Latency, cost, and reliability are never measured in capability benchmarks
The benchmark leaderboard trap

A 2% improvement on MMLU between Model A and Model B tells you almost nothing about which is better for your production use case. The margin of error on these benchmarks is often larger than the differences between top models. Never make model selection decisions based solely on benchmark deltas.
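The noise claim can be made concrete with a back-of-the-envelope confidence interval on an accuracy score. This sketch uses the normal approximation to the binomial and assumes independent questions; the 500-question subset size is an illustrative assumption, not a property of any particular benchmark.

```python
import math

def accuracy_ci(accuracy: float, n_questions: int, z: float = 1.96):
    """95% confidence interval for a benchmark accuracy score,
    using the normal approximation to the binomial distribution."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n_questions)
    return accuracy - half_width, accuracy + half_width

# On a 500-question eval, 88% accuracy carries roughly +/- 2.8 points of
# sampling noise, so an 88% vs 90% gap between two models is not meaningful.
low, high = accuracy_ci(0.88, 500)
```

With a 50-100 example private benchmark the interval is wider still, which is why per-category breakdowns and inspection of individual failures matter more than a single headline number.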

Benchmark Contamination

Benchmark contamination occurs when test data (or data very similar to it) appears in the model's training set. This inflates scores without improving real-world capability. It is one of the most significant problems in LLM evaluation today.

  • MMLU questions have leaked onto the public internet and are likely in many training datasets
  • HumanEval solutions are published on GitHub -- models may have memorized them rather than learning to code
  • Some model providers have been caught training on benchmark data (intentionally or through broad web scraping)
  • Contamination is hard to detect: exact match is obvious, but paraphrased or structurally similar questions are not
  • The more popular a benchmark becomes, the more likely it is contaminated in future training runs
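A crude verbatim-overlap check can be sketched as follows: count how many of a test item's word n-grams appear exactly in a candidate training document. As the bullets above note, this only catches exact leaks; paraphrased or structurally similar contamination passes right through it. The 8-gram window is a common heuristic choice, not a standard.

```python
def word_ngrams(text: str, n: int = 8) -> set:
    """All contiguous word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(test_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the test item's word n-grams found verbatim in a
    training document: 1.0 = fully memorizable, 0.0 = no exact overlap."""
    test_grams = word_ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & word_ngrams(training_doc, n)) / len(test_grams)
```

In practice you would run this between each benchmark item and a sample of the training corpus; any item with a high score should be discarded or at least reported separately.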
Private benchmarks resist contamination

The most reliable benchmarks are ones you build yourself and never publish. Your test data cannot be contaminated if it has never been on the internet. This is why building a task-specific private benchmark is the single most valuable evaluation investment you can make.

Why Your Use Case Benchmark Matters More

Public benchmarks measure general capabilities. Your application needs specific capabilities. A model that scores 90% on MMLU might score 60% on your specific extraction task, while a cheaper model might score 85%. The only way to know is to build your own benchmark.

  • Your benchmark should represent your actual production distribution -- not a curated set of interesting examples
  • Include easy cases (80% of production traffic), hard cases (15%), and adversarial edge cases (5%)
  • Define clear, measurable success criteria: exact match, semantic similarity, format compliance, etc.
  • 50-100 labeled examples is enough to make confident model comparisons
  • Run evaluations after every prompt change, model update, or system modification
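The guidelines above can be sketched as a minimal exact-match harness. The example data and the `predict` callable are hypothetical placeholders; in a real harness `predict` would wrap your model call.

```python
# Minimal task-specific benchmark: labeled examples tagged by difficulty,
# scored with exact match. Examples here are illustrative stand-ins.
EXAMPLES = [
    {"input": "Invoice #123, total $40", "expected": "40.00", "category": "easy"},
    {"input": "Total: forty dollars",    "expected": "40.00", "category": "hard"},
    {"input": "Total: N/A",              "expected": "",      "category": "edge"},
]

def evaluate(predict, examples):
    """Return overall and per-category exact-match accuracy for a model."""
    correct_by_cat, total_by_cat = {}, {}
    for ex in examples:
        cat = ex["category"]
        hit = predict(ex["input"]) == ex["expected"]
        correct_by_cat[cat] = correct_by_cat.get(cat, 0) + int(hit)
        total_by_cat[cat] = total_by_cat.get(cat, 0) + 1
    scores = {cat: correct_by_cat[cat] / total_by_cat[cat] for cat in total_by_cat}
    scores["overall"] = sum(correct_by_cat.values()) / len(examples)
    return scores
```

Reporting per-category scores alongside the overall number is the point of the category tags: a model can look fine overall while failing every edge case.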

How to Build a Task-Specific Benchmark

Building a good benchmark is an iterative process. Start simple, refine as you learn what matters.

  • Step 1: Collect 20 real production examples and manually label the correct output
  • Step 2: Define your judge function -- how do you determine if a response is correct?
  • Step 3: Run your current model and identify failure patterns
  • Step 4: Add 30-80 more examples, focusing on discovered failure modes
  • Step 5: Split into categories (easy/hard/edge) and tag with failure types
  • Step 6: Run against 3-5 candidate models and compare
  • Step 7: Store results with timestamps -- track quality over time as you iterate
LLM-as-judge

For open-ended tasks where exact match is not possible, use a strong LLM (GPT-5.4, Claude Sonnet 4.6) as a judge. Provide the input, expected output, and actual output, then ask the judge to score on specific criteria (accuracy, completeness, format). LLM judges correlate well with human judgment when given clear rubrics. But always validate your judge against human labels first.
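A rubric-based judge prompt along these lines can be sketched as a pure function; the rubric wording, the 1-5 scale, and the JSON response format are illustrative choices, and the actual API call to the judge model is left out.

```python
def build_judge_prompt(task_input: str, expected: str, actual: str) -> str:
    """Construct a rubric-based grading prompt for a separate judge model.

    The judge sees the input, the reference answer, and the candidate
    answer, and is asked to return one 1-5 score per criterion as JSON.
    """
    return (
        "You are grading a model's answer against a reference answer.\n\n"
        f"Task input:\n{task_input}\n\n"
        f"Reference answer:\n{expected}\n\n"
        f"Candidate answer:\n{actual}\n\n"
        "Score the candidate from 1 to 5 on each criterion:\n"
        "- accuracy: factual agreement with the reference\n"
        "- completeness: covers everything the reference covers\n"
        "- format: follows the requested output format\n\n"
        'Respond with JSON only: {"accuracy": n, "completeness": n, "format": n}'
    )
```

Pinning the judge to named criteria and a machine-parseable output is what makes its scores comparable across runs; free-form "is this good?" judging drifts.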

Don't use the same model as judge and candidate

If you evaluate GPT-5.4 outputs using GPT-5.4 as the judge, you get inflated scores because the model is biased toward its own style. Always use a different model (or human) as the judge.

Best Practices

Do

  • Build your own task-specific benchmark with 50-100 labeled examples from production data
  • Use Chatbot Arena Elo ratings as the most reliable general-purpose model ranking
  • Test every model candidate with your actual data before making a selection decision
  • Track benchmark results over time to detect regressions from prompt or model changes
  • Use LLM-as-judge for open-ended tasks, but validate against human labels first

Don't

  • Don't select models based on public benchmark scores alone
  • Don't trust small differences (1-3%) between models on any benchmark -- they are within noise
  • Don't assume a model that scores well on coding benchmarks will be good at your specific coding task
  • Don't publish your private benchmark data -- it will eventually contaminate training sets
  • Don't forget to include edge cases and adversarial examples in your benchmark

Key Takeaways

  • Public benchmarks are lossy proxies: MMLU tests recall, HumanEval tests toy coding, neither tests your use case.
  • Chatbot Arena Elo is the most ecologically valid general ranking because it uses blind human comparisons.
  • Benchmark contamination inflates scores -- models may have memorized test data during training.
  • Your own task-specific benchmark with 50-100 production examples is more valuable than any public benchmark.
  • Always compare models on your data, across multiple categories (easy, hard, edge case), before making decisions.

Video on this topic

LLM benchmarks are lying to you (and what to do instead)