Building Evaluation Datasets
Your evaluation is only as good as your dataset. This article covers the complete lifecycle of building eval datasets: manually curating golden sets, generating synthetic data with stronger models, crafting adversarial examples, avoiding contamination, and determining the right dataset size for statistical significance.
Quick Reference
- →Golden datasets: manually curated, high-quality question-answer pairs — the bedrock of evaluation
- →Synthetic generation uses a stronger model (e.g., GPT-5.4) to create eval data at scale
- →Adversarial examples expose edge cases — typos, ambiguity, prompt injection attempts, boundary conditions
- →Contamination: never let eval data leak into training / prompt examples / few-shot sets
- →For 95% confidence with 5% margin of error, you need ~385 eval samples minimum
- →Refresh eval datasets quarterly with real production queries to combat distribution shift
Golden Datasets: The Bedrock of Evaluation
A golden dataset is a manually curated collection of input-output pairs where the expected output has been verified by domain experts. This is the most trustworthy form of evaluation data because every answer has been human-validated. The downside is cost: creating golden datasets is slow, expensive, and does not scale. But you need them — they are the calibration standard against which all other evaluation methods are benchmarked.
You do not need thousands of golden examples to start. 50 carefully curated examples covering your main use cases give you enough signal to catch major regressions. Grow to 200-500 as your system matures. Quality of annotations matters far more than quantity.
- ▸Include metadata for every example: who annotated it, when, what source material they used
- ▸Categorize examples by type (factual, reasoning, creative, safety) and difficulty to analyze performance slices
- ▸Version your golden datasets in git — they are as important as your code
- ▸Have at least two annotators independently verify each example to catch annotation errors