
Designing Human Evaluation

Automated evaluation scales but misses nuance. Human evaluation catches what LLM judges cannot — subtle quality issues, user experience problems, and domain-specific errors. This article covers the complete design of human evaluation: annotation guidelines, inter-annotator agreement, quality control, scaling strategies, and cost management.

Quick Reference

  • Annotation guidelines must be specific, with examples of each score level and edge case handling
  • Inter-annotator agreement (Cohen's kappa > 0.6) validates that your guidelines are clear enough (see the agreement-check sketch after this list)
  • Use 2+ annotators per example and resolve disagreements through discussion, not majority vote
  • Attention checks (known-answer items) catch careless annotators in crowdsourced evaluation
  • Expert review is 5-10x more expensive but catches domain errors that crowdworkers miss
  • Budget 30-60 minutes of annotation time per 100 examples (after training)
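
As a quick illustration of the agreement check mentioned above, here is a minimal sketch assuming two annotators have scored the same examples on a 1-5 scale. The annotator scores and the 0.6 threshold handling are illustrative; it uses scikit-learn's cohen_kappa_score.

```python
# Minimal sketch: check inter-annotator agreement before trusting the guidelines.
# Assumes two annotators scored the same examples on a 1-5 scale (illustrative data).
from sklearn.metrics import cohen_kappa_score

annotator_a = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
annotator_b = [5, 4, 3, 2, 3, 4, 1, 4, 3, 3]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Rule of thumb used in this article: kappa > 0.6 suggests the guidelines
# are clear enough; below that, revise the rubric and re-train annotators.
if kappa <= 0.6:
    print("Agreement too low - revise guidelines before scaling annotation.")
```

For ordinal scales like 1-5 ratings, passing weights="quadratic" to cohen_kappa_score gives partial credit for near-miss disagreements, which is often a better fit than treating every disagreement as equally severe.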

Annotation Guidelines: The Foundation of Reliable Human Eval

The quality of your human evaluation is entirely determined by the quality of your annotation guidelines. Vague guidelines produce inconsistent annotations. Specific guidelines with examples, edge case handling, and scoring anchors produce reproducible results. Invest more time in guideline design than in actual annotation — it pays for itself in data quality.

The five-part annotation guideline

Every annotation guideline should include five parts:

  • Task description: what you are evaluating and why
  • Scoring rubric: each dimension with a numerical scale
  • Anchored examples: at least 2 examples per score level showing what each score looks like
  • Edge case instructions: explicit handling for ambiguous, incomplete, or unusual responses
  • Disqualification criteria: when to flag an example as 'cannot annotate' rather than guessing

Structured annotation guideline with anchored examples for each score level
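
The sketch below shows one hypothetical way to encode such a guideline as data that an annotation tool could render. The task, rubric dimension, anchored examples, and edge-case rules are invented for illustration and are not taken from a real rubric; only three of the five score levels are shown to keep it short.

```python
# Hypothetical sketch: a five-part annotation guideline encoded as data.
# Task, dimension, anchors, and edge-case rules below are illustrative only.
GUIDELINE = {
    # (1) Task description: what is being evaluated and why.
    "task": "Rate the factual accuracy of each model response to a support question.",
    # (2) Scoring rubric: one dimension with a numerical scale.
    "rubric": {
        "dimension": "factual_accuracy",
        "scale": [1, 2, 3, 4, 5],
    },
    # (3) Anchored examples: at least 2 per score level (levels 2 and 4 omitted here).
    "anchored_examples": {
        5: [
            "States the correct refund window (30 days) and cites the policy page.",
            "Correctly lists all three supported payment methods.",
        ],
        3: [
            "Gives the right refund window but invents a nonexistent form name.",
            "Mostly correct but omits a required eligibility condition.",
        ],
        1: [
            "States a 90-day refund window; the actual policy is 30 days.",
            "Recommends a product the company does not sell.",
        ],
    },
    # (4) Edge case instructions: explicit handling instead of annotator guesswork.
    "edge_cases": {
        "partial_answer": "Score only the claims that are made; do not penalize brevity here.",
        "ambiguous_question": "Score against the most reasonable interpretation and leave a note.",
    },
    # (5) Disqualification criteria: flag as 'cannot annotate' rather than guessing.
    "disqualification": [
        "Response is in a language the annotator does not read.",
        "Required context (the source document) is missing from the task.",
    ],
}
```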