LLM-as-Judge
Using one LLM to evaluate another's output is the most scalable automated evaluation technique available. This article covers the complete LLM-as-judge pipeline: rubric design, position bias mitigation, calibration against human judgment, judge model selection, and production-ready implementation.
Quick Reference
- LLM-as-judge: use a strong model (GPT-5.4, Claude) to score a weaker model's output
- Rubric design is everything — vague criteria produce inconsistent scores
- Position bias: models prefer the first option in A/B comparisons — always randomize order
- Calibrate judges against human annotations — track judge-human agreement rate
- Use structured output (JSON) for judge responses to enable reliable parsing
- Judge costs are evaluation costs — budget for them explicitly
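The position-bias point above can be sketched in code. This is a minimal illustration, not a library API: `build_pairwise_prompt` and `resolve_verdict` are hypothetical helper names, and the prompt wording is just one reasonable phrasing.

```python
import random

def build_pairwise_prompt(question: str, resp_a: str, resp_b: str) -> tuple[str, bool]:
    """Randomize which response appears first to counter position bias.

    Returns the judge prompt and a flag: True if original A was shown first.
    """
    a_first = random.random() < 0.5
    first, second = (resp_a, resp_b) if a_first else (resp_b, resp_a)
    judge_prompt = (
        f"Question: {question}\n\n"
        f"Response 1:\n{first}\n\n"
        f"Response 2:\n{second}\n\n"
        "Which response is better? Answer with exactly '1' or '2'."
    )
    return judge_prompt, a_first

def resolve_verdict(verdict: str, a_first: bool) -> str:
    """Map the judge's positional verdict back to the original A/B labels."""
    if a_first:
        return "A" if verdict == "1" else "B"
    return "A" if verdict == "2" else "B"
```

The key detail is recording which ordering was shown so the positional verdict can be mapped back; a stronger variant runs the comparison in both orders and only accepts verdicts that agree.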
The Core Concept: Models Evaluating Models
Human evaluation is the gold standard but does not scale. You cannot have humans review 10,000 responses per day. LLM-as-judge bridges this gap: you use a strong model (typically the most capable available) to evaluate the output of another model. The judge model reads the input, the system's output, optionally a reference answer, and scores the output against a rubric. This approach scales to millions of evaluations per day at a fraction of the cost of human review.
LLM-as-judge works best when: (1) the evaluation criteria can be articulated in a rubric, (2) the judge model is significantly more capable than the system being evaluated, and (3) you have calibrated the judge against human annotations. It works poorly for highly subjective tasks, domain-specific expertise the judge lacks, or when evaluating frontier models (you cannot judge GPT-5.4 with GPT-5.4 reliably).
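Calibration (condition 3) reduces to measuring how often the judge agrees with human annotators on a shared sample. A minimal sketch, assuming both sides score on the same integer scale:

```python
def agreement_rate(judge_scores: list[int], human_scores: list[int],
                   tolerance: int = 0) -> float:
    """Fraction of items where judge and human scores agree within `tolerance`.

    tolerance=0 is exact agreement; tolerance=1 counts off-by-one as agreement,
    which is often more informative on 1-5 scales.
    """
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and equal length")
    hits = sum(abs(j - h) <= tolerance
               for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)
```

Tracking this rate over time also catches drift: if a judge model or rubric change drops agreement, the judge's scores stop being trustworthy even though they still look plausible. For ordinal scales, chance-corrected measures such as Cohen's kappa are a common stricter alternative.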