LLM-as-Judge
Using one LLM to evaluate another's output is the most scalable automated evaluation technique available. This article covers the complete LLM-as-judge pipeline: rubric design, position bias mitigation, calibration against human judgment, judge model selection, and production-ready implementation.
Quick Reference
- LLM-as-judge: use a strong model (GPT-5.4, Claude) to score a weaker model's output
- Rubric design is everything — vague criteria produce inconsistent scores
- Position bias: models prefer the first option in A/B comparisons — always randomize order
- Calibrate judges against human annotations — track judge-human agreement rate
- Use structured output (JSON) for judge responses to enable reliable parsing
- Judge costs are evaluation costs — budget for them explicitly
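The position-bias point above can be sketched in code. This is a minimal illustration, not a library API: `build_pairwise_prompt` and `resolve_verdict` are hypothetical helper names, and the prompt wording is just one reasonable phrasing.

```python
import random

def build_pairwise_prompt(question: str, resp_a: str, resp_b: str) -> tuple[str, bool]:
    """Randomize which response appears first to counter position bias.

    Returns the judge prompt and a flag: True if original A was shown first.
    """
    a_first = random.random() < 0.5
    first, second = (resp_a, resp_b) if a_first else (resp_b, resp_a)
    judge_prompt = (
        f"Question: {question}\n\n"
        f"Response 1:\n{first}\n\n"
        f"Response 2:\n{second}\n\n"
        "Which response is better? Answer with exactly '1' or '2'."
    )
    return judge_prompt, a_first

def resolve_verdict(verdict: str, a_first: bool) -> str:
    """Map the judge's positional verdict back to the original A/B labels."""
    if a_first:
        return "A" if verdict == "1" else "B"
    return "A" if verdict == "2" else "B"
```

The key detail is recording which ordering was shown so the positional verdict can be mapped back; a stronger variant runs the comparison in both orders and only accepts verdicts that agree.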
The Core Concept: Models Evaluating Models
Human evaluation is the gold standard but does not scale. You cannot have humans review 10,000 responses per day. LLM-as-judge bridges this gap: you use a strong model (typically the most capable available) to evaluate the output of another model. The judge model reads the input, the system's output, optionally a reference answer, and scores the output against a rubric. This approach scales to millions of evaluations per day at a fraction of the cost of human review.
LLM-as-judge works best when: (1) the evaluation criteria can be articulated in a rubric, (2) the judge model is significantly more capable than the system being evaluated, and (3) you have calibrated the judge against human annotations. It works poorly for highly subjective tasks, domain-specific expertise the judge lacks, or when evaluating frontier models (you cannot judge GPT-5.4 with GPT-5.4 reliably).
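Calibration (condition 3) reduces to measuring how often the judge agrees with human annotators on a shared sample. A minimal sketch, assuming both sides score on the same integer scale:

```python
def agreement_rate(judge_scores: list[int], human_scores: list[int],
                   tolerance: int = 0) -> float:
    """Fraction of items where judge and human scores agree within `tolerance`.

    tolerance=0 is exact agreement; tolerance=1 counts off-by-one as agreement,
    which is often more informative on 1-5 scales.
    """
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and equal length")
    hits = sum(abs(j - h) <= tolerance
               for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)
```

Tracking this rate over time also catches drift: if a judge model or rubric change drops agreement, the judge's scores stop being trustworthy even though they still look plausible. For ordinal scales, chance-corrected measures such as Cohen's kappa are a common stricter alternative.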