# Reference-Based vs Reference-Free Evaluation
Should you compare outputs against a gold standard, or evaluate them in isolation? This article covers both paradigms: traditional reference-based metrics (BLEU, ROUGE, BERTScore) and reference-free model-based evaluation. It explains when to use each approach, with practical code for both.
## Quick Reference
- Reference-based: compare output to a known-good answer (BLEU, ROUGE, BERTScore)
- Reference-free: evaluate output quality without a gold standard (LLM judge, classifiers)
- BLEU/ROUGE measure surface overlap — they miss semantic equivalence and penalize valid paraphrases
- BERTScore uses embeddings for semantic similarity — much better than n-gram overlap
- Reference-free is essential when no single correct answer exists (creative, open-ended tasks)
- Best practice: combine both — reference-based for grounding, reference-free for quality
## Reference-Based Metrics: Comparing to a Gold Standard
Reference-based evaluation compares the system's output against a known-good reference answer. This is the oldest paradigm in NLP evaluation, originating from machine translation (BLEU) and summarization (ROUGE). The appeal is objectivity: given a reference, the score is deterministic and cheap to compute. The problem is that language is flexible — many valid answers exist for any question, and n-gram overlap metrics penalize valid paraphrases.
| Metric | How it works | Strengths | Weaknesses |
|---|---|---|---|
| BLEU | Precision of n-gram overlap between output and reference | Fast, deterministic, well-understood | Penalizes valid paraphrases; order-sensitive; poor for short texts |
| ROUGE-L | Longest common subsequence between output and reference | Captures ordering; good for summarization | Still surface-level; misses semantic equivalence |
| BERTScore | Cosine similarity of contextual embeddings (token-level) | Captures semantic similarity; handles paraphrases | Requires a model; slower than n-gram metrics; less interpretable |
| Exact Match | Binary: does the output exactly match the reference? | Simple, unambiguous, fast | Too strict for most tasks; useless for free-form text |
| F1 (token-level) | Token overlap between output and reference (precision + recall) | Balances precision and recall; good for extractive tasks | Ignores word order; treats all tokens equally |
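The two simplest metrics in the table, exact match and token-level F1, can be implemented in a few lines. Here is a minimal sketch (the function names and the whitespace/lowercase normalization are illustrative choices; production implementations typically also strip punctuation and articles):

```python
from collections import Counter


def exact_match(pred: str, ref: str) -> bool:
    """Binary exact match after simple lowercase/whitespace normalization."""
    return pred.strip().lower() == ref.strip().lower()


def token_f1(pred: str, ref: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    # Multiset intersection: each shared token counted at most as often
    # as it appears in either string.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Note that `token_f1` ignores word order entirely, exactly the weakness the table lists: "sat the on cat mat" scores the same as "the cat sat on mat" against the same reference.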
These metrics catch obvious failures (completely wrong answers, empty responses) but miss subtle quality differences. A response that paraphrases the reference perfectly will score low on BLEU but high on BERTScore. Always combine surface-level metrics with semantic metrics.
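The paraphrase penalty is easy to see with the unigram core of BLEU. The sketch below implements clipped unigram precision only (real BLEU combines n-gram precisions up to 4-grams with a brevity penalty, so this is a simplification); the example sentences are invented for illustration:

```python
from collections import Counter


def unigram_precision(pred: str, ref: str) -> float:
    """Clipped unigram precision: the 1-gram building block of BLEU.

    Each predicted token counts as correct at most as many times as it
    appears in the reference.
    """
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    clipped = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    return clipped / len(pred_tokens)


ref = "the medication should be taken twice daily with food"
paraphrase = "take the drug two times a day alongside meals"

# Semantically equivalent, but only "the" overlaps: score is near zero.
print(unigram_precision(paraphrase, ref))
# A verbatim copy scores a perfect 1.0.
print(unigram_precision(ref, ref))
```

An embedding-based metric like BERTScore would rate the paraphrase highly, which is why a semantic metric belongs alongside any n-gram score in your evaluation suite.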