# Reference-Based vs Reference-Free Evaluation
Should you compare outputs against a gold standard, or evaluate them in isolation? This article covers both paradigms: traditional reference-based metrics (BLEU, ROUGE, BERTScore) and reference-free model-based evaluation. It explains when to use each approach, with practical code for both.
## Quick Reference
- Reference-based: compare output to a known-good answer (BLEU, ROUGE, BERTScore)
- Reference-free: evaluate output quality without a gold standard (LLM judge, classifiers)
- BLEU/ROUGE measure surface overlap — they miss semantic equivalence and penalize valid paraphrases
- BERTScore uses embeddings for semantic similarity — much better than n-gram overlap
- Reference-free is essential when no single correct answer exists (creative, open-ended tasks)
- Best practice: combine both — reference-based for grounding, reference-free for quality
## Reference-Based Metrics: Comparing to a Gold Standard
Reference-based evaluation compares the system's output against a known-good reference answer. This is the oldest paradigm in NLP evaluation, originating from machine translation (BLEU) and summarization (ROUGE). The appeal is objectivity: given a reference, the score is deterministic and cheap to compute. The problem is that language is flexible — many valid answers exist for any question, and n-gram overlap metrics penalize valid paraphrases.
| Metric | How it works | Strengths | Weaknesses |
|---|---|---|---|
| BLEU | Precision of n-gram overlap between output and reference | Fast, deterministic, well-understood | Penalizes valid paraphrases; order-sensitive; poor for short texts |
| ROUGE-L | Longest common subsequence between output and reference | Captures ordering; good for summarization | Still surface-level; misses semantic equivalence |
| BERTScore | Cosine similarity of contextual embeddings (token-level) | Captures semantic similarity; handles paraphrases | Requires a model; slower than n-gram metrics; less interpretable |
| Exact Match | Binary: does the output exactly match the reference? | Simple, unambiguous, fast | Too strict for most tasks; useless for free-form text |
| F1 (token-level) | Token overlap between output and reference (precision + recall) | Balances precision and recall; good for extractive tasks | Ignores word order; treats all tokens equally |
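The two simplest metrics in the table, exact match and token-level F1, can be implemented in a few lines. Here is a minimal sketch (the function names and the whitespace/lowercase normalization are illustrative choices; production implementations typically also strip punctuation and articles):

```python
from collections import Counter


def exact_match(pred: str, ref: str) -> bool:
    """Binary exact match after simple lowercase/whitespace normalization."""
    return pred.strip().lower() == ref.strip().lower()


def token_f1(pred: str, ref: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    # Multiset intersection: each shared token counted at most as often
    # as it appears in either string.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Note that `token_f1` ignores word order entirely, exactly the weakness the table lists: "sat the on cat mat" scores the same as "the cat sat on mat" against the same reference.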
These metrics catch obvious failures (completely wrong answers, empty responses) but miss subtle quality differences. A response that paraphrases the reference perfectly will score low on BLEU but high on BERTScore. Always combine surface-level metrics with semantic metrics.
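The paraphrase penalty is easy to see with the unigram core of BLEU. The sketch below implements clipped unigram precision only (real BLEU combines n-gram precisions up to 4-grams with a brevity penalty, so this is a simplification); the example sentences are invented for illustration:

```python
from collections import Counter


def unigram_precision(pred: str, ref: str) -> float:
    """Clipped unigram precision: the 1-gram building block of BLEU.

    Each predicted token counts as correct at most as many times as it
    appears in the reference.
    """
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    clipped = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    return clipped / len(pred_tokens)


ref = "the medication should be taken twice daily with food"
paraphrase = "take the drug two times a day alongside meals"

# Semantically equivalent, but only "the" overlaps: score is near zero.
print(unigram_precision(paraphrase, ref))
# A verbatim copy scores a perfect 1.0.
print(unigram_precision(ref, ref))
```

An embedding-based metric like BERTScore would rate the paraphrase highly, which is why a semantic metric belongs alongside any n-gram score in your evaluation suite.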