Evaluating RAG Quality
RAG systems have two failure points: retrieval can return wrong documents, and generation can hallucinate despite correct documents. Evaluating RAG quality requires separating these concerns: measure retrieval quality independently, measure generation quality independently, then measure end-to-end correctness. This article builds a complete RAG evaluation pipeline with metrics from the RAGAS framework and custom scoring.
Quick Reference
- Dual evaluation: always measure retrieval quality AND generation quality separately
- Retrieval metrics: precision@k, recall@k, and MRR (Mean Reciprocal Rank) for chunk relevance
- Generation metrics: faithfulness (grounded in retrieved docs), answer relevance, completeness
- End-to-end metric: does the user get the right answer regardless of retrieval/generation details?
- Context utilization: does the generator actually use the retrieved chunks, or ignore them?
- RAGAS framework: automated RAG evaluation without reference answers
Dual Evaluation: Separating Retrieval from Generation
The most common RAG evaluation mistake is evaluating only the final answer. When the answer is wrong, you need to know why: did retrieval fail (wrong documents) or did generation fail (wrong answer despite right documents)? Without separating these, you cannot fix the problem. A retrieval failure needs better embeddings, chunking, or indexing. A generation failure needs better prompts or a better model.
There are four possible outcomes: (1) Good retrieval + good generation = correct answer, (2) Good retrieval + bad generation = hallucination problem, (3) Bad retrieval + good generation = retrieval problem (model cannot answer what it cannot see), (4) Bad retrieval + bad generation = both need work. Separate evaluation tells you which quadrant you are in.
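The quadrant logic maps directly to code. A minimal sketch, assuming you have already scored retrieval and generation separately (e.g. with the metrics above and a faithfulness check) and reduced each to a pass/fail judgment; the thresholds and wording are assumptions, not a standard API:

```python
def diagnose(retrieval_ok: bool, generation_ok: bool) -> str:
    """Map separate retrieval/generation verdicts to the four quadrants."""
    if retrieval_ok and generation_ok:
        return "correct answer: no action needed"
    if retrieval_ok:
        # Right documents, wrong answer: the generator is hallucinating.
        return "generation failure: improve prompts or the model"
    if generation_ok:
        # Faithful generation over the wrong documents.
        return "retrieval failure: improve embeddings, chunking, or indexing"
    return "both failed: fix retrieval first, then re-evaluate generation"

print(diagnose(retrieval_ok=True, generation_ok=False))
# generation failure: improve prompts or the model
```

Fixing retrieval first in the both-failed quadrant is the usual order, since generation quality cannot be judged fairly until the model sees the right context.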