
Evaluating RAG Quality

RAG systems have two failure points: retrieval can return wrong documents, and generation can hallucinate despite correct documents. Evaluating RAG quality requires separating these concerns: measure retrieval quality independently, measure generation quality independently, then measure end-to-end correctness. This article builds a complete RAG evaluation pipeline with metrics from the RAGAS framework and custom scoring.

Quick Reference

  • Dual evaluation: always measure retrieval quality AND generation quality separately
  • Retrieval metrics: precision@k, recall@k, MRR (Mean Reciprocal Rank) for chunk relevance
  • Generation metrics: faithfulness (grounded in retrieved docs), answer relevance, completeness
  • End-to-end metric: does the user get the right answer regardless of retrieval/generation details?
  • Context utilization: does the generator actually use the retrieved chunks, or ignore them?
  • RAGAS framework provides automated RAG evaluation without reference answers
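The retrieval metrics above are simple to compute by hand. A minimal sketch (function and parameter names are illustrative, not from a specific library), assuming `retrieved` is a ranked list of chunk ids and `relevant` is the labeled set of relevant chunk ids:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk; 0.0 if none retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1 / rank
    return 0.0
```

For example, with `retrieved = ["d3", "d1", "d7", "d2"]` and `relevant = {"d1", "d2"}`, precision@2 is 0.5, recall@4 is 1.0, and MRR is 0.5 (first relevant chunk at rank 2). In practice, average each metric over the whole evaluation set.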

Dual Evaluation: Separating Retrieval from Generation

The most common RAG evaluation mistake is evaluating only the final answer. When the answer is wrong, you need to know why: did retrieval fail (wrong documents) or did generation fail (wrong answer despite right documents)? Without separating these, you cannot fix the problem. A retrieval failure needs better embeddings, chunking, or indexing. A generation failure needs better prompts or a better model.
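A harness that scores the two stages independently might look like the sketch below. All names are hypothetical: `retrieval_judge` and `generation_judge` stand in for whatever scorers you use (labeled-chunk metrics, an LLM judge, RAGAS), and the point is only that each stage gets its own score:

```python
def evaluate_rag_example(example, retriever, generator,
                         retrieval_judge, generation_judge):
    """Run one eval example through the pipeline, scoring retrieval
    and generation separately so failures can be attributed to the
    right stage instead of only judging the final answer."""
    chunks = retriever(example["question"])
    answer = generator(example["question"], chunks)
    return {
        "retrieval_score": retrieval_judge(example["question"], chunks),
        "generation_score": generation_judge(answer, chunks),
        "answer": answer,
    }
```

With per-stage scores logged for every example, a drop in end-to-end quality can be traced to whichever score moved.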

The RAG evaluation matrix

There are four possible outcomes:

  • Good retrieval + good generation = correct answer
  • Good retrieval + bad generation = hallucination problem
  • Bad retrieval + good generation = retrieval problem (the model cannot answer from documents it never sees)
  • Bad retrieval + bad generation = both stages need work

Separate evaluation tells you which quadrant you are in.
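Once each stage has a pass/fail verdict (for example, by thresholding the per-stage scores), mapping a result to its quadrant is mechanical. A sketch, with illustrative diagnosis strings:

```python
def rag_quadrant(retrieval_ok: bool, generation_ok: bool) -> str:
    """Map per-stage pass/fail verdicts to the four-quadrant diagnosis."""
    if retrieval_ok and generation_ok:
        return "correct: both stages passed"
    if retrieval_ok:
        return "hallucination problem: fix prompts or model"
    if generation_ok:
        return "retrieval problem: fix embeddings, chunking, or indexing"
    return "both stages need work"
```

Aggregating quadrant counts over an eval set tells you where to invest: a set dominated by the retrieval-failure quadrant will not improve no matter how much prompt engineering you do.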

Data model for RAG evaluation with separate retrieval and generation tracking