Evaluating RAG Quality
RAG systems have two failure points: retrieval can return wrong documents, and generation can hallucinate despite correct documents. Evaluating RAG quality requires separating these concerns: measure retrieval quality independently, measure generation quality independently, then measure end-to-end correctness. This article builds a complete RAG evaluation pipeline with metrics from the RAGAS framework and custom scoring.
Quick Reference
- Dual evaluation: always measure retrieval quality AND generation quality separately
- Retrieval metrics: precision@k, recall@k, and MRR (Mean Reciprocal Rank) for chunk relevance
- Generation metrics: faithfulness (grounded in retrieved docs), answer relevance, completeness
- End-to-end metric: does the user get the right answer regardless of retrieval/generation details?
- Context utilization: does the generator actually use the retrieved chunks, or ignore them?
- RAGAS framework: automated RAG evaluation without reference answers
Dual Evaluation: Separating Retrieval from Generation
The most common RAG evaluation mistake is evaluating only the final answer. When the answer is wrong, you need to know why: did retrieval fail (wrong documents) or did generation fail (wrong answer despite right documents)? Without separating these, you cannot fix the problem. A retrieval failure needs better embeddings, chunking, or indexing. A generation failure needs better prompts or a better model.
There are four possible outcomes: (1) Good retrieval + good generation = correct answer, (2) Good retrieval + bad generation = hallucination problem, (3) Bad retrieval + good generation = retrieval problem (model cannot answer what it cannot see), (4) Bad retrieval + bad generation = both need work. Separate evaluation tells you which quadrant you are in.
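The quadrant logic maps directly to code. A minimal sketch, assuming you have already scored retrieval and generation separately (e.g. with the metrics above and a faithfulness check) and reduced each to a pass/fail judgment; the thresholds and wording are assumptions, not a standard API:

```python
def diagnose(retrieval_ok: bool, generation_ok: bool) -> str:
    """Map separate retrieval/generation verdicts to the four quadrants."""
    if retrieval_ok and generation_ok:
        return "correct answer: no action needed"
    if retrieval_ok:
        # Right documents, wrong answer: the generator is hallucinating.
        return "generation failure: improve prompts or the model"
    if generation_ok:
        # Faithful generation over the wrong documents.
        return "retrieval failure: improve embeddings, chunking, or indexing"
    return "both failed: fix retrieval first, then re-evaluate generation"

print(diagnose(retrieval_ok=True, generation_ok=False))
# generation failure: improve prompts or the model
```

Fixing retrieval first in the both-failed quadrant is the usual order, since generation quality cannot be judged fairly until the model sees the right context.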