Evaluating RAG Systems
Measuring RAG quality systematically: retrieval metrics (precision, recall, MRR, NDCG), generation metrics (faithfulness, relevance), the RAGAS framework, and building golden evaluation datasets.
Quick Reference
- Evaluate retrieval and generation separately — bad retrieval guarantees bad answers
- Retrieval metrics: precision@k, recall@k, MRR, NDCG measure how well you find relevant documents
- Generation metrics: faithfulness (no hallucination), answer relevance, context utilization
- RAGAS: automated RAG evaluation framework that uses LLMs to score without human labels
- Golden dataset: 50-100 queries with known relevant documents and expected answers
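A golden dataset can be as simple as a list of records pairing each query with its known relevant documents and a reference answer. A minimal sketch (the field names and example values here are illustrative, not a required schema):

```python
# Minimal golden-dataset record format (field names are illustrative).
golden_dataset = [
    {
        "query": "How do I reset my API key?",
        "relevant_doc_ids": ["docs/api-keys.md"],       # ground-truth retrieval labels
        "reference_answer": "Regenerate the key from the account settings page.",
    },
    # ... aim for 50-100 queries covering common intents and edge cases
]

# Every record supports both evaluation stages: relevant_doc_ids scores
# retrieval, reference_answer scores generation.
for record in golden_dataset:
    assert record["relevant_doc_ids"] and record["reference_answer"]
```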
Evaluating Retrieval vs Generation
A RAG system can fail at retrieval (wrong documents found), at generation (right documents but wrong answer), or both. You must evaluate each stage independently to know where to invest optimization effort. If your retrieval is perfect but answers are wrong, you have a generation problem (prompt engineering, model choice). If your retrieval is poor, no amount of prompt engineering will fix the answers.
- Level 1: Retrieval evaluation — are the right documents being found? Measure with precision@k, recall@k, MRR, NDCG.
- Level 2: Generation evaluation — given the right documents, is the answer correct? Measure with faithfulness, answer relevance, completeness.
- Level 3: End-to-end evaluation — does the system answer user questions correctly? Measure with human ratings or LLM-as-judge.
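The Level 1 retrieval metrics are straightforward to compute from ranked results plus the golden labels. A self-contained sketch using binary relevance (a document is either relevant or not):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc (0 if none retrieved)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: relevant docs are discounted by log2(rank + 1),
    then normalized by the best possible ordering (all relevant docs first)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(retrieved[:k], start=1)
              if d in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

For example, if the retriever returns `["a", "b", "c", "d"]` and the golden labels are `{"b", "d"}`, then precision@2 and recall@2 are both 0.5, MRR is 0.5 (first hit at rank 2), and NDCG@4 is about 0.65. Averaging each metric over all golden-dataset queries gives the system-level score.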
| Retrieval | Generation | Result |
|---|---|---|
| Wrong | N/A | Wrong answer (retrieval problem) |
| Correct | Correct | Correct answer |
| Correct | Wrong | Wrong answer (generation problem) |
| Partial | Correct | Incomplete but correct answer |
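For the generation side, faithfulness is commonly scored with an LLM-as-judge: the judge sees only the retrieved context and the answer, and checks whether every claim in the answer is supported. A minimal sketch — the prompt wording is illustrative, and `call_llm` stands in for whatever client your provider offers (not a real library API):

```python
def build_faithfulness_prompt(context: str, answer: str) -> str:
    """Assemble a judge prompt asking whether the answer is grounded
    in the retrieved context (illustrative wording, not the RAGAS prompt)."""
    return (
        "You are grading a RAG answer for faithfulness.\n\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Check each claim in the answer against the context. "
        "Reply with only a score from 0 (hallucinated) to 1 (fully supported)."
    )

def judge_faithfulness(context: str, answer: str, call_llm) -> float:
    # call_llm is any function str -> str; swap in your provider's client.
    reply = call_llm(build_faithfulness_prompt(context, answer))
    return float(reply.strip())
```

Because the judge grades against the retrieved context rather than the golden answer, a low faithfulness score points at the generation stage even when retrieval was correct — exactly the "Correct retrieval, wrong generation" row of the table above.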