
Evaluating RAG Systems

Measuring RAG quality systematically: retrieval metrics (precision, recall, MRR, NDCG), generation metrics (faithfulness, relevance), the RAGAS framework, and building golden evaluation datasets.

Quick Reference

  • Evaluate retrieval and generation separately — bad retrieval guarantees bad answers
  • Retrieval metrics: precision@k, recall@k, MRR, NDCG measure how well you find relevant documents
  • Generation metrics: faithfulness (no hallucination), answer relevance, context utilization
  • RAGAS: automated RAG evaluation framework that uses LLMs to score without human labels
  • Golden dataset: 50-100 queries with known relevant documents and expected answers
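A golden dataset can be as simple as a list of labeled query records. The sketch below shows one possible shape and a recall@k loop over it; the field names and the `retrieve(query, k)` callable are illustrative assumptions, not a standard schema.

```python
# One illustrative golden-dataset entry; field names are an assumption, not a standard.
golden_set = [
    {
        "query": "What is our refund window?",
        "relevant_doc_ids": {"policy-refunds-v2"},  # docs a perfect retriever returns
        "expected_answer": "Refunds are accepted within 30 days of purchase.",
    },
    # ... in practice, 50-100 entries covering your real query distribution
]

def retrieval_recall(golden_set, retrieve, k=5):
    """Average recall@k over the golden set.

    `retrieve(query, k)` is assumed to return a ranked list of doc ids.
    """
    total = 0.0
    for case in golden_set:
        hits = set(retrieve(case["query"], k)) & case["relevant_doc_ids"]
        total += len(hits) / len(case["relevant_doc_ids"])
    return total / len(golden_set)
```

Because the labels are document ids rather than free text, this check is fully automatic and can run on every retriever change.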

Evaluating Retrieval vs Generation

A RAG system can fail at retrieval (wrong documents found), at generation (right documents but wrong answer), or both. You must evaluate each stage independently to know where to invest optimization effort. If your retrieval is perfect but answers are wrong, you have a generation problem (prompt engineering, model choice). If your retrieval is poor, no amount of prompt engineering will fix the answers.

The evaluation stack

  • Level 1: Retrieval evaluation — are the right documents being found? Measure with precision@k, recall@k, MRR, NDCG.
  • Level 2: Generation evaluation — given the right documents, is the answer correct? Measure with faithfulness, answer relevance, completeness.
  • Level 3: End-to-end evaluation — does the system answer user questions correctly? Measure with human ratings or LLM-as-judge.

Retrieval wrong? | Generation correct? | Result
Yes              | N/A                 | Wrong answer (retrieval problem)
No               | Yes                 | Correct answer
No               | No                  | Wrong answer (generation problem)
Partial          | Yes                 | Incomplete but correct answer
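Once both stages are scored per query, the table above can be applied mechanically to route each failure to the right fix. A hypothetical triage helper (the labels and advice strings are illustrative):

```python
def triage(retrieval, generation_correct):
    """Map per-stage eval results to a failure mode.

    retrieval: "correct", "wrong", or "partial" (from retrieval metrics).
    generation_correct: bool, or None when retrieval already failed.
    Labels and advice strings are illustrative, not a standard taxonomy.
    """
    if retrieval == "wrong":
        return "retrieval problem: fix chunking, embeddings, or reranking"
    if retrieval == "partial":
        return "partial context: answer may be correct but incomplete"
    if not generation_correct:
        return "generation problem: fix the prompt or model choice"
    return "correct answer"
```

Running this over a golden dataset turns raw scores into a histogram of failure modes, which tells you directly whether to spend the next sprint on the retriever or on the prompt.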