Evaluating RAG Systems
A production-grade RAG evaluation playbook: two-stage failure diagnosis, retrieval metrics, LLM-as-Judge generation scoring, the updated RAGAS v0.4 API, golden dataset construction, CI threshold gates, and online monitoring.
Quick Reference
- Evaluate retrieval and generation separately: they fail for different reasons and need different fixes
- Build a golden dataset of 50–100 queries before tuning parameters: you need a baseline before you can optimize
- Retrieval metrics (precision@k, recall@k, MRR, NDCG) require labeled data; generation metrics (faithfulness, relevance) can be scored with LLM-as-Judge
- RAGAS v0.4 uses ragas.metrics.collections; the old ragas.metrics imports are deprecated and will break in v1.0
- Set per-metric CI thresholds and block merges that cause regressions: evaluation only works if it gates changes
- Monitor 5–10% of production queries with async LLM-as-Judge to detect corpus drift before users report it
- Evaluation itself can lie: judge model bias, distribution mismatch, and golden dataset rot are the three most common reasons eval says good while users say bad
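The retrieval metrics named in the list above can be computed in a few lines once each query has a set of labeled relevant chunk IDs. The following is a minimal sketch, not a reference implementation: the function names are illustrative, relevance is assumed to be binary (a chunk is either relevant or not), and retrieved IDs are assumed to be unique and ordered best-first.

```python
import math
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant chunks.

    Divides by k even if fewer than k chunks were returned (one common convention).
    """
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all labeled relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved.

    MRR is the mean of this value over all queries in the dataset.
    """
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k under the binary-relevance assumption (gain is 1 or 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking puts every relevant chunk first, up to k slots.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical single query: the retriever returned four chunks, two are labeled relevant.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(f"precision@3 = {precision_at_k(retrieved, relevant, 3):.3f}")
print(f"recall@3    = {recall_at_k(retrieved, relevant, 3):.3f}")
print(f"RR          = {reciprocal_rank(retrieved, relevant):.3f}")
print(f"NDCG@3      = {ndcg_at_k(retrieved, relevant, 3):.3f}")
```

Averaging each function over every query in the golden dataset gives the dataset-level number that a CI threshold would be compared against. If your labels are graded rather than binary, the NDCG gain term would need to change accordingly.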
When (and When Not) to Formally Evaluate
Formal RAG evaluation (golden datasets, metric computation, CI gates) has a real setup cost. Before building the harness, answer one question: what happens when this system gives a wrong answer? If the answer is 'a team member notices and corrects it' or 'it's a prototype and wrong answers don't matter yet,' you don't need formal eval. If the answer is 'a customer is misled' or 'a compliance report is wrong,' you do.
| Situation | Recommendation |
|---|---|
| Fewer than 50 documents, internal team only | Vibes-based checking is fine: just run 10 queries manually |
| Comparing two retrieval strategies | Need eval: without metrics you cannot know which is better |
| Changing chunking, embedding, or index | Need eval: these changes affect the entire pipeline |
| Public-facing product or customer-facing bot | Need eval + CI gates: regressions will reach users |
| Medical, legal, financial, compliance domain | Need eval + CI gates + online monitoring: wrong answers have consequences |
Running 5 test queries and eyeballing the answers is not evaluation; it's confirmation bias with extra steps. You will unconsciously choose queries you expect to work. A 50-query golden dataset with computed metrics surfaces failure modes that manual spot-checking routinely misses.
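To make the contrast with spot-checking concrete, here is a minimal sketch of running every golden query through a retriever and listing the ones that fail. Everything named here is hypothetical: `GoldenExample`, `run_golden_set`, the `kb-*` chunk IDs, and `fake_retrieve`, which stands in for whatever retriever your system actually exposes.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class GoldenExample:
    """One hand-labeled query (field names are illustrative)."""
    query: str
    relevant_ids: Set[str]  # IDs of chunks a human judged to answer the query


def run_golden_set(
    golden: List[GoldenExample],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> Tuple[float, List[str]]:
    """Return (hit rate@k, queries with no relevant chunk in the top k)."""
    if not golden:
        raise ValueError("golden dataset is empty")
    misses = [
        ex.query
        for ex in golden
        if not ex.relevant_ids.intersection(retrieve(ex.query)[:k])
    ]
    hit_rate = 1.0 - len(misses) / len(golden)
    return hit_rate, misses


# Toy stand-in for a real retriever: maps each query to a ranked list of chunk IDs.
_FAKE_INDEX = {
    "How do I reset my password?": ["kb-12", "kb-40"],
    "What is the refund window?": ["kb-07", "kb-12"],
    "Can I export my data?": ["kb-33"],
}


def fake_retrieve(query: str) -> List[str]:
    return _FAKE_INDEX.get(query, [])


golden = [
    GoldenExample("How do I reset my password?", {"kb-12"}),
    GoldenExample("What is the refund window?", {"kb-21"}),
    GoldenExample("Can I export my data?", {"kb-33"}),
]

hit_rate, misses = run_golden_set(golden, fake_retrieve, k=2)
print(f"hit rate@2 = {hit_rate:.2f}")
for query in misses:
    print("MISS:", query)
```

The list of missed queries is the part that spot-checking does not give you: it names the specific queries to investigate instead of leaving you with an overall impression. Hit rate is used here only because it is the simplest aggregate; the same loop works with any per-query metric.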