Evaluating RAG Systems

A production-grade RAG evaluation playbook: two-stage failure diagnosis, retrieval metrics, LLM-as-Judge generation scoring, the updated RAGAS v0.4 API, golden dataset construction, CI threshold gates, and online monitoring.

Quick Reference

→Evaluate retrieval and generation separately — they fail for different reasons, requiring different fixes
→Build a golden dataset of 50-100 queries before tuning parameters — you need a baseline before you can optimize
→Retrieval metrics (precision@k, recall@k, MRR, NDCG) require labeled data; generation metrics (faithfulness, relevance) can be scored with LLM-as-Judge
→RAGAS v0.4 uses ragas.metrics.collections — the old ragas.metrics imports are deprecated and will break in v1.0
→Set per-metric CI thresholds and block merges that cause regressions — evaluation only works if it gates changes
→Monitor 5–10% of production queries with async LLM-as-Judge to detect corpus drift before users report it
→Evaluation itself can lie: judge model bias, distribution mismatch, and golden dataset rot are the three most common reasons eval says good while users say bad

When (and When Not) to Formally Evaluate

Formal RAG evaluation — golden datasets, metric computation, CI gates — has real setup cost. Before building the harness, answer one question: what happens when this system gives a wrong answer? If the answer is 'a team member notices and corrects it' or 'it's a prototype and wrong answers don't matter yet,' you don't need formal eval. If the answer is 'a customer is misled' or 'a compliance report is wrong,' you do.

Situation	Recommendation
< 50 documents, internal team only	Vibes-based is fine — just run 10 queries manually
Comparing two retrieval strategies	Need eval: without metrics you cannot know which is better
Changing chunking, embedding, or index	Need eval: these changes affect the entire pipeline
Public-facing product or customer-facing bot	Need eval + CI gates: regressions will reach users
Medical, legal, financial, compliance domain	Need eval + CI gates + online monitoring: wrong answers have consequences

The vibes trap

Running 5 test queries and eyeballing the answers is not evaluation; it is confirmation bias with extra steps. You will unconsciously choose queries you expect to work. A 50-query golden dataset with computed metrics surfaces failure modes that manual spot-checking routinely misses.

The Two-Stage Mental Model

A RAG pipeline fails in one of two places: retrieval (the wrong documents are found) or generation (the right documents are found, but the answer is wrong). These require different fixes. If retrieval is failing, no amount of prompt engineering will help, because you are handing the LLM irrelevant documents and asking it to answer from them. If generation is failing on good retrieval, the retriever is not the problem; the issue is in the prompt, the model, or how the context is organized. Evaluate each stage independently before changing anything.
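
One way to apply the two-stage model is to tag every evaluated query with the stage that failed and then count the tags. The sketch below is illustrative only: the `EvalRecord` fields, the `diagnose` helper, and the choice of k=5 are assumptions made for this example, and `answer_correct` is assumed to come from a human label or a judge model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated query (hypothetical schema for this sketch)."""
    query: str
    retrieved_ids: list   # document IDs in ranked order, best first
    relevant_ids: set     # labeled relevant document IDs
    answer_correct: bool  # from a human label or a judge model


def diagnose(record: EvalRecord, k: int = 5) -> str:
    """Attribute a result to the retrieval stage, the generation stage, or neither."""
    top_k = record.retrieved_ids[:k]
    retrieval_hit = any(doc_id in record.relevant_ids for doc_id in top_k)
    if not retrieval_hit:
        # Assumption: a correct answer with no relevant document in the
        # top k still counts against retrieval, since it was not grounded.
        return "retrieval"
    if not record.answer_correct:
        return "generation"
    return "ok"


records = [
    EvalRecord("refund window?", ["d3", "d9"], {"d1"}, False),
    EvalRecord("reset password?", ["d2", "d5"], {"d2"}, False),
    EvalRecord("supported regions?", ["d7", "d8"], {"d7"}, True),
]

# Tally failures by stage to see which stage to fix first.
summary = Counter(diagnose(r) for r in records)
print(dict(summary))
```

If most failures land in the retrieval bucket, chunking, embedding, or index changes are the place to look; if they land in the generation bucket, look at the prompt, the model, or context organization.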

Retrieval Metrics That Matter

Retrieval metrics compare your retriever's output against a labeled dataset of queries with known relevant document IDs. Building this dataset is the most important step in retrieval evaluation — and the most commonly skipped. Without it, you're measuring nothing about retrieval and relying entirely on generation metrics to signal retrieval problems.
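
Given such a labeled dataset, the four retrieval metrics named in the quick reference can be computed in a few lines. This is a minimal sketch, assuming binary relevance labels (a document is either relevant or not), no duplicate IDs in the ranked list, and hypothetical function names; graded relevance would need a different gain term in NDCG.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant documents."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document; 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One golden-dataset entry: ranked retriever output and labeled relevant IDs.
retrieved = ["d3", "d1", "d7", "d4", "d9"]
relevant = {"d1", "d4"}

print(round(precision_at_k(retrieved, relevant, 3), 3))  # 0.333
print(round(recall_at_k(retrieved, relevant, 3), 3))     # 0.5
print(round(reciprocal_rank(retrieved, relevant), 3))    # 0.5
print(round(ndcg_at_k(retrieved, relevant, 3), 3))       # 0.387
```

MRR is the mean of `reciprocal_rank` over all queries in the dataset; the other three are usually averaged across queries in the same way.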
