Evaluating RAG Systems

A production-grade RAG evaluation playbook: two-stage failure diagnosis, retrieval metrics, LLM-as-Judge generation scoring, the updated RAGAS v0.4 API, golden dataset construction, CI threshold gates, and online monitoring.

Quick Reference

  • Evaluate retrieval and generation separately — they fail for different reasons, requiring different fixes
  • Build a golden dataset of 50-100 queries before tuning parameters — you need a baseline before you can optimize
  • Retrieval metrics (precision@k, recall@k, MRR, NDCG) require labeled data; generation metrics (faithfulness, relevance) can be scored with LLM-as-Judge
  • RAGAS v0.4 uses ragas.metrics.collections — the old ragas.metrics imports are deprecated and will break in v1.0
  • Set per-metric CI thresholds and block merges that cause regressions — evaluation only works if it gates changes
  • Monitor 5–10% of production queries with async LLM-as-Judge to detect corpus drift before users report it
  • Evaluation itself can lie: judge model bias, distribution mismatch, and golden dataset rot are the three most common reasons eval says good while users say bad
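
As a rough illustration of the retrieval metrics named above, the following sketch computes precision@k, recall@k, reciprocal rank (the per-query term that MRR averages), and NDCG@k for a single query. It assumes binary relevance labels and unique document IDs; the function names and the sample IDs are hypothetical and not taken from any particular library.

```python
import math
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by a labeled-relevant document."""
    if k <= 0:
        return 0.0
    return sum(doc_id in relevant for doc_id in retrieved[:k]) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all labeled-relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(doc_id in relevant for doc_id in retrieved[:k]) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical labeled example: one query, ranked results, and its relevant set.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}

scores = {
    "precision@3": precision_at_k(retrieved_ids, relevant_ids, 3),
    "recall@3": recall_at_k(retrieved_ids, relevant_ids, 3),
    "reciprocal_rank": reciprocal_rank(retrieved_ids, relevant_ids),
    "ndcg@3": ndcg_at_k(retrieved_ids, relevant_ids, 3),
}
print(scores)
```

Averaging each value over every query in a labeled set gives the dataset-level numbers; the mean of the reciprocal ranks is MRR.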

When (and When Not) to Formally Evaluate

Formal RAG evaluation — golden datasets, metric computation, CI gates — has real setup cost. Before building the harness, answer one question: what happens when this system gives a wrong answer? If the answer is 'a team member notices and corrects it' or 'it's a prototype and wrong answers don't matter yet,' you don't need formal eval. If the answer is 'a customer is misled' or 'a compliance report is wrong,' you do.

Situation | Recommendation
< 50 documents, internal team only | Vibes-based is fine: just run 10 queries manually
Comparing two retrieval strategies | Need eval: without metrics you cannot know which is better
Changing chunking, embedding, or index | Need eval: these changes affect the entire pipeline
Public-facing product or customer-facing bot | Need eval + CI gates: regressions will reach users
Medical, legal, financial, compliance domain | Need eval + CI gates + online monitoring: wrong answers have consequences

The vibes trap

Running 5 test queries and eyeballing the answers is not evaluation; it's confirmation bias with extra steps. You will unconsciously choose queries you expect to work. A 50-query golden dataset with computed metrics surfaces failure modes that manual spot-checking routinely misses.
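
To make the contrast with eyeballing concrete, here is a minimal sketch of how computed per-query scores from a golden dataset might be reduced to a pass/fail signal. Everything in it is an assumption for illustration: the metric names, the threshold values, the per-query floor, and the sample scores are hypothetical, and the scores are presumed to have been computed elsewhere.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical per-metric minimums; in practice, derive these from your own baseline run.
THRESHOLDS: Dict[str, float] = {"recall_at_5": 0.80, "faithfulness": 0.90}


def gate(rows: List[dict], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric whose mean over the dataset is below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        avg = mean(row[metric] for row in rows)
        if avg < minimum:
            failures.append(f"{metric}: mean {avg:.3f} < threshold {minimum:.2f}")
    return failures


def weak_queries(rows: List[dict], metric: str, floor: float) -> List[str]:
    """List the golden queries whose individual score on `metric` is below `floor`."""
    return [row["query"] for row in rows if row[metric] < floor]


# Made-up scores for four golden queries (a real dataset would have 50 or more).
golden_results = [
    {"query": "refund window?", "recall_at_5": 1.0, "faithfulness": 0.95},
    {"query": "EU data residency?", "recall_at_5": 0.5, "faithfulness": 0.90},
    {"query": "reset 2FA?", "recall_at_5": 1.0, "faithfulness": 1.00},
    {"query": "SLA for tier 2?", "recall_at_5": 0.5, "faithfulness": 0.85},
]

failures = gate(golden_results, THRESHOLDS)
low_recall = weak_queries(golden_results, "recall_at_5", floor=0.75)

for message in failures:
    print("FAIL", message)
print("Queries to inspect:", low_recall)
```

In a CI job, a non-empty `failures` list would typically translate into a nonzero exit code so the merge is blocked, while the per-query list points to where to start debugging.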