Advanced · 14 min

Debugging Retrieval Failures

A systematic method for diagnosing RAG failures across five pipeline stages — from low similarity scores through hallucination. Covers observability setup with LangSmith, failure mode diagnosis, and the debugging loop that converts bug reports into regression tests.

Quick Reference

  • Check which pipeline stage failed before assuming it's an LLM problem
  • LangSmith + LANGCHAIN_TRACING_V2=true (with LANGCHAIN_API_KEY set) gives full traces of LangChain calls with zero code changes
  • Vocabulary mismatch: keyword search finds the doc, vector search doesn't — add hybrid search
  • Silent failure: high similarity score, plausible-but-wrong document — most dangerous mode
  • Anti-hallucination prompt: 'Answer ONLY from the context. If not present, say so.'
  • Build a regression test from every bug report before closing the ticket
  • Alert on deviations from YOUR baseline, not arbitrary industry benchmarks

Triage: Which Stage Is Failing?

A RAG pipeline has five distinct failure points: the vector search itself, the metadata scope filter, the reranking pass, the context assembly, and the LLM generation step. Most debugging guides reduce this to 'retrieval vs generation' — a binary that misses three intermediate failure modes. When your system returns wrong answers, work through the five stages in order, stopping as soon as you find the break.
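The "work through the stages in order and stop at the first break" rule can be sketched as a small runner. This is a minimal illustration, not part of any library: the `Stage` class, the `first_failing_stage` helper, and the hard-coded check results are all hypothetical stand-ins for whatever checks your pipeline exposes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stage:
    name: str
    check: Callable[[], bool]  # returns True when the stage passes
    failure_mode: str          # label to report when the check fails


def first_failing_stage(stages: List[Stage]) -> Optional[Stage]:
    """Run checks in pipeline order and return the first stage that fails."""
    for stage in stages:
        if not stage.check():
            # Stop here: downstream checks are not meaningful until this is fixed.
            return stage
    return None


# Hypothetical check results for a single bug report.
stages = [
    Stage("1. Vector Search", lambda: True, "Low similarity scores"),
    Stage("2. Metadata Filter", lambda: False, "Wrong scope retrieved"),
    Stage("3. Reranking", lambda: True, "Poor reranking order"),
    Stage("4. Context Assembly", lambda: True, "Answer not in context"),
    Stage("5. Generation", lambda: True, "Hallucination"),
]

broken = first_failing_stage(stages)
print(f"{broken.name}: {broken.failure_mode}" if broken else "All stages pass")
```

In practice each lambda would be replaced by a real check against a failing query, so a bug report always resolves to exactly one stage to investigate first.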

Flowchart: Wrong Answer Reported, then five checks in order, ending at Correct Answer if every check passes.

  1. Vector Search. Check: avg similarity score ≥ 0.4? On FAIL: Low Similarity Scores (embedding mismatch or wrong chunk size).
  2. Metadata Filter. Check: returned docs match intended scope? On FAIL: Wrong Scope Retrieved (missing or inverted metadata filter).
  3. Reranking. Check: most relevant docs in positions 1–3? On FAIL: Poor Reranking Order (cross-encoder threshold needs tuning).
  4. Context Assembly. Check: answer text present in context? On FAIL: Answer Not in Context (chunks too small; increase overlap).
  5. Generation. Check: answer grounded in retrieved docs? On FAIL: Hallucination (prompt not constraining to context).

Stop at the first failing stage: fixing downstream stages won't help if the break is upstream.

Stage | What to Check | How to Check
1. Vector Search | Does any top-10 result have similarity ≥ 0.4 to the correct document? | vectorstore.similarity_search_with_score(query, k=10), then print the scores
2. Metadata Filter | Does removing all filters make the correct document appear? | Re-run without filter kwargs; if it appears now, the filter is the bug
3. Reranking | Are the top-3 results after reranking actually the most relevant? | cross_encoder.predict([(query, doc.page_content) for doc in retrieved])
4. Context Assembly | Is the answer text present anywhere in the assembled context string? | Python: any(keyword in context for keyword in expected_answer.split())
5. Generation | Does the answer claim facts not present in the retrieved context? | LangSmith trace: compare assembled context vs generated answer
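The stage 4 one-liner, `any(keyword in context ...)`, passes as soon as a single word or number from the expected answer appears in the context, so it can report a pass for an unrelated chunk. A possible stricter variant is sketched below. It measures what fraction of the expected answer's content words appear in the context. The `answer_coverage` helper, the stopword list, and the example strings are assumptions made for illustration, and any pass threshold would need tuning on your own data.

```python
import re
from typing import Set

# Small illustrative stopword list (an assumption); extend it for your domain.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or", "for", "on", "it"}


def content_tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def answer_coverage(expected_answer: str, context: str) -> float:
    """Fraction of the expected answer's content tokens found in the context."""
    expected = content_tokens(expected_answer)
    if not expected:
        return 0.0
    return len(expected & content_tokens(context)) / len(expected)


context = "Refunds are processed within 14 days of the return being received."
expected_answer = "Refunds are processed within 14 days"
unrelated_answer = "The warranty is valid for 14 months"

# The naive check passes for the unrelated answer because "14" appears in both.
naive_pass = any(keyword in context for keyword in unrelated_answer.split())

print("coverage (expected): ", answer_coverage(expected_answer, context))
print("coverage (unrelated):", answer_coverage(unrelated_answer, context))
print("naive check on unrelated answer:", naive_pass)
```

Token overlap is still a heuristic: it ignores paraphrase and word order, so treat a low score as a prompt to read the assembled context rather than as proof of a stage 4 failure.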
Start at stage 1, always

Engineers routinely skip to stage 5 and spend a sprint tuning prompts when a broken metadata filter at stage 2 is silently scoping every query to the wrong tenant. The five-stage check takes 15 minutes. The wrong treatment wastes a sprint.
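The stage 2 check, re-running the query without filters, can be automated along these lines. This is a sketch under stated assumptions: `filter_is_the_bug`, the `SearchFn` signature, and the toy in-memory `toy_search` are hypothetical stand-ins, and with a real vector store you would wrap its own search call (with and without the filter argument) in a function of the same shape.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical search signature: (query, metadata filter or None) -> list of doc ids.
SearchFn = Callable[[str, Optional[Dict[str, str]]], List[str]]


def filter_is_the_bug(
    search: SearchFn,
    query: str,
    metadata_filter: Dict[str, str],
    expected_doc_id: str,
) -> bool:
    """True when the expected document appears only once the filter is removed."""
    with_filter = search(query, metadata_filter)
    without_filter = search(query, None)
    return expected_doc_id not in with_filter and expected_doc_id in without_filter


# Toy in-memory stand-in for a vector store, for illustration only.
DOCS = [
    {"id": "doc-1", "tenant": "acme", "text": "refund policy"},
    {"id": "doc-2", "tenant": "globex", "text": "refund policy"},
]


def toy_search(query: str, metadata_filter: Optional[Dict[str, str]]) -> List[str]:
    hits = [d for d in DOCS if query in d["text"]]
    if metadata_filter:
        hits = [d for d in hits if all(d.get(k) == v for k, v in metadata_filter.items())]
    return [d["id"] for d in hits]


# The query is scoped to the wrong tenant, so doc-1 never comes back.
wrong_tenant = filter_is_the_bug(toy_search, "refund", {"tenant": "globex"}, "doc-1")
right_tenant = filter_is_the_bug(toy_search, "refund", {"tenant": "acme"}, "doc-1")
print("wrong tenant filter flagged:", wrong_tenant)
print("right tenant filter flagged:", right_tenant)
```

If the helper returns False and the document is missing in both runs, the break is more likely at stage 1 than in the filter.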