Debugging Retrieval Failures
A systematic method for diagnosing RAG failures across five pipeline stages — from low similarity scores through hallucination. Covers observability setup with LangSmith, failure mode diagnosis, and the debugging loop that converts bug reports into regression tests.
Quick Reference
- Check which pipeline stage failed before assuming it's an LLM problem
- LangSmith with LANGCHAIN_TRACING_V2=true (and a LANGCHAIN_API_KEY) gives full traces for LangChain pipelines with zero code changes
- Vocabulary mismatch: keyword search finds the doc but vector search doesn't, so add hybrid search
- Silent failure: a high similarity score on a plausible-but-wrong document is the most dangerous mode
- Anti-hallucination prompt: 'Answer ONLY from the context. If not present, say so.'
- Build a regression test from every bug report before closing the ticket
- Alert on deviations from YOUR baseline, not arbitrary industry benchmarks
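
The last point, alerting on your own baseline, can be illustrated with a small sketch. It assumes you already log one retrieval metric per day (the hit rates below are made-up illustration values) and uses a z-score threshold of 3, which is an arbitrary choice for this example rather than a recommendation from this guide.

```python
from statistics import mean, stdev

def deviates_from_baseline(history, current, z_threshold=3.0):
    """Return True when `current` is more than `z_threshold` standard
    deviations away from the mean of `history` (your own past values)."""
    if len(history) < 2:
        raise ValueError("need at least two baseline observations")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Hypothetical daily top-1 hit rates; not real measurements.
baseline_hit_rates = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81]

print(deviates_from_baseline(baseline_hit_rates, 0.80))  # False: within normal variation
print(deviates_from_baseline(baseline_hit_rates, 0.55))  # True: far below the baseline
```

The threshold is compared against the spread of your own history, so a system that normally sits at 0.80 and one that normally sits at 0.60 each alert relative to their own behavior.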
Triage: Which Stage Is Failing?
A RAG pipeline has five distinct failure points: the vector search itself, the metadata scope filter, the reranking pass, the context assembly, and the LLM generation step. Most debugging guides reduce this to 'retrieval vs generation' — a binary that misses three intermediate failure modes. When your system returns wrong answers, work through the five stages in order, stopping as soon as you find the break.
Stop at the first failing stage: fixing downstream stages won't help if the break is upstream.
| Stage | What to Check | How to Check |
|---|---|---|
| 1. Vector Search | Is the correct document in the top 10 with a similarity score ≥ 0.4? | vectorstore.similarity_search_with_score(query, k=10), then print the scores (some stores return a distance, where lower is better) |
| 2. Metadata Filter | Does removing all filters make the correct document appear? | Re-run without filter kwargs — if it appears now, the filter is the bug |
| 3. Reranking | Are the top-3 results after reranking actually the most relevant? | cross_encoder.predict([(query, doc.page_content) for doc in retrieved]) |
| 4. Context Assembly | Is the answer text present anywhere in the assembled context string? | Python: all(keyword in context for keyword in expected_answer.split()) (any() would pass on a single common word) |
| 5. Generation | Does the answer claim facts not present in the retrieved context? | LangSmith trace: compare assembled context vs generated answer |
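
The stage 2 check in the table, re-running the query without filters, might look like the sketch below. `FakeStore` and `Doc` are hypothetical stand-ins so the example runs without a real vector store. The `filter` keyword mirrors what several LangChain vector stores accept, but the exact argument name and filter syntax vary by store, so treat it as an assumption. The `id` metadata key is also assumed.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    # Stand-in for a LangChain-style document; only the fields used here.
    page_content: str
    metadata: dict = field(default_factory=dict)

class FakeStore:
    """Hypothetical in-memory store that mimics similarity_search_with_score."""

    def __init__(self, scored_docs):
        # List of (Doc, score) pairs, pre-scored for the demo.
        self._scored_docs = scored_docs

    def similarity_search_with_score(self, query, k=10, filter=None):
        hits = [
            (doc, score)
            for doc, score in self._scored_docs
            if not filter
            or all(doc.metadata.get(key) == val for key, val in filter.items())
        ]
        # Higher score means more similar in this fake store.
        return sorted(hits, key=lambda pair: pair[1], reverse=True)[:k]

def filter_is_hiding(store, query, expected_id, metadata_filter, k=10):
    """True when the expected doc is retrievable without filters but not with them."""
    def ids(results):
        return {doc.metadata.get("id") for doc, _ in results}

    unfiltered = ids(store.similarity_search_with_score(query, k=k))
    filtered = ids(store.similarity_search_with_score(query, k=k, filter=metadata_filter))
    return expected_id in unfiltered and expected_id not in filtered

store = FakeStore([
    (Doc("Refund policy for tenant A", {"id": "doc-42", "tenant": "a"}), 0.71),
    (Doc("Shipping policy for tenant B", {"id": "doc-7", "tenant": "b"}), 0.55),
])

# The query is scoped to the wrong tenant, so the filter hides the right document.
print(filter_is_hiding(store, "refund window", "doc-42", {"tenant": "b"}))  # True
print(filter_is_hiding(store, "refund window", "doc-42", {"tenant": "a"}))  # False
```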
Engineers routinely skip to stage 5 and spend a sprint tuning prompts when a broken metadata filter at stage 2 is silently scoping every query to the wrong tenant. The five-stage check takes 15 minutes. The wrong treatment wastes a sprint.
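
The five-stage check can be sketched as an ordered list of tests that stops at the first failure. This is a minimal illustration, not a complete diagnostic. The `trace` dictionary and its field names are hypothetical, filled in by hand or from a LangSmith trace for one failing query. Stage 1 is judged on unfiltered results here so that a filter bug is attributed to stage 2, which is an interpretation of the table rather than something it states. Stage 5 is left to manual inspection.

```python
# Hypothetical record of one failing query; field names are assumptions.
trace = {
    "expected_id": "doc-42",
    "unfiltered": [("doc-42", 0.71), ("doc-7", 0.55)],  # (doc id, similarity), no filters
    "filtered": [("doc-7", 0.55), ("doc-9", 0.41)],     # same query with production filters
    "reranked_top3": ["doc-7", "doc-9"],
    "context": "Shipping takes 5 business days.",
    "expected_answer": "30 days",
}

def vector_search_ok(t, min_score=0.4):
    # Stage 1: the correct document is retrievable with a usable score.
    return any(
        doc_id == t["expected_id"] and score >= min_score
        for doc_id, score in t["unfiltered"]
    )

def metadata_filter_ok(t):
    # Stage 2: the correct document survives the production filters.
    return any(doc_id == t["expected_id"] for doc_id, _ in t["filtered"])

def reranking_ok(t):
    # Stage 3: the correct document is still in the top 3 after reranking.
    return t["expected_id"] in t["reranked_top3"]

def context_ok(t):
    # Stage 4: every term of the expected answer appears in the context.
    # Substring matching is crude; it is only a first-pass signal.
    context = t["context"].lower()
    return all(term in context for term in t["expected_answer"].lower().split())

STAGES = [
    ("1. Vector Search", vector_search_ok),
    ("2. Metadata Filter", metadata_filter_ok),
    ("3. Reranking", reranking_ok),
    ("4. Context Assembly", context_ok),
]

def first_failing_stage(t):
    """Run the checks in pipeline order and return the first stage that fails."""
    for name, check in STAGES:
        if not check(t):
            return name
    return None  # Stages 1-4 pass; compare context vs answer for stage 5.

print(first_failing_stage(trace))  # 2. Metadata Filter
```

Because the loop returns on the first failure, the reranking and context checks never run for this trace, which matches the advice that downstream fixes are wasted while an upstream stage is broken.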