Debugging Retrieval Failures

A systematic method for diagnosing RAG failures across five pipeline stages — from low similarity scores through hallucination. Covers observability setup with LangSmith, failure mode diagnosis, and the debugging loop that converts bug reports into regression tests.

Quick Reference

→ Check which pipeline stage failed before assuming it's an LLM problem
→ LangSmith: setting LANGCHAIN_TRACING_V2=true (with a LangSmith API key configured) gives full traces of LangChain runs with zero code changes
→ Vocabulary mismatch: keyword search finds the doc but vector search doesn't; add hybrid search
→ Silent failure: high similarity score on a plausible-but-wrong document; the most dangerous mode
→ Anti-hallucination prompt: 'Answer ONLY from the context. If not present, say so.'
→ Build a regression test from every bug report before closing the ticket
→ Alert on deviations from YOUR baseline, not arbitrary industry benchmarks

Triage: Which Stage Is Failing?

A RAG pipeline has five distinct failure points: the vector search itself, the metadata scope filter, the reranking pass, the context assembly, and the LLM generation step. Most debugging guides reduce this to 'retrieval vs generation' — a binary that misses three intermediate failure modes. When your system returns wrong answers, work through the five stages in order, stopping as soon as you find the break.

Stop at the first failing stage: fixing downstream stages won't help if the break is upstream.

Stage	What to Check	How to Check
1. Vector Search	Is the correct document among the top-10 results, with a similarity score ≥ 0.4?	vectorstore.similarity_search_with_score(query, k=10), then print the scores (some vector stores return a distance here, where lower means closer)
2. Metadata Filter	Does removing all filters make the correct document appear?	Re-run without filter kwargs — if it appears now, the filter is the bug
3. Reranking	Are the top-3 results after reranking actually the most relevant?	cross_encoder.predict([(query, doc.page_content) for doc in retrieved])
4. Context Assembly	Is the answer text present anywhere in the assembled context string?	Python: all(term.lower() in context.lower() for term in key_terms), where key_terms are the distinctive terms of the expected answer
5. Generation	Does the answer claim facts not present in the retrieved context?	LangSmith trace: compare assembled context vs generated answer
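
The ordered check in the table can be sketched as a small runner that stops at the first failing stage. This is a minimal illustration, not a definitive implementation: `first_failing_stage`, `context_contains_terms`, and the lambda checks are hypothetical names and placeholders, and in a real pipeline each check would call your own vector store, reranker, or trace data.

```python
from typing import Callable, Optional

# Each stage pairs a label with a zero-argument check that returns True when
# the stage looks healthy. The checks used below are hypothetical placeholders.
Stage = tuple[str, Callable[[], bool]]


def first_failing_stage(stages: list[Stage]) -> Optional[str]:
    """Run checks in pipeline order and return the first stage that fails."""
    for name, check in stages:
        if not check():
            return name  # stop here: downstream checks are not informative yet
    return None  # every stage passed


def context_contains_terms(context: str, key_terms: list[str]) -> bool:
    """Stage 4 helper: are all distinctive answer terms present in the context?"""
    lowered = context.lower()
    return all(term.lower() in lowered for term in key_terms)


# Illustrative values standing in for real pipeline output.
context = "Refunds are processed within 14 days of the return request."
stages: list[Stage] = [
    ("1. Vector Search", lambda: True),
    ("2. Metadata Filter", lambda: True),
    ("3. Reranking", lambda: True),
    # The expected answer mentions "30 days", which the context does not contain.
    ("4. Context Assembly", lambda: context_contains_terms(context, ["refunds", "30 days"])),
    ("5. Generation", lambda: True),
]

print(first_failing_stage(stages))
```

Because the runner returns as soon as one check fails, a stage 5 check never runs against a context that was already wrong at stage 4.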

Start at stage 1, always

Engineers routinely skip to stage 5 and spend a sprint tuning prompts when a broken metadata filter at stage 2 is silently scoping every query to the wrong tenant. The five-stage check takes 15 minutes. The wrong treatment wastes a sprint.
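
The stage 2 check described above amounts to running the same query with and without the filter and comparing the results. The sketch below assumes a hypothetical search callable of the form `(query, metadata_filter) -> list of document ids`; `toy_search` and the `DOCS` list are stand-ins for a real vector store, and the toy search ignores the query text because only the filtering behaviour matters here.

```python
from typing import Callable, Optional

# Hypothetical search signature: (query, metadata_filter) -> ranked document ids.
SearchFn = Callable[[str, Optional[dict]], list[str]]


def filter_hides_document(
    search: SearchFn, query: str, metadata_filter: dict, expected_id: str
) -> bool:
    """Return True when the expected document appears only without the filter."""
    filtered = search(query, metadata_filter)
    unfiltered = search(query, None)
    return expected_id in unfiltered and expected_id not in filtered


# Toy in-memory index standing in for a real vector store.
DOCS = [
    {"id": "refund-policy", "tenant": "acme"},
    {"id": "shipping-faq", "tenant": "globex"},
]


def toy_search(query: str, metadata_filter: Optional[dict]) -> list[str]:
    """Ignore the query text and apply exact-match metadata filtering only."""
    hits = DOCS
    if metadata_filter:
        hits = [
            d for d in hits
            if all(d.get(key) == value for key, value in metadata_filter.items())
        ]
    return [d["id"] for d in hits]


# A filter scoped to the wrong tenant hides the document the user needed.
wrong_tenant = filter_hides_document(
    toy_search, "refund window", {"tenant": "globex"}, "refund-policy"
)
print(wrong_tenant)
```

If the function returns True, the filter is the bug and prompt tuning at stage 5 would not have helped.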

Setting Up Observability

You cannot debug a failure you have no trace of. A user reports a wrong answer on Tuesday — without traces, you have to hope it reproduces. With LangSmith enabled from day one, you pull the exact trace: query, retrieved docs with scores, assembled context, and generated answer. The window to diagnose the original failure closes fast; the trace keeps it open.
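
As a minimal sketch of turning tracing on, the environment variables below can be set before any LangChain code runs. The API key value is a placeholder and the project name `rag-debugging` is a hypothetical example; in practice both would usually come from your deployment configuration or secret store rather than from code.

```python
import os

# LangChain reads these environment variables at run time to decide whether
# to send traces to LangSmith.
os.environ["LANGCHAIN_TRACING_V2"] = "true"

# Placeholder only: supply a real key from your secret store, never hard-code it.
os.environ.setdefault("LANGCHAIN_API_KEY", "<your-langsmith-api-key>")

# Optional: group traces under a project. The name here is a hypothetical example.
os.environ.setdefault("LANGCHAIN_PROJECT", "rag-debugging")
```

Setting the variables in the process environment (shell profile, container config) achieves the same thing without touching application code.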

Retrieval Failure Modes

Failure Mode	Symptom	Confirm It	Fix
Chunk too small	Top results are relevant fragments — the answer requires combining adjacent chunks	Read the top-3 chunks. Can any single chunk answer the question alone? If the answer spans two consecutive chunks, this is your failure.	Increase chunk size (e.g., 512 → 1024 tokens) or use a parent document retriever to return the full parent section when a child chunk matches
Chunk too large	High similarity scores but the answer is buried inside a large, noisy chunk	Print the first 500 chars of the top result. If less than ~20% of the text is directly relevant to the query, the chunk is oversized.	Decrease chunk size; add a contextual compression retriever (LLMChainExtractor) to extract only the relevant sentence(s) from the retrieved chunk
Vocabulary mismatch	Relevant documents exist in the index but vector search misses them entirely	Run BM25 keyword search on the same query. If BM25 returns the correct document and vector search doesn't, you have vocabulary mismatch — not a missing document.	Add hybrid search (BM25 + semantic with RRF fusion); consider query expansion via HyDE or multi-query retriever to bridge the vocabulary gap
Stale index	Retrieved documents contain outdated information despite the source document having been updated	Check indexed_at metadata on top results. If it is older than the source document's updated_at timestamp, the index is stale.	Implement incremental reindexing triggered by document changes (webhook or CDC); add indexed_at metadata and filter by max acceptable staleness
Broken metadata filter	The correct document never surfaces, or retrieved documents are topically relevant but from the wrong tenant, product line, or date range	Remove all metadata filters and re-run the query. If the correct document appears only in the unfiltered results, your filter is over-restricting or contains an inversion bug.	Audit filter logic for off-by-one errors, inverted boolean conditions, or wrong field names; use self-querying retriever for dynamic filter construction from the query
Embedding model mismatch	Consistently low similarity scores (< 0.3) across all queries, even for near-exact match queries	Take the exact text of an indexed chunk, embed it through the query path, and compare it with the stored vector. Cosine similarity should be close to 1.0. If it isn't, your query and index were embedded by different models or model versions.	Always use the same embedding model at index time and query time. If you upgraded the model, a full reindex is required: a partial reindex leaves vectors from two models in one index, and their similarity scores are not comparable
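
The RRF fusion mentioned in the vocabulary mismatch row can be sketched in a few lines. This is an illustrative implementation of reciprocal rank fusion, not the code of any particular library: the function name, the document ids, and the two input rankings are made up for the example, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Documents are returned by descending score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from a keyword index and a vector index.
bm25_ranking = ["doc_kb_42", "doc_kb_7", "doc_kb_13"]
vector_ranking = ["doc_kb_7", "doc_kb_99", "doc_kb_13"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers return rise to the top, while a document found only by BM25 (here `doc_kb_42`) still stays in the candidate list instead of being lost to vocabulary mismatch.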
