Self-Corrective RAG: Grade, Rewrite, Re-Retrieve

Corrective RAG adds document grading and query rewriting to the retrieval loop — if retrieved documents don't answer the question, the system rewrites the query and retrieves again. This article covers when the complexity is justified, the real cost and latency tradeoffs, and how to build, evaluate, and monitor the grading loop in production.

Quick Reference

→ Only add self-corrective RAG if ≥20% of your queries return zero relevant documents in the top 5. Measure first.
→ The grade-rewrite loop adds ~$0.009 and ~2s per query at zero retries, and ~$0.019 and ~5.7s with one retry (sequential grading)
→ Use structured output (GradeDocuments) for consistent yes/no relevance judgments with reasoning
→ Hybrid grading: skip the LLM for scores <0.3 (reject) and >0.8 (accept); grade only the 0.3–0.8 band
→ Set max retries to 2. A third retry almost never finds documents that two failed to retrieve.
→ Build a grading eval set (30+ queries, human-labeled) before deploying. A miscalibrated grader makes things worse.
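
The hybrid grading rule from the list above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes retrieval scores are similarity values normalized to the 0–1 range (higher means more relevant), and `hybrid_grade` and `llm_grader` are hypothetical names introduced here. The grader is passed in as a plain callable so the routing logic can be tested without an LLM.

```python
from typing import Callable

# Thresholds taken from the quick reference; tune them on your own eval set.
REJECT_BELOW = 0.3
ACCEPT_ABOVE = 0.8


def hybrid_grade(
    question: str,
    docs: list[tuple[str, float]],
    llm_grader: Callable[[str, str], bool],
) -> list[str]:
    """Return the documents judged relevant.

    docs is a list of (text, score) pairs. The LLM grader (a hypothetical
    callable returning True for "relevant") is only invoked for scores in
    the uncertain 0.3-0.8 band.
    """
    kept = []
    for text, score in docs:
        if score < REJECT_BELOW:
            continue  # confident reject, no LLM call
        if score > ACCEPT_ABOVE or llm_grader(question, text):
            kept.append(text)  # confident accept, or LLM said yes
    return kept
```

Scores exactly at 0.3 or 0.8 fall into the graded band in this sketch. Whether the boundaries should be inclusive is a calibration decision for your own data.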

When Self-Corrective RAG Is Overkill

Self-corrective RAG solves one specific problem: the retriever returns irrelevant documents because the query was ambiguous, jargon-heavy, or terminologically mismatched with the corpus. If that's not your problem, the grading loop adds cost and latency without improving answers.

Measure before you build

Take 50 real queries from your production traffic. For each, retrieve the top 5 documents and manually check whether at least one is relevant. If fewer than 20% of queries have zero relevant documents in the top 5, self-corrective RAG won't help. Fix your chunking strategy or embedding model first.
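
Once the manual labels exist, the check itself is a one-liner. The sketch below assumes the labels are stored as one list of booleans per query (True means that retrieved document was judged relevant). The function names and the data layout are illustrative choices, not part of any library.

```python
def zero_relevant_rate(labels: list[list[bool]], k: int = 5) -> float:
    """Fraction of queries whose top-k results contain no relevant document."""
    if not labels:
        raise ValueError("need at least one labeled query")
    misses = sum(1 for per_query in labels if not any(per_query[:k]))
    return misses / len(labels)


def should_consider_corrective_rag(
    labels: list[list[bool]], threshold: float = 0.20
) -> bool:
    """Apply the 20% rule of thumb from this article to the labeled sample."""
    return zero_relevant_rate(labels) >= threshold
```

With 50 labeled queries, the threshold is crossed at 10 or more queries with no relevant document in the top 5.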

Your situation	Right fix
Standard RAG returns relevant docs for 90%+ of queries	Don't add self-corrective RAG — you're solving the wrong problem
Retrieval works but answers are hallucinated	Add answer validation or a better generation model — document grading won't help
Relevant docs exist but aren't retrieved — query/corpus terminology mismatch	Self-corrective RAG with query rewriting is the right fix
Latency budget is under 2 seconds	Sequential grading alone takes ~2s — don't add this loop
Small, well-indexed corpus with consistent terminology	Standard RAG is sufficient — the grader will rarely trigger

The case where self-corrective RAG clearly earns its complexity: users write queries using colloquial or domain-inconsistent language ('the thing that makes blood clot' instead of 'coagulation cascade'), and your corpus uses formal terminology. The grader catches the mismatch and the rewriter bridges it.

What the Loop Actually Costs

Before writing any code, understand the cost model. The grading loop calls the LLM once per document — for 5 retrieved documents, that's 5 LLM calls before a single generation happens. The math below uses claude-sonnet-4-6 pricing as of April 2026 ($3/M input, $15/M output) with stated assumptions: 5 documents at ~300 tokens each, ~50-token grading responses, ~1,500-token generation context.
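
The arithmetic behind the grading overhead can be reproduced from the stated assumptions. The following is a rough sketch only: the prices and token counts come from the paragraph above, while `prompt_overhead_tokens` (question plus grading instructions per call) and the rewrite-call token counts are assumptions of this sketch, not figures from the article. The ~1,500-token generation call is left out because it happens with or without the loop.

```python
# Prices from the paragraph above: $3 per million input tokens,
# $15 per million output tokens.
INPUT_USD_PER_TOKEN = 3 / 1_000_000
OUTPUT_USD_PER_TOKEN = 15 / 1_000_000


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single LLM call."""
    return input_tokens * INPUT_USD_PER_TOKEN + output_tokens * OUTPUT_USD_PER_TOKEN


def grading_cost(
    num_docs: int = 5,
    doc_tokens: int = 300,
    response_tokens: int = 50,
    prompt_overhead_tokens: int = 0,  # assumption: question + instructions
) -> float:
    """Cost of one grading pass: one LLM call per retrieved document."""
    per_call = call_cost(doc_tokens + prompt_overhead_tokens, response_tokens)
    return num_docs * per_call


def loop_overhead(
    retries: int = 0,
    rewrite_input_tokens: int = 100,  # hypothetical size of a rewrite prompt
    rewrite_output_tokens: int = 30,  # hypothetical size of a rewritten query
    **grading_kwargs: int,
) -> float:
    """Grading passes (initial + one per retry) plus one rewrite per retry."""
    passes = retries + 1
    rewrite = retries * call_cost(rewrite_input_tokens, rewrite_output_tokens)
    return passes * grading_cost(**grading_kwargs) + rewrite
```

With the defaults, `grading_cost()` comes to $0.00825 per pass. Adding a hypothetical 50 tokens of prompt overhead per call gives $0.009, and `loop_overhead(retries=1, prompt_overhead_tokens=50)` gives $0.01875, both close to the per-query figures quoted in the quick reference. Output tokens cost five times as much as input tokens here, so keeping the grader's reasoning short matters more than trimming the documents.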

The Core Loop: Grade, Rewrite, Re-Retrieve

Retrieve → grade → generate if relevant, rewrite and re-retrieve if not (max 2 retries)
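
The control flow summarized above can be sketched without committing to any particular framework. In this minimal version the retriever, grader, rewriter, and generator are passed in as plain callables. All names (`corrective_rag`, `NO_DOCS_ANSWER`, and the callable signatures) are hypothetical. Two design choices are assumptions of this sketch rather than anything the article specifies: documents are graded against the original question rather than the rewritten query, and a fixed fallback message is returned when every attempt fails.

```python
from typing import Callable

MAX_RETRIES = 2  # from the quick reference: a third retry rarely helps
NO_DOCS_ANSWER = "No relevant documents found."  # hypothetical fallback


def corrective_rag(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], bool],
    rewrite: Callable[[str, str], str],
    generate: Callable[[str, list[str]], str],
    max_retries: int = MAX_RETRIES,
) -> str:
    """Retrieve, grade, and either generate or rewrite and re-retrieve."""
    query = question
    for attempt in range(max_retries + 1):
        docs = retrieve(query)
        # Grade against the original question so a drifting rewrite
        # cannot make off-topic documents look relevant.
        relevant = [doc for doc in docs if grade(question, doc)]
        if relevant:
            return generate(question, relevant)
        if attempt < max_retries:
            # rewrite sees both the original question and the last query tried
            query = rewrite(question, query)
    return NO_DOCS_ANSWER
```

With `max_retries=2` the loop performs at most three retrievals and three grading passes, which bounds the worst-case latency and cost per query.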
