RAG: When to Use It, How to Build It, and How It Breaks
The first RAG decision is whether to use RAG at all — with 200K+ token context windows, it's a choice, not a given. This article covers the RAG-vs-long-context decision framework with cost math, building an indexing and retrieval pipeline, evaluation with concrete thresholds, production failure modes, monitoring, and a production-shaped LangGraph reference implementation.
Quick Reference
- RAG = Retrieve relevant documents, Augment the prompt, Generate a grounded answer. Use it only when long-context won't do
- With 200K+ token context windows, corpora under 50K tokens at low query volume are often cheaper to serve with long-context than with RAG infrastructure
- RAG becomes cost-effective at scale: 500 queries/day on an 80K-token corpus costs ~$3/day in input tokens with RAG vs. ~$120/day with long-context
- Chunk at 500–1000 tokens with 10–20% overlap; embed with text-embedding-3-small ($0.02/1M tokens); store in Chroma (dev) or Pinecone/pgvector (prod)
- Evaluate retrieval (precision@5 > 0.7, recall@10 > 0.8) separately from generation (faithfulness > 0.9); bad retrieval guarantees bad answers
- Monitor retrieval score trend, zero-result rate, P95 latency, and index freshness; RAG fails silently, not with errors
- Add a retrieval quality gate: if no document scores above the similarity threshold, return a fallback instead of generating from irrelevant context
- Start with naive RAG; add hybrid search, reranking, or query rewriting only after evaluation reveals a specific gap
Should I Use RAG at All?
Claude has a 200K-token context window. Gemini has 1M tokens. The assumption that made RAG necessary in 2023 — 'you can't fit your knowledge base in context' — no longer holds for most corpora. The first RAG decision is whether to build RAG infrastructure at all, or whether long-context stuffing is simpler, cheaper, and more accurate for your specific case.
Answer yes to any question → use RAG; answer no to all → evaluate alternatives.
| Factor | Use RAG | Use Long-Context | Why It Matters |
|---|---|---|---|
| Corpus size | > 200K tokens (larger than context window) | Fits in context window | Above the limit, long-context is physically impossible |
| Corpus stability | Changes frequently; re-index incrementally | Stable; no change pipeline needed | Stale context is worse than no context |
| Query volume | > 500/day on same corpus | < 100/day | High volume makes RAG's embed-once/retrieve-many economics win |
| Need source attribution | Chunk metadata gives citations | No natural citation mechanism | Regulated domains require knowing which document answered the query |
| Need full-corpus reasoning | Avoid — RAG returns fragments | Model sees everything | Multi-hop reasoning across the whole corpus requires the whole corpus |
| Latency budget | > 1s acceptable | < 500ms required | Retrieval adds 100–300ms; matters for real-time applications |
| Corpus size 50K–200K tokens | RAG works but may be overkill | Works, higher per-query cost | Run the cost math below — depends on your query volume |
Cost math: 80K-token corpus, 500 queries/day, Claude Sonnet at $3/1M input tokens.
Long-context: 80K tokens × 500 queries = 40M tokens/day × $3/1M = $120/day.
RAG: embed the query (negligible at $0.02/1M tokens) + vector search (negligible) + 4 chunks × 500 tokens × 500 queries = 1M tokens/day × $3/1M = $3/day, plus ~$0.01/day in embedding cost. RAG is ~40× cheaper at this scale.
But for a 5K-token corpus at 20 queries/day, long-context costs only $0.30/day. With the same 4 × 500-token retrieval, RAG costs about $0.12/day, so it adds indexing infrastructure, a reindexing pipeline, and a vector store to save roughly $0.18/day. The math doesn't justify it.
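
The arithmetic above is easy to rerun for your own corpus. The following is a minimal sketch, not a billing tool: the function names are hypothetical, the $3/1M price and the 4 × 500-token retrieval size are the figures used in this article, and it counts input tokens only (output tokens, the question itself, embedding calls, and vector-store hosting are ignored).

```python
"""Back-of-envelope daily input-token cost: long-context vs. RAG."""

# USD per 1M input tokens (the article's Claude Sonnet figure; check current pricing).
PRICE_PER_MTOK = 3.00


def long_context_cost(corpus_tokens: int, queries_per_day: int,
                      price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Every query resends the whole corpus as input."""
    return corpus_tokens * queries_per_day * price_per_mtok / 1_000_000


def rag_cost(queries_per_day: int, chunks_per_query: int = 4,
             tokens_per_chunk: int = 500,
             price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Every query sends only the retrieved chunks as input."""
    tokens_per_query = chunks_per_query * tokens_per_chunk
    return tokens_per_query * queries_per_day * price_per_mtok / 1_000_000


if __name__ == "__main__":
    # The two scenarios from the text: (corpus tokens, queries per day).
    for corpus, qpd in [(80_000, 500), (5_000, 20)]:
        lc = long_context_cost(corpus, qpd)
        rag = rag_cost(qpd)
        print(f"{corpus:>6} tokens, {qpd:>3} q/day: "
              f"long-context ${lc:.2f}/day, RAG ${rag:.2f}/day")
```

Under these assumptions the cost ratio is just corpus size divided by retrieved tokens per query (80,000 / 2,000 = 40) and does not depend on query volume. What volume changes is the absolute dollar saving, which is the number to weigh against the fixed cost of running an index.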
Skip RAG when:
1. Your corpus fits in the context window and query volume is under 100/day. Just include it in the system prompt.
2. You need the model to reason across the entire corpus simultaneously. RAG retrieves fragments, and cross-document reasoning requires the whole picture.
3. Your 'knowledge base' is a single document under 10K tokens. Include it directly.
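
The decision table and the skip conditions can be folded into a small triage helper. This is a rough sketch under stated assumptions: the `Workload` fields and the `recommend` function are hypothetical names, the thresholds are the ones quoted in this section, and the order in which the rules are checked is an assumption of this example rather than something the table prescribes. The latency factor is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    corpus_tokens: int
    queries_per_day: int
    changes_frequently: bool = False
    needs_citations: bool = False
    needs_full_corpus_reasoning: bool = False
    context_window_tokens: int = 200_000  # assumed model limit


def recommend(w: Workload) -> str:
    """Return 'rag', 'long-context', or 'run-cost-math'."""
    # Hard constraint: the corpus does not fit in context at all.
    if w.corpus_tokens > w.context_window_tokens:
        return "rag"
    # Whole-corpus reasoning needs the whole corpus in view.
    if w.needs_full_corpus_reasoning:
        return "long-context"
    # A single small document: include it directly.
    if w.corpus_tokens < 10_000:
        return "long-context"
    # Any one of these factors points to RAG.
    if w.changes_frequently or w.needs_citations or w.queries_per_day > 500:
        return "rag"
    # Fits in context and low volume: put it in the system prompt.
    if w.queries_per_day < 100:
        return "long-context"
    # Middle ground (100-500 queries/day): compare daily costs.
    return "run-cost-math"


if __name__ == "__main__":
    print(recommend(Workload(corpus_tokens=80_000, queries_per_day=1_000)))
    print(recommend(Workload(corpus_tokens=5_000, queries_per_day=20)))
    print(recommend(Workload(corpus_tokens=80_000, queries_per_day=300)))
```

Checking the hard constraint first and the full-corpus-reasoning requirement second reflects the idea that these two are not trade-offs; the remaining factors are preferences that the cost math can override.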