★ Overview · Intermediate · 22 min

RAG: When to Use It, How to Build It, and How It Breaks

The first RAG decision is whether to use RAG at all — with 200K+ token context windows, it's a choice, not a given. This article covers the RAG-vs-long-context decision framework with cost math, building an indexing and retrieval pipeline, evaluation with concrete thresholds, production failure modes, monitoring, and a production-shaped LangGraph reference implementation.

Quick Reference

→RAG = Retrieve relevant documents, Augment the prompt, Generate a grounded answer — but only use it when long-context won't do
→With 200K+ token context windows, corpora under 50K tokens at low query volume are often cheaper with long-context than RAG infrastructure
→RAG becomes cost-effective at scale: 500 queries/day on an 80K-token corpus costs ~$3/day in input tokens with RAG vs. ~$120/day with long-context
→Chunk at 500–1000 tokens with 10–20% overlap; embed with text-embedding-3-small ($0.02/1M tokens); store in Chroma (dev) or Pinecone/pgvector (prod)
→Evaluate retrieval (precision@5 > 0.7, recall@10 > 0.8) separately from generation (faithfulness > 0.9) — bad retrieval guarantees bad answers
→Monitor: retrieval score trend, zero-result rate, P95 latency, index freshness — RAG fails silently, not with errors
→Add a retrieval quality gate: if no document scores above the similarity threshold, return a fallback instead of generating from irrelevant context
→Start with naive RAG; add hybrid search, reranking, or query rewriting only after evaluation reveals a specific gap
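
The retrieval thresholds and the quality gate in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's reference implementation: the function names, the document IDs, and the 0.75 similarity threshold are all hypothetical and would need tuning against your own evaluation set.

```python
from typing import List, Optional, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k slots filled by relevant documents (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def quality_gate(
    scored_chunks: List[Tuple[str, float]],
    threshold: float = 0.75,  # assumed value; calibrate on labeled queries
) -> Optional[List[Tuple[str, float]]]:
    """Keep chunks scoring at or above the threshold; None signals 'use the fallback'."""
    passing = [(text, score) for text, score in scored_chunks if score >= threshold]
    return passing or None


# Hypothetical labeled query: three relevant documents, five retrieved.
retrieved_ids = ["a", "b", "c", "d", "e"]
relevant_ids = {"a", "c", "z"}
print(precision_at_k(retrieved_ids, relevant_ids, 5))       # 0.4
print(round(recall_at_k(retrieved_ids, relevant_ids, 5), 3))  # 0.667

context = quality_gate([("refund policy text", 0.52), ("shipping text", 0.48)])
if context is None:
    print("No sufficiently relevant context; returning fallback answer.")
```

Returning None rather than an empty list makes the fallback path explicit at the call site, so generation is never invoked with irrelevant context by accident.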

Should I Use RAG at All?

RAG is a choice, not a default

Claude has a 200K-token context window. Gemini has 1M tokens. The assumption that made RAG necessary in 2023 — 'you can't fit your knowledge base in context' — no longer holds for most corpora. The first RAG decision is whether to build RAG infrastructure at all, or whether long-context stuffing is simpler, cheaper, and more accurate for your specific case.

Answer yes to any question → use RAG; answer no to all → evaluate alternatives

Factor	Use RAG	Use Long-Context	Why It Matters
Corpus size	> 200K tokens (larger than context window)	Fits in context window	Above the limit, long-context is physically impossible
Corpus stability	Changes frequently; re-index incrementally	Stable; no change pipeline needed	Stale context is worse than no context
Query volume	> 500/day on same corpus	< 100/day	High volume makes RAG's embed-once/retrieve-many economics win
Need source attribution	Chunk metadata gives citations	No natural citation mechanism	Regulated domains require knowing which document answered the query
Need full-corpus reasoning	Avoid — RAG returns fragments	Model sees everything	Multi-hop reasoning across the whole corpus requires the whole corpus
Latency budget	> 1s acceptable	< 500ms required	Retrieval adds 100–300ms; matters for real-time applications
Corpus size 50K–200K tokens	RAG works but may be overkill	Works, higher per-query cost	Run the cost math below — depends on your query volume

Cost math: 80K-token corpus, 500 queries/day, Claude Sonnet at $3/1M input tokens.

Long-context: 80K tokens × 500 queries = 40M tokens/day × $3/1M = $120/day.

RAG: embed the query (negligible at $0.02/1M tokens) + vector search (negligible) + 4 chunks × 500 tokens × 500 queries = 1M tokens/day × $3/1M = $3/day, plus under $0.01/day in embedding cost. RAG is ~40× cheaper at this scale.

But for a 5K-token corpus at 20 queries/day, long-context costs 5K × 20 = 100K tokens/day × $3/1M = $0.30/day. With the same 4 × 500-token retrieval, RAG would cost about $0.12/day, so it adds indexing infrastructure, a reindexing pipeline, and a vector store for roughly $0.18/day in savings. The math doesn't justify it.
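
The arithmetic above is easy to rerun for your own numbers. The sketch below is an assumption-laden simplification: it counts only input tokens at the $3/1M price quoted above, ignores the user's question, output tokens, and infrastructure costs, and the function names are hypothetical.

```python
def daily_input_cost(
    tokens_per_query: int,
    queries_per_day: int,
    usd_per_million_tokens: float = 3.0,  # input price quoted in the cost math
) -> float:
    """Daily LLM input cost in USD for a fixed amount of context per query."""
    return tokens_per_query * queries_per_day * usd_per_million_tokens / 1_000_000


def rag_tokens_per_query(chunks: int = 4, tokens_per_chunk: int = 500) -> int:
    """Context tokens sent per query when only retrieved chunks are included."""
    return chunks * tokens_per_chunk


# 80K-token corpus at 500 queries/day.
long_context = daily_input_cost(80_000, 500)
rag = daily_input_cost(rag_tokens_per_query(), 500)
print(f"long-context: ${long_context:.2f}/day")   # long-context: $120.00/day
print(f"RAG:          ${rag:.2f}/day")            # RAG:          $3.00/day
print(f"ratio:        {long_context / rag:.0f}x") # ratio:        40x

# 5K-token corpus at 20 queries/day, sent whole with every query.
small = daily_input_cost(5_000, 20)
print(f"small corpus, long-context: ${small:.2f}/day")  # small corpus, long-context: $0.30/day
```

Swapping in your own corpus size, query volume, and model price shows quickly which side of the break-even point you are on before any infrastructure is built.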

Three cases where RAG is the wrong answer

1. Your corpus fits in the context window and query volume is under 100/day: just include it in the system prompt.
2. You need the model to reason across the entire corpus simultaneously: RAG retrieves fragments, and cross-document reasoning requires the whole picture.
3. Your 'knowledge base' is a single document under 10K tokens: include it directly.

Building the Indexing Pipeline

Each stage can fail independently · checkpoint allows resume · dead-letter queue (DLQ) prevents pipeline stalls
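
One way to picture those three properties is a small indexing loop that chunks with overlap, records finished documents in a checkpoint file, and diverts failures to a dead-letter file instead of stopping. This is a hedged sketch, not the article's pipeline: it splits on whitespace as a rough stand-in for real token counting, the 500-word size and 50-word overlap are illustrative picks from the 500–1000 / 10–20% ranges, and `embed_and_store` is a hypothetical callback standing in for whatever embedding model and vector store you use.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def chunk_words(words: List[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split a word list into windows of `size` that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # this window reached the end of the text
            break
    return chunks


def index_documents(
    docs: Dict[str, str],
    embed_and_store: Callable[[str, List[str]], None],  # hypothetical callback
    checkpoint_path: Path,
    dlq_path: Path,
) -> Tuple[int, int]:
    """Index docs, skipping checkpointed ones; returns (indexed, failed)."""
    done = set(json.loads(checkpoint_path.read_text())) if checkpoint_path.exists() else set()
    indexed = failed = 0
    for doc_id, text in docs.items():
        if doc_id in done:
            continue  # finished in an earlier run: resume past it
        try:
            chunks = [" ".join(c) for c in chunk_words(text.split())]
            embed_and_store(doc_id, chunks)
        except Exception as exc:
            # Dead-letter the document and keep going so one bad input
            # does not stall the rest of the batch.
            with dlq_path.open("a") as dlq:
                dlq.write(json.dumps({"doc_id": doc_id, "error": str(exc)}) + "\n")
            failed += 1
            continue
        done.add(doc_id)
        checkpoint_path.write_text(json.dumps(sorted(done)))  # checkpoint after each success
        indexed += 1
    return indexed, failed
```

Because failed documents are never added to the checkpoint, rerunning the function retries only them, while the dead-letter file keeps a record of what went wrong and when it needs a human to look.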

Retrieval: Strategies That Actually Compare

Retrieval is where most RAG quality problems originate. This table compares strategies by what matters in production: which specific failure mode they fix, how much latency they add, and what they cost per query. Don't add a layer unless you have that failure mode.
