Intermediate · 14 min

Chunking Strategies

How to choose, tune, and evaluate chunking strategies for RAG. Covers recursive, document-aware, and semantic splitting, plus Contextual Retrieval and Late Chunking, two techniques introduced in 2024 that address the root cause of most retrieval failures.

Quick Reference

→ RecursiveCharacterTextSplitter is the recommended default for unstructured text: by default it tries paragraph breaks first, then line breaks, then spaces, so chunks follow natural text boundaries where possible
→ Measure chunk size in tokens (from_tiktoken_encoder), not characters: LLM context windows and embedding model input limits are defined in tokens
→ Use two-stage splitting for structured docs: split by headers/sections first, then apply recursive splitting to oversized chunks
→ Semantic chunking gives the highest-quality boundaries but adds per-sentence embedding latency; use it for high-value, smaller corpora
→ Contextual Retrieval (Anthropic) prepends LLM-generated context to each chunk before embedding; Anthropic reports about 35% fewer failed retrievals from contextual embeddings alone and about 49% fewer when combined with contextual BM25
→ Late Chunking (Jina AI) runs the full document through a long-context embedding model first, then pools the token embeddings per chunk, so each chunk vector keeps cross-reference context
→ A reasonable starting point is 200–500 tokens per chunk with 10–20% overlap; measure retrieval precision@k on your own queries to confirm
→ Always propagate metadata (source, page, heading) through the splitting pipeline: a chunk without provenance cannot be cited or verified
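The overlap and precision@k figures in the list above can be made concrete with a small sketch. It assumes you already have a ranked list of retrieved chunk IDs and a hand-labeled set of relevant IDs per query; the function names and the sample IDs are hypothetical, not part of any library.

```python
def overlap_tokens(chunk_size: int, overlap_ratio: float) -> int:
    """Convert a relative overlap (e.g. 0.15) into a whole number of tokens."""
    return round(chunk_size * overlap_ratio)


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / k


# Hypothetical retrieval result for one query, best match first.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
# Hypothetical ground-truth labels for the same query.
relevant = {"c2", "c4", "c8"}

print(overlap_tokens(400, 0.15))              # overlap for a 400-token chunk
print(precision_at_k(retrieved, relevant, 5))  # precision over the top 5
```

Averaging `precision_at_k` over a fixed query set, once per candidate chunk size, gives a like-for-like comparison between configurations.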

Why Chunking Matters

Retrieval quality in RAG is bounded by chunk quality. You can use the best vector store, the most expensive reranker, and the largest context window — and still get wrong answers if your chunks split a table in half, break a code block across two chunks, or lose the reference that makes a pronoun meaningful. Chunking is not a preprocessing detail. It is the most impactful parameter in your retrieval pipeline.

Figure: character splitting destroys table and code context. Use structure-aware splitters for structured content.

The diagram above shows the canonical failure: a character-based splitter hits a 1000-character limit in the middle of an API reference table. Chunk 2 gets the data rows without the header row. The embedding model encodes 'GET | /users | Bearer token' with no context about what those columns mean. Any query for authentication endpoints will retrieve chunks that are technically relevant by keyword but semantically broken.
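The failure described above can be reproduced in a few lines of plain Python. This is a minimal sketch: the table contents, the `split_by_characters` helper, and the 60-character limit (standing in for the 1000-character limit) are all invented for illustration.

```python
def split_by_characters(text: str, chunk_size: int) -> list[str]:
    """Naive fixed-width splitter: cuts every chunk_size characters, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical API reference table, invented for this example.
header = "| Method | Path | Auth |"
table = "\n".join([
    header,
    "|--------|------|------|",
    "| GET | /users | Bearer token |",
    "| POST | /users | Bearer token |",
    "| DELETE | /users/{id} | Admin token |",
])

chunks = split_by_characters(table, chunk_size=60)

# Chunks that contain data rows but no header row to explain the columns.
orphaned = [chunk for chunk in chunks if header not in chunk]

for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} (has header: {header in chunk}) ---")
    print(chunk)
```

Only the first chunk carries the header row. Every later chunk holds rows, or fragments of rows, whose columns are unlabeled, which is what the embedding model ends up encoding.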

The root cause of most RAG failures

When a RAG system retrieves chunks that contain the right keywords but not the right context, it often produces plausible-but-wrong answers. Chunking failures are hard to debug because the retrieved chunk looks relevant — the issue is what the chunk is missing, not what it contains.

Choosing a Chunking Strategy

There are four chunking strategies, each optimized for a different tradeoff between quality, speed, and cost. The choice is not about which is 'best' in the abstract — it depends on your document types, corpus size, and how much retrieval quality is worth in your application.

Recursive and Document-Aware Splitting

RecursiveCharacterTextSplitter is the recommended default for unstructured text. It tries to split on the largest semantic boundary first (double newline = paragraph break), then falls back to smaller boundaries if the chunk is still too large. This respects natural text structure while staying within size limits. For structured documents, the two-stage pattern — structure-aware splitter first, then recursive for oversized chunks — is the production standard.
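The fallback logic and the two-stage pattern can be sketched in plain Python. This is an illustration of the idea, not LangChain's implementation. The separator list, the function names, and the sample document are assumptions. Sizes are measured in characters to keep the example dependency-free, whereas a real pipeline would count tokens. Overlap is omitted.

```python
import re

# Coarsest boundary first. This list is an assumption chosen for the sketch.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def recursive_split(text: str, max_len: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator; recurse with finer ones on oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No boundary left to try: hard cut as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while they still fit.
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_len, finer))
    if current.strip():
        chunks.append(current)
    return chunks


def two_stage_split(markdown: str, max_len: int, source: str) -> list[dict]:
    """Stage 1: one section per Markdown heading. Stage 2: recursive split per section."""
    sections, heading, body = [], None, []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line):
            sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body).strip()))
    # Every chunk keeps the source and heading it came from.
    return [
        {"text": chunk, "metadata": {"source": source, "heading": head}}
        for head, text in sections
        for chunk in recursive_split(text, max_len)
    ]


# Hypothetical document, invented for this example.
doc = (
    "# Authentication\n\n"
    "Send a bearer token with every request.\n\n"
    "# Users\n\n"
    "GET /users lists users. POST /users creates one.\n\n"
    "DELETE /users/{id} needs an admin token."
)

for chunk in two_stage_split(doc, max_len=50, source="api.md"):
    print(chunk["metadata"]["heading"], "|", chunk["text"])
```

With these inputs the short "Authentication" section stays whole. The "Users" section exceeds the limit and is split at its paragraph break, and both resulting chunks keep the "Users" heading in their metadata. Separators are dropped at chunk boundaries in this sketch, so a sentence cut on ". " loses its trailing period.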
