Intermediate · 14 min

Chunking Strategies

How to choose, tune, and evaluate chunking strategies for RAG. Covers recursive, document-aware, and semantic splitting, plus Contextual Retrieval and Late Chunking, two techniques introduced in 2024 that target the root cause of most retrieval failures: chunks that lose their surrounding context.

Quick Reference

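  • RecursiveCharacterTextSplitter is the recommended default for unstructured text: by default it tries paragraph breaks first, then line breaks, then spaces, so chunks end at the largest natural boundary that fits
  • Always measure chunk size in tokens (from_tiktoken_encoder), not characters: context windows and embedding-model limits are counted in tokens, and a character budget maps to an unpredictable token count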
  • Use two-stage splitting for structured docs: split by headers/sections first, then recursive for oversized chunks
  • Semantic chunking gives highest-quality boundaries but adds per-sentence embedding latency — use for high-value, smaller corpora
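  • Contextual Retrieval (Anthropic) prepends LLM-generated context to each chunk before embedding and BM25 indexing. Anthropic reports ~35% fewer failed top-20 retrievals with contextual embeddings alone and ~49% fewer when contextual BM25 is added
  • Late Chunking (Jina AI) runs the full document through a long-context embedding model first, then pools the token embeddings per chunk, so each chunk vector keeps cross-reference context automatically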
  • Chunk size sweet spot is typically 200–500 tokens with 10–20% overlap as a starting point — measure retrieval precision@k to confirm
  • Always propagate metadata (source, page, heading) through the splitting pipeline — a chunk without provenance is unreliable
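
The size, overlap, and metadata bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not the LangChain implementation: `Chunk` and `chunk_tokens` are hypothetical names, whitespace-separated words stand in for real tokens (a tokenizer would replace `text.split()`), and 300 tokens with a 45-token overlap (15%) is just one point inside the ranges quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_tokens(text, metadata, chunk_size=300, overlap=45):
    """Split text into fixed-size token windows that overlap.

    Assumption: whitespace-separated words approximate tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for index, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + chunk_size]
        chunks.append(Chunk(
            text=" ".join(window),
            # Copy the parent metadata so every chunk keeps its provenance.
            metadata={**metadata, "chunk_index": index, "start_token": start},
        ))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_tokens(doc, {"source": "guide.md", "heading": "Authentication"})
for c in chunks:
    print(c.metadata["chunk_index"], c.metadata["start_token"], len(c.text.split()))
```

Each chunk repeats the last 45 tokens of its predecessor, and every chunk carries the `source` and `heading` of the parent document. The numbers here are a starting point only; the retrieval metric decides whether they hold for a given corpus.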

Why Chunking Matters

Retrieval quality in RAG is bounded by chunk quality. You can use the best vector store, the most expensive reranker, and the largest context window — and still get wrong answers if your chunks split a table in half, break a code block across two chunks, or lose the reference that makes a pronoun meaningful. Chunking is not a preprocessing detail. It is the most impactful parameter in your retrieval pipeline.

CharacterTextSplitter ✗ (splits every 1000 chars, no structure awareness)

    ## Authentication
    Method | Endpoint | Auth Required
    GET | /users | Bearer token
    POST | /users | Bearer token
    ✗ SPLIT HERE (char 1000): Chunk 1 above, Chunk 2 below
    DELETE | /users/:id | Bearer token
    PATCH | /users/:id | Bearer token

    Chunk 2 has table rows without the header: "Method | Endpoint | Auth Required" is gone.
    No table header → context destroyed.

MarkdownHeaderTextSplitter ✓ (splits at ## boundaries, structure preserved)

    ## Authentication
    ✓ Chunk 1: heading + intro paragraph
    ✓ SPLIT HERE (at ## boundary)
    Method | Endpoint | Auth Required
    GET | /users | Bearer token
    POST | /users | Bearer token
    DELETE | /users/:id | Bearer token
    PATCH | /users/:id | Bearer token

    ✓ Chunk 2 has the complete table: header row + all data rows, context intact.

The same failure applies to code blocks: never let a function definition span two chunks.

Figure: Character splitting destroys table and code context; use structure-aware splitters for structured content.

The diagram above shows the canonical failure: a character-based splitter hits a 1000-character limit in the middle of an API reference table. Chunk 2 gets the data rows without the header row. The embedding model encodes 'GET | /users | Bearer token' with no context about what those columns mean. Any query for authentication endpoints will retrieve chunks that are technically relevant by keyword but semantically broken.
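
The failure described above is easy to reproduce on a toy document. The sketch below uses plain-Python stand-ins rather than the actual CharacterTextSplitter and MarkdownHeaderTextSplitter classes: `split_fixed` and `split_on_headers` are hypothetical helpers, the "## Overview" section is invented for illustration, and the 120-character limit is scaled down from 1000 so the cut lands inside this short table.

```python
import re

HEADER_ROW = "Method | Endpoint | Auth Required"

doc = "\n".join([
    "## Overview",
    "This API manages user accounts.",
    "## Authentication",
    HEADER_ROW,
    "GET | /users | Bearer token",
    "POST | /users | Bearer token",
    "DELETE | /users/:id | Bearer token",
    "PATCH | /users/:id | Bearer token",
])


def split_fixed(text, size):
    """Cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_on_headers(text):
    """Cut only where a line starts with '## ', keeping each section whole."""
    parts = re.split(r"(?m)^(?=## )", text)
    return [p.strip() for p in parts if p.strip()]


fixed = split_fixed(doc, 120)
sections = split_on_headers(doc)

# The second fixed-width chunk holds data rows but has lost the header row.
print(HEADER_ROW in fixed[1], "DELETE | /users/:id" in fixed[1])
# The header-based section keeps the header row together with its data rows.
print(HEADER_ROW in sections[1], "DELETE | /users/:id" in sections[1])
```

The fixed-width split leaves the DELETE and PATCH rows in a chunk that no longer says what the columns mean, while the header-based split keeps the whole table under its heading.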

The root cause of most RAG failures

When a RAG system retrieves chunks that contain the right keywords but not the right context, it often produces plausible-but-wrong answers. Chunking failures are hard to debug because the retrieved chunk looks relevant — the issue is what the chunk is missing, not what it contains.
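
Because the problem is what a chunk is missing, one way to surface it is a check that runs over the chunks before they are embedded. The sketch below is a rough heuristic under stated assumptions, not a general solution: `is_table_row` and `find_headerless_chunks` are hypothetical helpers, table rows are assumed to use " | " separators as in the example above, and the header row is assumed to be known from the source document.

```python
def is_table_row(line):
    """Heuristic: a line with at least two ' | ' separators is a table row."""
    return line.count(" | ") >= 2


def find_headerless_chunks(chunks, header_row):
    """Return indexes of chunks that hold table rows but not the header row."""
    flagged = []
    for index, chunk in enumerate(chunks):
        has_rows = any(is_table_row(line) for line in chunk.splitlines())
        if has_rows and header_row not in chunk:
            flagged.append(index)
    return flagged


header = "Method | Endpoint | Auth Required"
chunks = [
    "## Authentication\n" + header + "\nGET | /users | Bearer token",
    "DELETE | /users/:id | Bearer token\nPATCH | /users/:id | Bearer token",
    "Rate limits apply to every endpoint.",
]
flagged = find_headerless_chunks(chunks, header)
print(flagged)  # [1]
```

A flagged chunk is a candidate for re-splitting at a structural boundary; the check says nothing about chunks that lose context in other ways, such as a pronoun separated from its referent.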