Chunking Strategies
Fixed-size, recursive, semantic, and document-aware chunking strategies. How chunk size affects retrieval quality, and how to choose the right approach for your data.
Quick Reference
- Fixed-size chunking is simple but breaks semantic boundaries — use only for uniform text
- RecursiveCharacterTextSplitter is the best default — it respects paragraph and sentence boundaries
- Semantic chunking uses embeddings to split at meaning boundaries — best quality but slowest
- Chunk size sweet spot is typically 500-1500 characters (~125-375 tokens) with 10-20% overlap
- Too-small chunks lose context; too-large chunks add noise — test with your actual queries
- Always preserve metadata (source, page, heading) through the chunking process
Fixed-Size Chunking
Fixed-size chunking splits text into chunks of exactly N characters (or tokens), regardless of content boundaries. It's the simplest approach and works when your documents are uniform prose without much structure. However, it frequently breaks mid-sentence or mid-paragraph, which destroys semantic coherence and hurts retrieval quality.
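A minimal sketch of fixed-size chunking (the function name and sample text are illustrative, not from any library). Note how the splits land mid-sentence, which is exactly the weakness described above:

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    # Split every chunk_size characters, ignoring content boundaries entirely.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "Retrieval quality depends on chunking. Bad splits hurt recall."
print(fixed_size_chunks(text, 25))
# → ['Retrieval quality depends', ' on chunking. Bad splits ', 'hurt recall.']
```

The second chunk straddles two sentences and strands a fragment of each, so an embedding of it mixes two topics and matches queries about either only weakly.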
Overlap helps preserve context across chunk boundaries, but too much overlap inflates storage and embedding costs. A chunk_overlap of 10-20% of chunk_size is a good sweet spot. At 50% overlap, every character appears in roughly two chunks, so you are essentially doubling storage and embedding costs for diminishing returns.
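The cost argument is easy to see by counting chunks. A hedged sketch (hypothetical helper, assuming a simple sliding-window implementation where each chunk starts `chunk_size - chunk_overlap` characters after the previous one):

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Consecutive chunks share chunk_overlap characters of context.
    stride = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

text = "x" * 10_000

print(len(chunk_with_overlap(text, 1000, 0)))    # no overlap   → 10 chunks
print(len(chunk_with_overlap(text, 1000, 150)))  # 15% overlap  → 12 chunks
print(len(chunk_with_overlap(text, 1000, 500)))  # 50% overlap  → 20 chunks
```

At 15% overlap you embed and store 20% more chunks than the no-overlap baseline; at 50% you embed and store exactly twice as many, which is the "doubling" cost noted above.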