
Chunking Strategies

Fixed-size, recursive, semantic, and document-aware chunking strategies. How chunk size affects retrieval quality, and how to choose the right approach for your data.

Quick Reference

  • Fixed-size chunking is simple but breaks semantic boundaries — use only for uniform text
  • RecursiveCharacterTextSplitter is the best default — it respects paragraph and sentence boundaries
  • Semantic chunking uses embeddings to split at meaning boundaries — best quality but slowest
  • Chunk size sweet spot is typically 500-1500 characters (~125-375 tokens) with 10-20% overlap
  • Too-small chunks lose context; too-large chunks add noise — test with your actual queries
  • Always preserve metadata (source, page, heading) through the chunking process

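To make the "best default" bullet concrete, here is a minimal, illustrative sketch of the idea behind a recursive character splitter: try coarse separators (paragraphs) first, and fall back to finer ones (lines, sentences, words) only when a piece is still over the size limit. This is a simplified stand-in, not LangChain's actual RecursiveCharacterTextSplitter implementation; the function name and separator list are assumptions for illustration.

```python
# Illustrative sketch of recursive splitting (not LangChain's implementation).
# Coarse separators are tried first; finer ones only when a piece is too big.
SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraph > line > sentence > word

def recursive_split(text, chunk_size, separators=SEPARATORS):
    if len(text) <= chunk_size:
        return [text]
    for idx, sep in enumerate(separators):
        if sep not in text:
            continue  # fall through to a finer separator
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = current + sep + piece if current else piece
            if len(candidate) <= chunk_size:
                current = candidate  # keep packing pieces into this chunk
            else:
                if current:
                    chunks.append(current)
                if len(piece) > chunk_size:
                    # Piece alone exceeds the limit: recurse with finer separators
                    chunks.extend(recursive_split(piece, chunk_size, separators[idx + 1:]))
                    current = ""
                else:
                    current = piece
        if current:
            chunks.append(current)
        return chunks
    # No separator left: hard-cut as a last resort
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Because splits happen at the coarsest boundary that fits, chunks tend to align with whole paragraphs or sentences rather than cutting mid-thought.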
Fixed-Size Chunking

Fixed-size chunking splits text into chunks of exactly N characters (or tokens), regardless of content boundaries. It's the simplest approach and works when your documents are uniform prose without much structure. However, it frequently breaks mid-sentence or mid-paragraph, which destroys semantic coherence and hurts retrieval quality.

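A minimal sketch of fixed-size chunking with a sliding window (the function name and defaults here are illustrative, not from any particular library):

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Slide a chunk_size-character window over the text,
    advancing by chunk_size - overlap each step."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Note that the slice boundaries fall at arbitrary character offsets, which is exactly why this approach routinely cuts sentences in half.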
The overlap trap

Overlap helps preserve context across chunk boundaries, but too much overlap wastes storage and embedding costs. A chunk_overlap of 10-20% of chunk_size is the sweet spot. At 50% overlap you're essentially doubling your storage and embedding costs for diminishing returns.
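The cost argument follows from simple arithmetic: a sliding window advances by chunk_size minus overlap, so chunk count (and hence storage and embedding cost) scales with 1 / (1 - overlap_ratio). A quick back-of-the-envelope sketch (the function name and the 1 MB corpus figure are illustrative assumptions):

```python
import math

def chunk_count(doc_len, chunk_size, overlap_ratio):
    """Approximate chunk count for a sliding window:
    the window advances by chunk_size * (1 - overlap_ratio) per step."""
    step = int(chunk_size * (1 - overlap_ratio))
    return math.ceil(doc_len / step)

# Hypothetical 1 MB corpus, 1000-character chunks:
print(chunk_count(1_000_000, 1000, 0.0))   # 1000 chunks (no overlap)
print(chunk_count(1_000_000, 1000, 0.15))  # 1177 chunks: ~18% more cost
print(chunk_count(1_000_000, 1000, 0.5))   # 2000 chunks: double the embeddings
```

At 15% overlap you pay roughly a fifth more in storage and embedding calls; at 50% you pay double, which is the "diminishing returns" tradeoff described above.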