Beginner · 8 min
Text Splitters
RecursiveCharacterTextSplitter, chunk_size, chunk_overlap, splitting strategies for different content types.
Quick Reference
- RecursiveCharacterTextSplitter is the default choice: it tries to split on paragraph breaks first, then line breaks, then spaces, and finally individual characters
- chunk_size sets the maximum chunk length, measured in characters by default (typically 500-1500)
- chunk_overlap ensures context continuity between adjacent chunks (typically 10-20% of chunk_size)
- Specialized splitters exist for code (by language), Markdown (by headers), and HTML (by tags)
- split_documents() preserves metadata from the original Document objects
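To make the "coarsest separator first" idea concrete, here is a minimal pure-Python sketch of recursive splitting. It is an illustration only, not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores chunk_overlap, and it measures length in characters.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Hypothetical helper: split on the coarsest separator that works,
    recursing into finer separators only for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or the empty-string fallback): hard cut by length.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        # Try to pack this piece into the chunk being built.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: flush what we have so far.
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long on its own: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

Short paragraphs survive intact, and only the long middle paragraph falls through to word-level splitting. That is the behavior that makes this strategy a reasonable default for prose.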
Why Split?
LLMs have finite context windows, and vector stores retrieve best from focused chunks. Splitting documents into right-sized pieces is critical for retrieval quality: it determines whether your RAG pipeline returns precise, relevant answers or vague, diluted ones.
Chunk size sweet spot
The sweet spot is 500-1500 characters per chunk (the default unit for chunk_size) with 10-20% overlap. Chunks that are too small fragment the context, so individual chunks lack enough information to be useful. Chunks that are too large dilute relevance, burying the matching signal in unrelated text.
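The overlap arithmetic can be sketched with a plain sliding window. The helpers below (`overlap_from_ratio`, `sliding_chunks`) are hypothetical names for illustration. They cut at fixed character offsets, whereas a separator-aware splitter merges pieces at natural boundaries, so its actual overlap will vary from chunk to chunk.

```python
from typing import List


def overlap_from_ratio(chunk_size: int, ratio: float = 0.15) -> int:
    """Hypothetical helper: turn an overlap ratio (e.g. 0.10-0.20) into a size."""
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in [0, 1)")
    return round(chunk_size * ratio)


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Fixed-size windows where each chunk repeats the last
    chunk_overlap characters of the previous one."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not just a repeat of the overlap.
    last_start_bound = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start_bound, step)]


size = 1000
overlap = overlap_from_ratio(size, 0.15)  # 15% of chunk_size
print(size, overlap)

# Tiny example so the shared characters are easy to see.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

In the tiny example each chunk starts with the final character of the previous chunk. At realistic sizes, the same mechanism carries a sentence or two of context across the boundary.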