Beginner · 8 min

Text Splitters

RecursiveCharacterTextSplitter, chunk_size, chunk_overlap, splitting strategies for different content types.

Quick Reference

  • RecursiveCharacterTextSplitter is the recommended general-purpose splitter: it tries paragraph breaks first, then line breaks, then spaces, then individual characters
  • chunk_size controls the maximum characters per chunk (typically 500-1500)
  • chunk_overlap ensures context continuity between adjacent chunks (typically 10-20% of chunk_size)
  • Specialized splitters exist for code (by language), Markdown (by headers), and HTML (by tags)
  • split_documents() preserves metadata from the original Document objects

Why Split?

LLMs have limited context windows, and vector stores retrieve best when each chunk covers one focused topic. Splitting documents into right-sized pieces is therefore critical for retrieval quality: it determines whether your RAG pipeline returns precise, relevant answers or vague, diluted ones.

Chunk size sweet spot

The sweet spot is 500-1500 characters per chunk (chunk_size counts characters by default, not tokens) with 10-20% overlap. Too small = fragmented context where individual chunks lack enough information to be useful. Too large = diluted relevance where the matching signal is buried in unrelated text.
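
To make the trade-off concrete, here is a rough sketch of how chunk size and overlap interact. It uses a naive fixed-width splitter written for this illustration (the sliding_window_chunks helper is hypothetical and is not how RecursiveCharacterTextSplitter works internally); it only shows the arithmetic: each chunk starts chunk_size minus chunk_overlap characters after the previous one.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-width splitter, for illustration only (not LangChain's algorithm)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


# Stand-in for a 10,000-character document.
text = "".join(str(i % 10) for i in range(10_000))

for size in (200, 1000, 4000):
    overlap = size * 15 // 100  # 15% overlap, inside the 10-20% range
    chunks = sliding_window_chunks(text, size, overlap)
    print(f"chunk_size={size} chunk_overlap={overlap} chunks={len(chunks)}")
```

Small chunks produce many short fragments, while very large chunks produce only a handful of pieces that each mix a lot of unrelated text. The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary appears whole in at least one chunk, provided it is shorter than the overlap.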