Text Splitters

LangChain's text splitter API: when to split, which splitter to choose, token-based production splitting, metadata propagation, and the three failure modes that destroy RAG quality.

Quick Reference

→RecursiveCharacterTextSplitter is the default: it tries paragraph breaks, then line breaks, then spaces between words, then individual characters
→Use .from_tiktoken_encoder(chunk_size=256) in production — 1000 chars can be 200–400 tokens
→MarkdownHeaderTextSplitter preserves header hierarchy as metadata — chain with Recursive for oversized sections
→RecursiveCharacterTextSplitter.from_language(Language.PYTHON) splits on function/class boundaries first
→SemanticChunker (langchain_experimental) splits at meaning shifts but embeds every sentence at ingest
→split_documents() preserves metadata; split_text() discards it — never mix them in a pipeline
→Two-stage splitting: structure-aware splitter first, then Recursive for any chunk still over the limit
→RecursiveJsonSplitter keeps nested objects intact across chunk boundaries

When Not to Split

Before picking a splitter, ask whether you need to split at all. Splitting has a cost: chunks lose surrounding context, metadata has to be carried onto every chunk, and retrieval can return half an answer. Skip splitting when your documents are already short (under 512 tokens), when you're using a long-context model that can take the full document, when your data is pre-structured (each API record is already one logical unit), or when you're building a summarization pipeline that needs the full document intact.

Measure first

Run your document corpus through a tokenizer before choosing a chunk size. If 80% of your documents are under 500 tokens, splitting may only hurt retrieval quality. The right question is: does chunking improve or degrade my retrieval scores?
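As a rough sketch of that measurement, the snippet below computes the share of documents that fall under a token limit. The whitespace word count is only a stand-in tokenizer, and the names share_under_limit, whitespace_token_count, and corpus are hypothetical. In practice you would pass in a counter based on your model's real tokenizer (for example, the length of a tiktoken encoding's encode() output).

```python
from typing import Callable, Iterable


def whitespace_token_count(text: str) -> int:
    # Stand-in tokenizer (assumption): counts whitespace-separated words.
    # Swap in your model's tokenizer for real measurements.
    return len(text.split())


def share_under_limit(
    docs: Iterable[str],
    limit: int = 500,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> float:
    """Return the fraction of documents with fewer than `limit` tokens."""
    counts = [count_tokens(doc) for doc in docs]
    if not counts:
        return 0.0
    return sum(count < limit for count in counts) / len(counts)


# Hypothetical corpus: three short documents and one long one.
corpus = [
    "short note " * 10,    # 20 words
    "long report " * 400,  # 800 words
    "memo " * 50,          # 50 words
    "faq entry " * 30,     # 60 words
]

share = share_under_limit(corpus, limit=500)
print(f"{share:.0%} of documents are under 500 tokens")
```

If most of the corpus is already under the limit, it is worth comparing retrieval scores with and without chunking before committing to a splitter.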

Choosing a Splitter

The splitter you choose determines whether your chunks are semantically coherent or structurally broken. Content type is the primary selection criterion — not chunk size, which you tune after picking the right splitter.

RecursiveCharacterTextSplitter

The recursive splitter tries the widest separator first (paragraph breaks) and falls back to narrower ones (line breaks, then spaces between words, then individual characters) until each chunk fits within chunk_size. This keeps logical units together: a paragraph stays intact if it fits, and is broken at finer boundaries only when it is too long. In production, prefer the token-based variant: by default chunk_size counts characters, and the number of tokens per character varies across languages and code.
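The fallback behavior can be illustrated with a simplified pure-Python sketch. This is not LangChain's implementation: it measures length in characters, drops the separators it splits on, and has no chunk overlap. The function name recursive_split and the sample text are hypothetical.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text, trying the widest separator first (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Last resort: no separators left (or the empty one), so cut by length.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, narrower = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with narrower separators.
            chunks.extend(recursive_split(piece, chunk_size, narrower))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer and will not fit in one chunk.\n\n"
    "Third."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

In this sample, the first and third paragraphs survive as whole chunks, and only the oversized second paragraph is broken at word boundaries.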
