Text Splitters
LangChain's text splitter API: when to split, which splitter to choose, token-based production splitting, metadata propagation, and the three failure modes that destroy RAG quality.
Quick Reference
- RecursiveCharacterTextSplitter is the default: it tries paragraph breaks, then line breaks, then spaces, then individual characters
- Use .from_tiktoken_encoder(chunk_size=256) in production: chunk_size otherwise counts characters, and 1000 chars can be 200–400 tokens
- MarkdownHeaderTextSplitter preserves header hierarchy as metadata; chain with Recursive for oversized sections
- RecursiveCharacterTextSplitter.from_language(Language.PYTHON) splits on function/class boundaries first
- SemanticChunker (langchain_experimental) splits at meaning shifts but embeds every sentence at ingest
- split_documents() preserves metadata; split_text() takes and returns plain strings, so metadata is lost. Never mix them in a pipeline
- Two-stage splitting: structure-aware splitter first, then Recursive for any chunk still over the limit
- RecursiveJsonSplitter keeps nested objects whole where it can, splitting them only when needed to respect the size limit
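The separator fallback behind the recursive splitter can be illustrated with a small standalone sketch. This is not LangChain's implementation: the function name `recursive_split` is hypothetical, it measures length in characters, and it omits chunk overlap and cross-level merging. It assumes the separator order `"\n\n"`, `"\n"`, `" "`, `""`, and only shows the idea of trying the coarsest separator first and falling back to finer ones for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Simplified illustration only: coarse separators are tried first, and any
    piece that is still too long is re-split with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or the empty separator): hard cut by character count.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep packing pieces into the current chunk while they fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: fall back to finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Intro paragraph.\n\n"
    "Second paragraph that is quite a bit longer than the first one.\n\n"
    "End."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short paragraphs survive as whole chunks, while the long middle paragraph is broken at word boundaries rather than mid-word. The real splitter additionally merges small neighbors and applies overlap, so its output on the same input may differ.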
When Not to Split
Before picking a splitter, ask whether you need to split at all. Splitting has a cost: chunks lose surrounding context, metadata must be propagated manually, and retrieval can return half an answer. Skip splitting when your documents are already short (under 512 tokens), when you're using a long-context model that can take the full document, when your data is pre-structured (each API record is already one logical unit), or when you're building a summarization pipeline that needs the full document intact.
Run your document corpus through a tokenizer before choosing a chunk size. If 80% of your documents are under 500 tokens, splitting may only hurt retrieval quality. The right question is: does chunking improve or degrade my retrieval scores?
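A corpus check along these lines might look like the following sketch. The helper names (`short_doc_fraction`, `whitespace_token_count`) and the sample corpus are hypothetical, and the whitespace counter is only a stand-in so the example runs without dependencies; for real measurements, pass a counter backed by your model's tokenizer.

```python
from typing import Callable, Iterable


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not model tokens."""
    return len(text.split())


def short_doc_fraction(
    docs: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    threshold: int = 500,
) -> float:
    """Return the fraction of documents with fewer than `threshold` tokens."""
    docs = list(docs)
    if not docs:
        return 0.0
    short = sum(1 for doc in docs if count_tokens(doc) < threshold)
    return short / len(docs)


# Hypothetical corpus: four short notes and one long document.
corpus = ["short note"] * 4 + ["word " * 800]
fraction = short_doc_fraction(corpus, threshold=500)
print(f"{fraction:.0%} of documents are under 500 tokens")
```

With tiktoken installed, a real counter could be `lambda t: len(enc.encode(t))` where `enc = tiktoken.get_encoding("cl100k_base")`; which encoding matches your embedding model is something to confirm for your own stack. A high fraction suggests indexing documents whole and comparing retrieval scores against a chunked baseline before committing to a splitter.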