Intermediate · 14 min

Text Splitters

LangChain's text splitter API: when to split, which splitter to choose, token-based production splitting, metadata propagation, and the three failure modes that destroy RAG quality.

Quick Reference

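  • RecursiveCharacterTextSplitter is the default: it tries paragraph breaks ("\n\n") first, then line breaks, then spaces, then individual characters
  • Use RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=256) in production so chunk_size counts tokens instead of characters; 1000 characters can be anywhere from 200 to 400 tokens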
  • MarkdownHeaderTextSplitter preserves header hierarchy as metadata — chain with Recursive for oversized sections
  • RecursiveCharacterTextSplitter.from_language(Language.PYTHON) splits on function/class boundaries first
  • SemanticChunker (langchain_experimental) splits at meaning shifts but embeds every sentence at ingest
  • split_documents() preserves metadata; split_text() discards it — never mix them in a pipeline
  • Two-stage splitting: structure-aware splitter first, then Recursive for any chunk still over the limit
  • RecursiveJsonSplitter tries to keep nested objects whole and splits them only when they exceed the maximum chunk size
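
To make the separator fallback behind the default splitter concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not LangChain's implementation: the function name recursive_split is hypothetical, lengths are measured in characters, and chunk overlap is left out.

```python
from typing import List, Sequence

# Coarse-to-fine separators: paragraphs, lines, words, then single characters.
DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = DEFAULT_SEPARATORS) -> List[str]:
    """Split text into pieces of at most chunk_size characters.

    Simplified illustration: no overlap, no custom length function.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = ("para one is short.\n\n"
          "This second paragraph is quite a bit longer than the limit.\n\n"
          "end")
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

The short paragraphs survive whole, and only the oversized one is broken at word boundaries. The two-stage pattern in the list above follows the same principle: split on the coarsest structure you have, and fall back to finer cuts only for pieces that are still over the limit.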

When Not to Split

Before picking a splitter, ask whether you need to split at all. Splitting has a cost: chunks lose their surrounding context, metadata has to be carried onto every chunk, and retrieval can return half an answer. Skip splitting when your documents are already short (under 512 tokens), when you're using a long-context model that can take the full document, when your data is pre-structured (each API record is already one logical unit), or when you're building a summarization pipeline that needs the full document intact.
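
One way to apply the "already short" rule is a small guard in front of whatever splitter you use. The sketch below is an assumption-laden illustration: split_if_needed, count_words, and by_paragraph are hypothetical helpers, and whitespace word counts stand in for real model tokens.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    # Assumption: whitespace-separated words as a rough stand-in for tokens.
    return len(text.split())


def split_if_needed(text: str,
                    split_fn: Callable[[str], List[str]],
                    max_tokens: int = 512,
                    count_tokens: Callable[[str], int] = count_words) -> List[str]:
    """Return the document whole when it is under the budget, else call split_fn."""
    if count_tokens(text) < max_tokens:
        return [text]
    return split_fn(text)


def by_paragraph(text: str) -> List[str]:
    # Hypothetical splitter used only for this demo.
    return [p for p in text.split("\n\n") if p.strip()]


short_doc = "A single API record that is already one logical unit."
long_doc = "\n\n".join(["word " * 400, "word " * 400])

print(len(split_if_needed(short_doc, by_paragraph)))  # short document stays whole
print(len(split_if_needed(long_doc, by_paragraph)))   # long document is split
```

In a real pipeline, split_fn would be the splitter's own method and count_tokens the tokenizer that matches your embedding model.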

Measure first

Run your document corpus through a tokenizer before choosing a chunk size. If 80% of your documents are under 500 tokens, splitting may only hurt retrieval quality. The right question is: does chunking improve or degrade my retrieval scores?
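
A rough sketch of that measurement, assuming a pluggable token counter: fraction_under_limit and whitespace_token_count are hypothetical names, and the whitespace counter is only a placeholder for the tokenizer your model actually uses.

```python
from typing import Callable, Iterable


def whitespace_token_count(text: str) -> int:
    """Placeholder tokenizer (assumption): counts whitespace-separated words.

    Swap in your model's real tokenizer for meaningful numbers.
    """
    return len(text.split())


def fraction_under_limit(docs: Iterable[str], limit: int,
                         count_tokens: Callable[[str], int] = whitespace_token_count
                         ) -> float:
    """Return the share of documents whose token count is below limit."""
    counts = [count_tokens(doc) for doc in docs]
    if not counts:
        return 0.0
    return sum(c < limit for c in counts) / len(counts)


# Toy corpus: two short documents and one long one.
corpus = ["short doc", "another short one", "word " * 600]
share = fraction_under_limit(corpus, limit=500)
print(f"{share:.0%} of documents are under 500 tokens")
```

If the share is high, compare retrieval scores with and without chunking on a held-out set of queries before committing to a splitter.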