Document Loaders

How to get text out of PDFs, web pages, Notion, and 200+ other sources — and into your RAG pipeline. Covers loader selection, memory-safe loading, metadata strategy, failure modes, and the production pipeline pattern.

Quick Reference

  • DocumentLoader's job is one thing: extract text + metadata from a source, return Document objects
  • lazy_load() streams one Document at a time, so use it in production; load() builds the full list in memory and can exhaust RAM on large corpora
  • alazy_load() is the async streaming variant — use it in async applications instead of aload()
  • Each Document: page_content (str) + metadata (dict with source, page, etc.)
  • Metadata set at load time travels through split → embed → retrieval; adding it after chunks are indexed usually means re-ingesting the source
  • WebBaseLoader is deprecated (v0.3.14) — use FireCrawlLoader or SpiderLoader for web scraping
  • RetrievalQA is deprecated (v0.1.17) — use create_retrieval_chain + create_stuff_documents_chain
  • When no community loader fits, subclass BaseLoader and implement lazy_load() — that's all it takes
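
The last point can be illustrated with a minimal sketch of a custom loader. It assumes `langchain_core` exposes `BaseLoader` and `Document` at the import paths shown; the `ParagraphFileLoader` class, its `team` argument, and the `paragraph` metadata key are hypothetical names chosen for illustration. The fallback classes exist only so the sketch runs without LangChain installed and are not part of the library.

```python
from pathlib import Path
from typing import Iterator, List

try:
    from langchain_core.document_loaders import BaseLoader
    from langchain_core.documents import Document
except ImportError:
    # Stand-ins (assumption: illustration only) mirroring the two
    # attributes and the load()/lazy_load() relationship used below.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

    class BaseLoader:
        def lazy_load(self):
            raise NotImplementedError

        def load(self):
            return list(self.lazy_load())


class ParagraphFileLoader(BaseLoader):
    """Hypothetical loader: one Document per blank-line-separated paragraph."""

    def __init__(self, path: str, team: str) -> None:
        self.path = Path(path)
        self.team = team  # metadata injected at load time

    def lazy_load(self) -> Iterator[Document]:
        buffer: List[str] = []
        index = 0
        # Read line by line so only one paragraph is held in memory.
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    buffer.append(line.strip())
                elif buffer:
                    yield self._make_doc(buffer, index)
                    buffer, index = [], index + 1
        if buffer:  # flush the final paragraph
            yield self._make_doc(buffer, index)

    def _make_doc(self, lines: List[str], index: int) -> Document:
        return Document(
            page_content=" ".join(lines),
            metadata={
                "source": str(self.path),
                "paragraph": index,
                "team": self.team,
            },
        )
```

Iterating `ParagraphFileLoader("notes.txt", team="docs").lazy_load()` yields documents one at a time, while calling `load()` on the same object materialises the whole list, which is the memory trade-off described above.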

When You Don't Need a Document Loader

Before reaching for a loader, check whether you actually need one. A loader exists to extract text from an unstructured or semi-structured source — if the text is already extracted and accessible via a normal API or database query, a loader adds friction without value.

| Your data lives in… | Need a loader? | Why |
|---|---|---|
| PDFs, Word docs, PPTs, HTML files | Yes | Text is locked in binary or markup; there is no other way to extract it |
| A SQL table with text columns | No | Query directly: SELECT content FROM docs WHERE … then pass strings to the vector store |
| A SQL table with structured rows (e.g. orders, products) | Situational | SQLDatabaseLoader converts rows → text for semantic search; skip it if the data is better queried with SQL filters |
| A REST API with a known JSON schema | No | Call the API, extract the fields you care about, format them as strings; no loader needed |
| Notion, Confluence, Slack, Google Drive | Yes | Community loaders handle OAuth, pagination, and extraction; reinventing this is a waste of a day |
| Plain text already in S3 or a bucket | Sometimes | If text is pre-extracted, read it directly. Use S3FileLoader only if you need directory traversal and metadata injection in one step |

The skip-loader test

If you can write a SQL query or API call that returns the text you need as a string in under five minutes, skip the loader. DocumentLoader earns its keep when extraction is the hard part — binary formats, auth flows, pagination, multi-page documents.
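
As a rough illustration of the skip-loader path, the sketch below pulls text straight from a SQL table and shapes it for a vector store. The `docs` table and `content` column follow the example query in the table above; the `fetch_texts` helper, the `id` column, the sample rows, and the `docs:<id>` source format are assumptions made for this example, and an in-memory SQLite database stands in for a real one.

```python
import sqlite3
from typing import Dict, List, Tuple


def fetch_texts(
    conn: sqlite3.Connection, min_length: int = 1
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return parallel lists of texts and metadata dicts, no loader involved."""
    rows = conn.execute(
        "SELECT id, content FROM docs WHERE length(content) >= ? ORDER BY id",
        (min_length,),
    ).fetchall()
    texts = [content for _, content in rows]
    # Build metadata here, at read time, so it can travel with each text.
    metadatas = [{"source": f"docs:{row_id}"} for row_id, _ in rows]
    return texts, metadatas


# Demo data (hypothetical) in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO docs (content) VALUES (?)",
    [("Refund policy: 30 days.",), ("Shipping takes 3-5 days.",)],
)

texts, metadatas = fetch_texts(conn)
```

The two lists have the shape that LangChain vector stores generally accept through `add_texts(texts, metadatas=...)`, so the strings can be embedded without ever constructing a loader.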