Document Loaders
How to get text out of PDFs, web pages, Notion, and 200+ other sources — and into your RAG pipeline. Covers loader selection, memory-safe loading, metadata strategy, failure modes, and the production pipeline pattern.
Quick Reference
- A document loader has one job: extract text and metadata from a source and return Document objects
- lazy_load() yields one Document at a time, so use it in production; load() builds the whole list in memory and can exhaust RAM on large corpora
- alazy_load() is the async streaming variant; use it in async applications instead of aload()
- Each Document has page_content (str) and metadata (dict with source, page, etc.)
- Metadata set at load time travels through split → embed → retrieval; adding it after indexing typically means updating or re-embedding every stored chunk
- WebBaseLoader is deprecated (v0.3.14); use FireCrawlLoader or SpiderLoader for web scraping
- RetrievalQA is deprecated (v0.1.17); use create_retrieval_chain + create_stuff_documents_chain
- When no community loader fits, subclass BaseLoader and implement lazy_load(); load() is inherited from the base class, so that's all it takes
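As a rough illustration of the last point, here is a minimal sketch of a custom loader. `ParagraphFileLoader` and its `paragraph` metadata key are hypothetical names invented for this example, not part of LangChain. The import assumes the `langchain_core` package layout; if it is not installed, the sketch falls back to small stand-in classes so it still runs.

```python
from pathlib import Path
from typing import Iterator

try:
    from langchain_core.document_loaders import BaseLoader
    from langchain_core.documents import Document
except ImportError:
    # Stand-ins (assumption: only used when langchain_core is missing).
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

    class BaseLoader:
        def lazy_load(self) -> Iterator[Document]:
            raise NotImplementedError

        def load(self) -> list:
            # Eager variant: materializes everything lazy_load() yields.
            return list(self.lazy_load())


class ParagraphFileLoader(BaseLoader):
    """Hypothetical loader: one Document per blank-line-separated paragraph."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def lazy_load(self) -> Iterator[Document]:
        buffer = []
        index = 0
        # Read line by line so only one paragraph is held in memory at a time.
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    buffer.append(line.rstrip("\n"))
                    continue
                if buffer:
                    yield self._make_doc(buffer, index)
                    index += 1
                    buffer = []
        if buffer:
            yield self._make_doc(buffer, index)

    def _make_doc(self, lines, index) -> Document:
        # Metadata is attached here, at load time, so it follows the text downstream.
        return Document(
            page_content="\n".join(lines),
            metadata={"source": str(self.path), "paragraph": index},
        )
```

Iterating over `ParagraphFileLoader("notes.txt").lazy_load()` (the file name is a placeholder) processes one paragraph at a time, while `.load()` returns the full list and is only reasonable for small files.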
When You Don't Need a Document Loader
Before reaching for a loader, check whether you actually need one. A loader exists to extract text from an unstructured or semi-structured source — if the text is already extracted and accessible via a normal API or database query, a loader adds friction without value.
| Your data lives in… | Need a loader? | Why |
|---|---|---|
| PDFs, Word docs, PPTs, HTML files | Yes | Text is locked in binary or markup — no other way to extract it |
| A SQL table with text columns | No | Query directly: SELECT content FROM docs WHERE … then pass strings to the vector store |
| A SQL table with structured rows (e.g. orders, products) | Situational | SQLDatabaseLoader converts rows → text for semantic search; skip it if the data is better queried with SQL filters |
| A REST API with a known JSON schema | No | Call the API, extract the fields you care about, format them as strings — no loader needed |
| Notion, Confluence, Slack, Google Drive | Yes | Community loaders handle OAuth, pagination, and extraction — reinventing this is a waste of a day |
| Plain text already in S3 or another bucket | Situational | If the text is pre-extracted, read it directly. Use S3DirectoryLoader only if you need prefix traversal and metadata injection in one step (S3FileLoader loads a single object) |
If you can write a SQL query or API call that returns the text you need as a string in under five minutes, skip the loader. A loader earns its keep when extraction is the hard part: binary formats, auth flows, pagination, multi-page documents.
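The "skip the loader" path for a SQL text column might look like the sketch below. It uses the standard-library `sqlite3` module with an in-memory database as a stand-in for your real database; the `docs` table and `content` column follow the query in the table above, while `fetch_texts`, the `min_length` filter, and the `docs:<id>` source format are assumptions made up for illustration.

```python
import sqlite3


def fetch_texts(conn, min_length=1):
    """Return (texts, metadatas) straight from a SQL query, no loader involved."""
    rows = conn.execute(
        "SELECT id, content FROM docs WHERE length(content) >= ? ORDER BY id",
        (min_length,),
    ).fetchall()
    texts = [content for _row_id, content in rows]
    # Build metadata alongside the text so each string keeps a traceable source.
    metadatas = [{"source": f"docs:{row_id}"} for row_id, _content in rows]
    return texts, metadatas


# In-memory database standing in for a real one (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO docs (content) VALUES (?)",
    [("Refund policy: 30 days.",), ("Shipping takes 3-5 days.",)],
)

texts, metadatas = fetch_texts(conn)
```

The two parallel lists are in the shape that LangChain vector stores generally accept through `add_texts(texts, metadatas=metadatas)`, so no Document objects or loader classes are needed for this case.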