Document Loaders
Loading data from PDFs, CSVs, Notion, Slack, Google Drive, and web pages. Covers the DocumentLoader interface, lazy_load(), and aload().
Quick Reference
- DocumentLoader is the base interface — every loader implements .load() and .lazy_load()
- lazy_load() yields documents one at a time — essential for large datasets that don't fit in memory
- aload() is the async variant for non-blocking I/O in async pipelines
- Each Document has page_content (str) and metadata (dict) with source info
- Community loaders cover 100+ sources: PDF, CSV, Notion, Slack, Google Drive, web scraping
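The load()/lazy_load() contract above can be sketched in a few lines. This is a stdlib-only toy, not the real library classes: Document here is a stand-in dataclass, and TextFileLoader is a made-up loader that emits one Document per line of a text file.

```python
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class TextFileLoader:
    """Toy loader: one Document per line of a text file."""
    def __init__(self, path: str):
        self.path = path

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time; the whole file is never held in memory
        with open(self.path) as f:
            for i, line in enumerate(f):
                yield Document(page_content=line.rstrip("\n"),
                               metadata={"source": self.path, "line": i})

    def load(self) -> list[Document]:
        # Eager variant: just materialize the lazy generator
        return list(self.lazy_load())
```

Note that load() is defined in terms of lazy_load(): the streaming method is the primitive, and the eager one is a convenience on top of it.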
DocumentLoader Does One Thing
DocumentLoader only extracts text; it does not embed and it does not store. Each step in the pipeline is a deliberately separate component, so you can swap any piece without touching the rest: a different loader, a different embedding model, or a different vector DB, and everything else keeps working.
DocumentLoader — extract
Reads your source (PDF, webpage, CSV) and returns a list of Document objects with page_content and metadata. Knows nothing about embeddings.
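As a concrete illustration, here is a toy CSV loader. It reads from an in-memory string rather than a file (real loaders read from disk or the network), and the CSVLoader name and row-per-Document layout are illustrative choices, not the library's API.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class CSVLoader:
    """Toy CSV loader: one Document per row, columns flattened to 'key: value' lines."""
    def __init__(self, text: str, source: str = "inline.csv"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        for i, row in enumerate(csv.DictReader(io.StringIO(self.text))):
            content = "\n".join(f"{k}: {v}" for k, v in row.items())
            yield Document(page_content=content,
                           metadata={"source": self.source, "row": i})
```

The output is plain Documents with text and metadata; nothing about vectors appears anywhere in the loader.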
TextSplitter — chunk
Breaks long documents into smaller overlapping chunks so they fit within the embedding model's context window.
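A minimal sketch of that chunking step, assuming fixed-size character windows (real splitters also respect sentence and paragraph boundaries). The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a boundary still has context on both sides; the sketch assumes overlap is smaller than chunk_size.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Fixed-size character chunks where consecutive chunks share `overlap`
    characters. Assumes 0 <= overlap < chunk_size."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```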
EmbeddingModel — vectorize
Turns each chunk of text into a vector of numbers that captures its semantic meaning.
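Real embeddings come from a trained model; the sketch below is only a hashed bag-of-words stand-in that shows the shape of the output, a fixed-dimension, L2-normalized vector per chunk. It captures word overlap, not true semantic meaning.

```python
import math
import zlib

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: each word bumps one of `dim` hash buckets, then the
    vector is L2-normalized. A stand-in for a real embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```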
VectorStore — store and search
Stores the vectors and enables similarity search. This is where pgvector, Pinecone, or Chroma lives. It accepts vectors you hand it — it cannot read files on its own.
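A brute-force sketch of what the store does, assuming cosine similarity over a plain Python list. Production stores (pgvector, Pinecone, Chroma) do the same job with approximate indexes and persistence, but notice the interface: add() takes a vector you already computed, and there is no file-reading anywhere.

```python
class InMemoryVectorStore:
    """Toy vector store: brute-force cosine similarity over a list."""
    def __init__(self):
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        # Accepts a precomputed vector; the store never embeds anything itself
        self._items.append((vector, text))

    def similarity_search(self, query: list[float], k: int = 1) -> list[str]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(y * y for y in b) ** 0.5
            return dot / ((na * nb) or 1.0)
        ranked = sorted(self._items, key=lambda it: cosine(query, it[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```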
A vector database stores and searches vectors — but it has no idea how to read a PDF or scrape a webpage. DocumentLoader extracts the text first, then you embed it and hand it to the vector store. You always need both.
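Putting the four stages together, here is an end-to-end toy pipeline. Every stage is a stand-in (the "loader" reads an in-memory string, the embedding is a hashed bag-of-words, the "store" is a list), but the wiring mirrors the real flow: load, split, embed, index, then embed the query and rank by similarity.

```python
import math
import zlib

def load(text: str) -> list[dict]:
    """'Loader' stage: the source here is just an in-memory string."""
    return [{"page_content": text, "metadata": {"source": "inline"}}]

def split(text: str, size: int = 18) -> list[str]:
    """'Splitter' stage: fixed-size character chunks (no overlap, for brevity)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str, dim: int = 32) -> list[float]:
    """'Embedding' stage: hashed bag-of-words standing in for a real model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(source: str) -> list[tuple[list[float], str]]:
    """'Vector store' stage: a list of (vector, chunk) pairs."""
    return [(embed(chunk), chunk)
            for doc in load(source)
            for chunk in split(doc["page_content"])]

def search(index: list[tuple[list[float], str]],
           query: str, k: int = 1) -> list[str]:
    """Query time: embed the query, rank chunks by dot product."""
    qv = embed(query)
    scored = sorted(((sum(a * b for a, b in zip(qv, vec)), chunk)
                     for vec, chunk in index), reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

Swapping any single stage (a real PDF loader, a real embedding model, a real vector DB) leaves the other three untouched, which is the whole point of keeping them separate.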