Document Loaders
Loading data from PDFs, CSVs, Notion, Slack, Google Drive, and web pages. Covers the DocumentLoader interface, lazy_load(), and aload().
Quick Reference
- DocumentLoader is the base interface — every loader implements .load() and .lazy_load()
- lazy_load() yields documents one at a time — essential for large datasets that don't fit in memory
- aload() is the async variant for non-blocking I/O in async pipelines
- Each Document has page_content (str) and metadata (dict) with source info
- Community loaders cover 100+ sources: PDF, CSV, Notion, Slack, Google Drive, web scraping
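The load()/lazy_load() contract above can be sketched in a few lines. This is a stdlib-only toy, not the real library classes: Document here is a stand-in dataclass, and TextFileLoader is a made-up loader that emits one Document per line of a text file.

```python
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class TextFileLoader:
    """Toy loader: one Document per line of a text file."""
    def __init__(self, path: str):
        self.path = path

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time; the whole file is never held in memory
        with open(self.path) as f:
            for i, line in enumerate(f):
                yield Document(page_content=line.rstrip("\n"),
                               metadata={"source": self.path, "line": i})

    def load(self) -> list[Document]:
        # Eager variant: just materialize the lazy generator
        return list(self.lazy_load())
```

Note that load() is defined in terms of lazy_load(): the streaming method is the primitive, and the eager one is a convenience on top of it.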
DocumentLoader Does One Thing
DocumentLoader only extracts text; it does not embed and it does not store. Each step in the pipeline is a deliberately separate component, so you can swap any piece without touching the rest: a different loader, a different embedding model, or a different vector DB, and everything else keeps working.
DocumentLoader — extract
Reads your source (PDF, webpage, CSV) and returns a list of Document objects with page_content and metadata. Knows nothing about embeddings.
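As a concrete illustration, here is a toy CSV loader. It reads from an in-memory string rather than a file (real loaders read from disk or the network), and the CSVLoader name and row-per-Document layout are illustrative choices, not the library's API.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class CSVLoader:
    """Toy CSV loader: one Document per row, columns flattened to 'key: value' lines."""
    def __init__(self, text: str, source: str = "inline.csv"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        for i, row in enumerate(csv.DictReader(io.StringIO(self.text))):
            content = "\n".join(f"{k}: {v}" for k, v in row.items())
            yield Document(page_content=content,
                           metadata={"source": self.source, "row": i})
```

The output is plain Documents with text and metadata; nothing about vectors appears anywhere in the loader.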
TextSplitter — chunk
Breaks long documents into smaller overlapping chunks so they fit within the embedding model's context window.
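A minimal sketch of that chunking step, assuming fixed-size character windows (real splitters also respect sentence and paragraph boundaries). The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a boundary still has context on both sides; the sketch assumes overlap is smaller than chunk_size.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Fixed-size character chunks where consecutive chunks share `overlap`
    characters. Assumes 0 <= overlap < chunk_size."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```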
EmbeddingModel — vectorize
Turns each chunk of text into a vector of numbers that captures its semantic meaning.
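Real embeddings come from a trained model; the sketch below is only a hashed bag-of-words stand-in that shows the shape of the output, a fixed-dimension, L2-normalized vector per chunk. It captures word overlap, not true semantic meaning.

```python
import math
import zlib

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: each word bumps one of `dim` hash buckets, then the
    vector is L2-normalized. A stand-in for a real embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```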
VectorStore — store and search
Stores the vectors and enables similarity search. This is where pgvector, Pinecone, or Chroma lives. It accepts vectors you hand it — it cannot read files on its own.
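A brute-force sketch of what the store does, assuming cosine similarity over a plain Python list. Production stores (pgvector, Pinecone, Chroma) do the same job with approximate indexes and persistence, but notice the interface: add() takes a vector you already computed, and there is no file-reading anywhere.

```python
class InMemoryVectorStore:
    """Toy vector store: brute-force cosine similarity over a list."""
    def __init__(self):
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        # Accepts a precomputed vector; the store never embeds anything itself
        self._items.append((vector, text))

    def similarity_search(self, query: list[float], k: int = 1) -> list[str]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(y * y for y in b) ** 0.5
            return dot / ((na * nb) or 1.0)
        ranked = sorted(self._items, key=lambda it: cosine(query, it[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```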
A vector database stores and searches vectors — but it has no idea how to read a PDF or scrape a webpage. DocumentLoader extracts the text first, then you embed it and hand it to the vector store. You always need both.
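Putting the four stages together, here is an end-to-end toy pipeline. Every stage is a stand-in (the "loader" reads an in-memory string, the embedding is a hashed bag-of-words, the "store" is a list), but the wiring mirrors the real flow: load, split, embed, index, then embed the query and rank by similarity.

```python
import math
import zlib

def load(text: str) -> list[dict]:
    """'Loader' stage: the source here is just an in-memory string."""
    return [{"page_content": text, "metadata": {"source": "inline"}}]

def split(text: str, size: int = 18) -> list[str]:
    """'Splitter' stage: fixed-size character chunks (no overlap, for brevity)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str, dim: int = 32) -> list[float]:
    """'Embedding' stage: hashed bag-of-words standing in for a real model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(source: str) -> list[tuple[list[float], str]]:
    """'Vector store' stage: a list of (vector, chunk) pairs."""
    return [(embed(chunk), chunk)
            for doc in load(source)
            for chunk in split(doc["page_content"])]

def search(index: list[tuple[list[float], str]],
           query: str, k: int = 1) -> list[str]:
    """Query time: embed the query, rank chunks by dot product."""
    qv = embed(query)
    scored = sorted(((sum(a * b for a, b in zip(qv, vec)), chunk)
                     for vec, chunk in index), reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

Swapping any single stage (a real PDF loader, a real embedding model, a real vector DB) leaves the other three untouched, which is the whole point of keeping them separate.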