
Document Loaders

Loading data from PDFs, CSVs, Notion, Slack, Google Drive, and web pages via the DocumentLoader interface and its load(), lazy_load(), and aload() methods.

Quick Reference

  • DocumentLoader is the base interface — every loader implements .load() and .lazy_load()
  • lazy_load() yields documents one at a time — essential for large datasets that don't fit in memory
  • aload() is the async variant for non-blocking I/O in async pipelines
  • Each Document has page_content (str) and metadata (dict) with source info
  • Community loaders cover 100+ sources: PDF, CSV, Notion, Slack, Google Drive, web scraping
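The interface above can be sketched in plain Python. Note that `Document` and `LineLoader` below are toy stand-ins written for this example — they only mirror the shape of `langchain_core`'s `Document` and `BaseLoader`, and are not the real classes:

```python
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Document:
    # Mirrors the two fields every LangChain Document carries.
    page_content: str
    metadata: dict = field(default_factory=dict)

class LineLoader:
    """Hypothetical loader: treats each line of a string as one document."""

    def __init__(self, text: str, source: str = "inline"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        # Yields one Document at a time -- nothing is held in memory up front.
        for i, line in enumerate(self.text.splitlines()):
            yield Document(page_content=line,
                           metadata={"source": self.source, "line": i})

    def load(self) -> list[Document]:
        # load() is just the eager version of lazy_load().
        return list(self.lazy_load())

docs = LineLoader("first line\nsecond line").load()
```

The split between the eager and lazy methods is the part worth copying: for a 10 GB source, iterating `lazy_load()` lets you process and discard each document before the next one is read.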

DocumentLoader Does One Thing

DocumentLoader only extracts text. It does not embed, it does not store. Each step in the pipeline is a separate component — intentionally — so you can swap any piece without touching the rest. Use a different loader, a different embedding model, a different vector DB — the rest stays the same.

1

DocumentLoader — extract

Reads your source (PDF, webpage, CSV) and returns a list of Document objects with page_content and metadata. Knows nothing about embeddings.

2

TextSplitter — chunk

Breaks long documents into smaller overlapping chunks so they fit within the embedding model's context window.
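A minimal fixed-size splitter with overlap shows the idea; `split_text` is a hypothetical helper, and LangChain's real RecursiveCharacterTextSplitter is smarter — it prefers paragraph, sentence, and word boundaries before cutting mid-word:

```python
def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Cut text into chunks of at most chunk_size characters,
    where consecutive chunks share `overlap` characters so that
    context straddling a cut point is not lost."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap is what keeps a sentence that spans a chunk boundary retrievable from at least one chunk.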

3

EmbeddingModel — vectorize

Turns each chunk of text into a vector of numbers that captures its semantic meaning.
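To make the "vector of numbers" concrete, here is a toy stand-in — a bag-of-words hash with L2 normalization. A real embedding model (OpenAI, sentence-transformers, etc.) produces dense vectors where similar *meanings* land close together; this sketch only captures shared words, not semantics:

```python
import hashlib
import math

def embed(text: str, dim: int = 16) -> list[float]:
    """Toy 'embedding': hash each word into one of `dim` buckets,
    count occurrences, then L2-normalize to unit length."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```

Whatever the model, the contract is the same: text in, fixed-length unit-ish vector out — which is exactly what the next stage stores.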

4

VectorStore — store and search

Stores the vectors and enables similarity search. This is where pgvector, Pinecone, or Chroma lives. It accepts vectors you hand it — it cannot read files on its own.
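The store's contract can be sketched with a plain list; `ToyVectorStore` below is not a real library class. pgvector, Pinecone, and Chroma implement the same two operations at scale, with ANN indexes (e.g. HNSW) instead of a linear scan:

```python
class ToyVectorStore:
    """Sketch of the vector-store contract: add (vector, text) pairs,
    then return the nearest texts to a query vector by cosine similarity."""

    def __init__(self):
        self._entries: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._entries.append((vector, text))

    def similarity_search(self, query: list[float], k: int = 1) -> list[str]:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(x * x for x in b) ** 0.5
            return dot / (na * nb or 1.0)
        # Linear scan; real stores use an index to avoid comparing everything.
        ranked = sorted(self._entries,
                        key=lambda e: cosine(query, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

Notice what is missing: nothing here knows how to open a file or call an embedding model. The store only ever sees vectors you hand it.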

pgvector still needs DocumentLoader

A vector database stores and searches vectors — but it has no idea how to read a PDF or scrape a webpage. DocumentLoader extracts the text first, then you embed it and hand it to the vector store. You always need both.
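The whole four-step flow can be seen in one self-contained sketch, using stdlib stand-ins: a hard-coded "source" instead of a real loader, a sentence splitter, a bag-of-words "embedding" over a tiny fixed vocabulary, and a list-backed store. A real pipeline swaps in a real loader, splitter, embedding model, and vector DB at exactly these seams:

```python
VOCAB = ["cats", "purr", "dogs", "bark", "strangers"]

def load() -> str:
    # 1. DocumentLoader: extract text (a real loader would parse a PDF/CSV).
    return "cats purr when content. dogs bark at strangers."

def split(text: str) -> list[str]:
    # 2. TextSplitter: chunk on sentence boundaries.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def embed(chunk: str) -> list[float]:
    # 3. EmbeddingModel: here just word presence over VOCAB, no semantics.
    words = chunk.lower().replace(".", "").split()
    return [float(w in words) for w in VOCAB]

# 4. VectorStore: store each (vector, chunk) pair...
store = [(embed(c), c) for c in split(load())]

def search(query: str) -> str:
    # ...and similarity-search by dot product (real stores use cosine/ANN).
    qv = embed(query)
    return max(store, key=lambda e: sum(a * b for a, b in zip(qv, e[0])))[1]
```

Step 4 never touches the raw text source — it only compares vectors — which is precisely why a vector database on its own can't replace the loader.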