
Multimodal RAG

RAG beyond text: indexing images, tables, and diagrams from documents. PDF processing, multi-vector retrieval, and using vision models for table and image understanding.

Quick Reference

  • Real documents contain text, tables, images, and diagrams — all carry information
  • PDF processing: extract text, tables, and images as separate elements with different embedding strategies
  • Multi-vector retrieval: store text summaries of visual elements as separate vectors alongside raw content
  • Vision models (e.g., GPT-4o, Claude) can describe images and tables, creating text representations for embedding
  • Unstructured.io is a widely used library for parsing complex documents into structured elements

Beyond Text: The Multimodal Challenge

Most RAG tutorials assume documents are pure text. Real-world documents are multimodal: financial reports have charts and tables, technical docs have architecture diagrams, product manuals have screenshots, and research papers have figures. If you only index the text, you miss critical information stored in these visual elements. A table comparing product features might be the most important part of a document, but a text-only RAG pipeline ignores it entirely.

  • Tables: comparison tables, pricing tables, data tables — structured information that's often the answer to a query
  • Images: architecture diagrams, screenshots, charts, graphs — visual information that text can't fully capture
  • Code blocks: syntax-highlighted code in documents — needs special handling to preserve formatting
  • Equations: mathematical formulas in research papers — LaTeX or rendered images
  • Embedded documents: PDFs with embedded spreadsheets, slides with embedded charts
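Once a parser has split a document into typed elements, each type needs its own handling path. The sketch below routes elements by a `category` field; the dicts mimic the shape of output from a parser like Unstructured, but the category names and handler buckets here are illustrative assumptions, not a guaranteed API.

```python
# Route parsed document elements to per-type handling paths.
# The element dicts loosely mimic a PDF parser's output; the
# category names used here are assumptions for illustration.

def route_elements(elements):
    routed = {"text": [], "tables": [], "images": []}
    for el in elements:
        category = el.get("category", "")
        if category == "Table":
            routed["tables"].append(el)   # send to a table/vision model
        elif category in ("Image", "Figure"):
            routed["images"].append(el)   # send to a vision model
        else:
            routed["text"].append(el)     # embed directly as text
    return routed

elements = [
    {"category": "NarrativeText", "text": "Q3 revenue grew 12%."},
    {"category": "Table", "text": "<table>...</table>"},
    {"category": "Image", "path": "fig1.png"},
]
routed = route_elements(elements)
```

In a real pipeline, the "tables" and "images" buckets would be sent to a vision model for summarization before indexing, while the "text" bucket goes straight to the embedding model.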

Two strategies for multimodal content

  • Strategy 1 (Summarize): Use a vision model to describe images and tables as text, then embed the text description. Simpler, and works with any vector store.
  • Strategy 2 (Multi-vector): Embed visual elements with a multimodal embedding model (like CLIP) alongside text embeddings. More complex, but preserves visual information better.
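The core of Strategy 1 combined with multi-vector retrieval is: index a text summary of each visual element, but return the raw element at query time. A minimal sketch, assuming summaries have already been produced by a vision model; the bag-of-words "embedding" is a stand-in for a real embedding model, and the class and field names are hypothetical:

```python
import math
from collections import Counter

# Minimal multi-vector store sketch: each visual element is indexed
# by its text summary, but retrieval returns the raw element.
# Counter-based bag-of-words stands in for a real embedding model.

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MultiVectorStore:
    def __init__(self):
        self.entries = []  # (summary_vector, raw_element) pairs

    def add(self, summary, raw_element):
        self.entries.append((embed(summary), raw_element))

    def retrieve(self, query, k=1):
        qv = embed(query)
        ranked = sorted(self.entries,
                        key=lambda e: cosine(qv, e[0]), reverse=True)
        return [raw for _, raw in ranked[:k]]

store = MultiVectorStore()
store.add("Table comparing pricing tiers: Free, Pro, Enterprise",
          {"type": "table", "html": "<table>...</table>"})
store.add("Architecture diagram showing API gateway and workers",
          {"type": "image", "path": "arch.png"})

# A pricing question matches the table's summary, so the raw table
# element is returned even though the query never mentions "table".
results = store.retrieve("how much does the pro plan cost")
```

The key design choice is that the summary is only a retrieval key: the generator sees the raw table HTML or the image itself, so nothing is lost to a lossy description at answer time.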