Multimodal RAG

Real documents contain tables, images, and diagrams, but most teams over-invest in image processing when tables alone deliver most of the value (60-80% of the gain over text-only). This article covers three strategies (OCR+summarize, ColPali vision embeddings, vision at query time), how vision models fail on financial data, and a first-30-days runbook for incremental deployment.

Quick Reference

  • Run a text-only baseline first — measure the gap before investing in multimodal processing
  • Three strategies: OCR+Summarize (mature, slow), ColPali/ColQwen (fast, no OCR), Vision at Query Time (expensive, highest quality)
  • ColPali embeds entire page images at 0.39s/page vs ~51s/page for Unstructured hi_res, roughly a 130x speedup per page
  • Multi-vector pattern: text summaries for search, raw content (HTML tables, image paths) for generation
  • Vision models hallucinate digits ('$5,000' → '$50,000') with full confidence — never trust financial figures without cross-validation
  • Tables deliver 60-80% of multimodal RAG's value over text-only — ship table processing first, add images second
  • Use model tiering: GPT-5.4 Nano ($0.017/1K images) as a classifier, premium models only for complex cases
  • Evaluate on three dimensions: extraction quality, retrieval recall on visual queries, end-to-end answer quality
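
The cross-validation point above can be sketched in a few lines. This is a minimal illustration, not a production validator: it assumes you have two independent extractions of the same region (for example, a vision model's description and the PDF text layer or OCR output), and the function names and regular expression are hypothetical. It flags any number the vision model reported that the reference extraction does not contain.

```python
import re

# Hypothetical pattern for illustration: matches "5,000", "50000", "1,234.56".
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Return the numeric tokens in text with thousands separators removed."""
    return {m.group().replace(",", "") for m in _NUMBER.finditer(text)}


def unverified_figures(vision_text: str, reference_text: str) -> set[str]:
    """Numbers reported by the vision model that the reference extraction lacks."""
    return extract_numbers(vision_text) - extract_numbers(reference_text)


vision_output = "Deductible: $50,000 per claim in 2024"
text_layer = "Deductible $5,000 per claim, 2024"
suspect = unverified_figures(vision_output, text_layer)
print(suspect)  # {'50000'}
```

Any non-empty result is a reason to route the figure to a second model or a human reviewer rather than pass it to generation. An exact-match check like this will also flag harmless formatting differences (such as "5000.00" vs "5000"), so treat it as a conservative filter.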

When You Need Multimodal RAG (and When You Don't)

Most RAG tutorials assume documents are pure text. Real documents are multimodal: financial reports have charts, technical docs have architecture diagrams, product manuals have screenshots, and research papers have figures. But not all visual content carries information absent from the text. The engineering question is not 'can I process images?' but 'does visual content carry information my text pipeline misses?'

If the surrounding text already describes what's in a figure, you don't need image processing. A pie chart captioned 'Revenue by region: APAC 42%, EMEA 31%, Americas 27%' adds nothing beyond its caption. An architecture diagram with labeled arrows is often fully described by the paragraph above it. But a pricing comparison table, a before/after benchmark chart, or a screenshot of a UI state can carry information that text-only RAG will miss entirely.

Signal in Your Documents | Multimodal Needed? | Why
Numeric comparison tables (pricing, specs, benchmarks) | YES | Table structure carries the comparison; text extraction scrambles column relationships
Charts with captions that summarize the data | PROBABLY NOT | The caption often contains all the retrievable information
Architecture diagrams with labeled components | DEPENDS | If queries target the diagram's content, not just what the diagram 'shows'
Screenshots of UI states | YES (if queried) | No text equivalent exists; only relevant if users ask about the UI
Decorative images, logos, stock photos | NO | Zero information beyond surrounding text
Equations rendered as images (research papers) | YES | LaTeX source may not be present; the equation is the information
Tables with footnotes and merged cells | YES | Standard PDF text extraction will destroy the structure

Run a text-only baseline first

Before building any multimodal pipeline, score your text-only RAG on 30+ real user queries. If more than 80% are answered correctly, the ROI of multimodal processing is likely negative: you would be paying for at most a 20-point improvement, concentrated in rare queries. Teams that skip this baseline risk spending weeks building image processing that their users never benefit from.
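
The baseline check can be expressed as a small gate. The sketch below assumes each query has already been graded as correct or incorrect (by a human or a rubric); the `GradedQuery` type and function names are hypothetical, and the 30-query minimum and 80% threshold are taken from the rule of thumb above.

```python
from dataclasses import dataclass


@dataclass
class GradedQuery:
    question: str
    correct: bool  # judgment of the text-only pipeline's answer


def baseline_accuracy(results: list[GradedQuery]) -> float:
    """Fraction of graded queries the text-only pipeline answered correctly."""
    return sum(r.correct for r in results) / len(results)


def multimodal_worth_exploring(
    results: list[GradedQuery],
    min_queries: int = 30,
    threshold: float = 0.80,
) -> bool:
    """True when the text-only baseline leaves enough of a gap to justify the work."""
    if len(results) < min_queries:
        raise ValueError(f"need at least {min_queries} graded queries, got {len(results)}")
    return baseline_accuracy(results) <= threshold


# Example: 21 of 30 real queries answered correctly by text-only RAG (70%).
graded = [GradedQuery(f"q{i}", correct=i < 21) for i in range(30)]
print(multimodal_worth_exploring(graded))  # True
```

Refusing to decide on fewer than 30 queries is deliberate: a handful of hand-picked examples tends to over-represent the visual queries that motivated the project in the first place.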

Real project

An insurance company processes 10,000 claims/day. Each claim has damage photos, coverage comparison tables, and body text. Text-only RAG answered 72% of adjuster queries correctly. Adding table processing (no images) raised it to 91%. Adding vision model descriptions of damage photos raised it to 93%, a 2-point gain for 4x the cost per document. They shipped tables-only and re-evaluated image processing 6 months later. The 2-point improvement still didn't justify the cost at their volume.
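
The arithmetic behind that decision is worth making explicit. The sketch below uses only the accuracy figures from the example; the function name and stage labels are illustrative. It computes the gain in percentage points contributed by each added stage and each stage's share of the total improvement.

```python
def marginal_gains(stages: list[tuple[str, float]]) -> dict[str, float]:
    """Point gain of each stage over the stage before it, in deployment order."""
    return {
        name: acc - prev_acc
        for (_, prev_acc), (name, acc) in zip(stages, stages[1:])
    }


# Accuracy (%) on adjuster queries at each stage of the example above.
stages = [("text-only", 72), ("+ tables", 91), ("+ image descriptions", 93)]

gains = marginal_gains(stages)
total = sum(gains.values())
share = {name: gain / total for name, gain in gains.items()}

print(gains)                         # {'+ tables': 19, '+ image descriptions': 2}
print(round(share["+ tables"], 2))   # 0.9
```

In this example tables account for 19 of the 21 points gained, about 90% of the improvement, which is why the image stage was hard to justify at 4x the cost per document.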

Takeaway: Most multimodal value comes from tables. Add image processing only after measuring the incremental gain on real queries.