Advanced · 20 min

Multimodal Agents (Vision & Files)

When to use vision models vs. dedicated parsers, real cost math using Anthropic's actual token formula, how vision fails on financial docs, model tiering for 50–90% cost savings, image generation with gpt-image-1.5, and a 30-day deployment runbook.

Quick Reference

  • Vision models (Claude Opus 4.7, GPT-5.4, Gemini 3.1 Pro) accept images inline — but dedicated OCR tools beat them on uniform, high-volume documents at 10–50× lower cost
  • Anthropic image tokens: width × height ÷ 750. A 512×512 image costs ~349 tokens, not the ~85 tokens in outdated guides
  • Claude Opus 4.7 supports images up to 2576px on the long edge (~4,784 tokens/image) — roughly 3× the token budget of Sonnet 4.6 for the same image
  • DALL-E 3 is deprecated May 12, 2026 — use gpt-image-1.5, which returns b64_json bytes, not a URL
  • Model tiering (cheap classifier → premium only for complex cases) cuts multimodal costs 50–90% on typical enterprise corpora
  • Vision models hallucinate digits and fail at spatial reasoning — cross-validate financial figures with a second model family before downstream use
  • Store file references (S3 URLs, Anthropic file_ids) in LangGraph state, not raw bytes — state is checkpointed after every node
  • LangChain supports both {"type": "image_url"} and {"type": "image", "url": ...} formats — the latter is the newer provider-agnostic standard
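
The token formula in the list above can be turned into a quick estimator. This is a minimal sketch, assuming the width × height ÷ 750 rule applies to the image as sent (it does not model any downscaling the API may apply to oversized images). The function names and the price argument are hypothetical placeholders, not published rates.

```python
def estimate_image_tokens(width_px: int, height_px: int) -> float:
    """Approximate token count for one image: width x height / 750."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("image dimensions must be positive")
    return (width_px * height_px) / 750


def estimate_image_cost_usd(
    width_px: int,
    height_px: int,
    usd_per_million_input_tokens: float,  # assumption: look up your model's real rate
) -> float:
    """Approximate input cost in USD for one image at a given token price."""
    tokens = estimate_image_tokens(width_px, height_px)
    return tokens * usd_per_million_input_tokens / 1_000_000


if __name__ == "__main__":
    # 512 x 512 = 262,144 pixels, which is about 349.5 tokens under this rule.
    print(f"{estimate_image_tokens(512, 512):.1f} tokens")
```

Running the estimator over a sample of real page dimensions before choosing a model gives a rough per-document budget, which is usually enough to decide whether downscaling is worth it.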

When NOT to Use Vision Models

Before writing any multimodal agent code, ask whether a specialized tool would outperform a vision model on your specific task. Vision models are expensive and general-purpose. The tools below are cheap, precise, and purpose-built.

| Scenario | Dedicated Tool | Why It Wins | Cost Difference |
|---|---|---|---|
| Extracting text from uniform PDF invoices | AWS Textract / Azure Document Intelligence | Zero hallucination on structured forms, returns JSON with field positions | 10–50× cheaper per page |
| OCR from printed receipts and forms | Tesseract (free), Google Cloud Vision OCR | Pixel-accurate for print text; deterministic output | $0 or ~$0.001/page vs $0.004+ per vision call |
| Object detection in video or photos | YOLO v11, OpenCV | Runs locally at <1ms/frame; no API cost, no token budget | Near-zero vs $0.004+/image |
| Signature or checkbox detection | OpenCV + template matching | Deterministic, auditable, and 1000× faster than asking a vision model | Free vs $0.004+/image |
| Serial numbers off equipment | Dedicated OCR + regex | Serial numbers have known patterns; vision models invent digits under uncertainty | 5–20× cheaper |

Real project

A fintech team shipped a receipt-processing agent using GPT-4o vision for every receipt. At ~$0.004/receipt × 50,000/month, cost was ~$200/month, with about 3 seconds of latency per receipt. They added Tesseract OCR as a first-pass parser. Tesseract handled 92% of receipts almost instantly and at near-zero cost. The vision model was only called for the 8% Tesseract failed on (handwriting, torn corners, unusual layouts). Final blended cost: ~$18/month. Latency on the common path dropped from 3s to 80ms.
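
The arithmetic behind these figures can be reproduced in a few lines. This is a sketch using only the numbers quoted above (50,000 receipts, ~$0.004 per vision call, 8% fallback rate). The function name and the `parser_cost_per_doc` parameter are hypothetical, and parser cost is set to zero here because the source does not break it down.

```python
def monthly_cost(
    volume: int,
    vision_cost_per_doc: float,
    fallback_rate: float = 1.0,        # share of documents sent to the vision model
    parser_cost_per_doc: float = 0.0,  # assumption: per-document parser overhead
) -> float:
    """Blended monthly cost when only a fraction of documents reach the vision model."""
    vision_calls = volume * fallback_rate
    return vision_calls * vision_cost_per_doc + volume * parser_cost_per_doc


if __name__ == "__main__":
    vision_only = monthly_cost(50_000, 0.004)
    parser_first = monthly_cost(50_000, 0.004, fallback_rate=0.08)
    print(f"vision only:  ${vision_only:.2f}")   # $200.00
    print(f"parser first: ${parser_first:.2f}")  # $16.00
```

Vision calls alone account for about $16 of the reported ~$18. The remainder would have to come from parser compute or other overhead, which the case study does not itemize.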

Takeaway: the parser-first, vision-as-fallback pattern is the default architecture for high-volume document agents.
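
One way to structure the parser-first, vision-as-fallback pattern is sketched below. It assumes the parser reports a confidence score and that the vision call is wrapped in a plain function. `ParseResult`, `Extraction`, `extract`, and the 0.85 threshold are illustrative names and values, not part of any library, and the two stub functions stand in for a real OCR engine and a real vision model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ParseResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the parser


@dataclass
class Extraction:
    text: str
    source: str  # "parser" or "vision"


def extract(
    document: bytes,
    parser: Callable[[bytes], Optional[ParseResult]],
    vision: Callable[[bytes], str],
    min_confidence: float = 0.85,  # hypothetical threshold; tune on labeled samples
) -> Extraction:
    """Try the cheap parser first; call the vision model only when it fails."""
    result = parser(document)
    if result is not None and result.text.strip() and result.confidence >= min_confidence:
        return Extraction(result.text, "parser")
    return Extraction(vision(document), "vision")


# Stubs standing in for a real OCR engine and a real vision model call.
def stub_parser(document: bytes) -> Optional[ParseResult]:
    if document.startswith(b"PRINTED"):
        return ParseResult(text="TOTAL 12.50", confidence=0.97)
    return ParseResult(text="T0TAL l2.5O", confidence=0.40)


def stub_vision(document: bytes) -> str:
    return "TOTAL 12.50 (read by vision model)"


if __name__ == "__main__":
    print(extract(b"PRINTED receipt", stub_parser, stub_vision).source)      # parser
    print(extract(b"HANDWRITTEN receipt", stub_parser, stub_vision).source)  # vision
```

Recording `source` on every extraction makes it possible to track the fallback rate over time, which is the number that drives the blended cost.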

Decision flow for multimodal RAG:

  • Exact number extraction needed? YES: OCR + Summarize (tables, exact numbers; Unstructured, ~51s/page)
  • Otherwise, more than 50K pages or speed critical? YES: ColPali / ColQwen (vision embeddings, no OCR; 0.39s/page, self-hosted)
  • Otherwise, visual layout queries? YES: Vision at Query Time (images → LLM at retrieval; <10K pages, high quality)
  • Otherwise: ColPali / ColQwen (default, general use)

Start with OCR+Summarize for tables · ColPali for speed · vision-at-query-time for layout-heavy corpora
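
The decision flow above reduces to three ordered checks. This is a minimal sketch of that ordering. The function name, argument names, and returned labels are illustrative, and the 50,000-page cutoff is taken directly from the flow.

```python
def choose_rag_strategy(
    needs_exact_numbers: bool,
    page_count: int,
    speed_critical: bool,
    visual_layout_queries: bool,
) -> str:
    """Return a strategy label following the order of checks in the decision flow."""
    if needs_exact_numbers:
        # Tables and exact figures: OCR the pages, then summarize.
        return "ocr_summarize"
    if page_count > 50_000 or speed_critical:
        # Large or latency-sensitive corpora: vision embeddings, no OCR step.
        return "colpali"
    if visual_layout_queries:
        # Smaller corpora where layout matters: send page images to the LLM at retrieval.
        return "vision_at_query_time"
    # Default for general use.
    return "colpali"


if __name__ == "__main__":
    print(choose_rag_strategy(True, 2_000, False, False))     # ocr_summarize
    print(choose_rag_strategy(False, 120_000, False, True))   # colpali
    print(choose_rag_strategy(False, 5_000, False, True))     # vision_at_query_time
```

Order matters here: exact-number extraction takes priority over corpus size, so a large financial corpus still routes to OCR + Summarize.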

When vision models DO win

Use vision when: the document layout encodes meaning (org charts, flowcharts, architecture diagrams), the content is handwritten or non-standard, format varies wildly across documents, or you need to reason about visual relationships between elements. These are cases where no parser configuration generalizes.