Multimodal Agents (Vision & Files)
When to use vision models vs. dedicated parsers, real cost math using Anthropic's actual token formula, how vision fails on financial docs, model tiering for 50–90% cost savings, image generation with gpt-image-1.5, and a 30-day deployment runbook.
Quick Reference
- →Vision models (Claude Opus 4.7, GPT-5.4, Gemini 3.1 Pro) accept images inline — but dedicated OCR tools beat them on uniform, high-volume documents at 10–50× lower cost
- →Anthropic image tokens: width × height ÷ 750. A 512×512 image costs ~349 tokens, not the ~85 tokens in outdated guides
- →Claude Opus 4.7 supports images up to 2576px on the long edge (~4,784 tokens/image) — roughly 3× the token budget of Sonnet 4.6 for the same image
- →DALL-E 3 is deprecated May 12, 2026 — use gpt-image-1.5, which returns b64_json bytes, not a URL
- →Model tiering (cheap classifier → premium only for complex cases) cuts multimodal costs 50–90% on typical enterprise corpora
- →Vision models hallucinate digits and fail at spatial reasoning — cross-validate financial figures with a second model family before downstream use
- →Store file references (S3 URLs, Anthropic file_ids) in LangGraph state, not raw bytes — state is checkpointed after every node
- →LangChain supports both {"type": "image_url"} and {"type": "image", "url": ...} formats — the latter is the newer provider-agnostic standard
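The image-token formula in the list above (width × height ÷ 750) is easy to turn into a budgeting helper. The sketch below is a rough estimator only: the function names are illustrative, the per-token price is a caller-supplied assumption rather than a published rate, and it does not model any resizing the provider may apply before counting tokens.

```python
TOKENS_PER_PIXEL_DIVISOR = 750  # from the formula quoted above: width * height / 750


def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Approximate input tokens for one image (ignores provider-side resizing)."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("image dimensions must be positive")
    return (width_px * height_px) // TOKENS_PER_PIXEL_DIVISOR


def estimate_image_cost_usd(
    width_px: int,
    height_px: int,
    usd_per_million_input_tokens: float,  # assumption: look up your model's current rate
) -> float:
    """Approximate input cost in USD for sending one image."""
    tokens = estimate_image_tokens(width_px, height_px)
    return tokens * usd_per_million_input_tokens / 1_000_000


if __name__ == "__main__":
    print(estimate_image_tokens(512, 512))  # 349, matching the figure in the list above
    # 3.0 USD per million tokens is a placeholder rate, not a quoted price.
    print(estimate_image_cost_usd(512, 512, usd_per_million_input_tokens=3.0))
```

Multiplying the per-image estimate by monthly volume gives a quick first-order budget before any tiering or fallback is applied.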
When NOT to Use Vision Models
Before writing any multimodal agent code, ask whether a specialized tool would outperform a vision model on your specific task. Vision models are expensive and general-purpose. The tools below are cheap, precise, and purpose-built.
| Scenario | Dedicated Tool | Why It Wins | Cost Difference |
|---|---|---|---|
| Extracting text from uniform PDF invoices | AWS Textract / Azure Document Intelligence | Zero hallucination on structured forms, returns JSON with field positions | 10–50× cheaper per page |
| OCR from printed receipts and forms | Tesseract (free), Google Cloud Vision OCR | Pixel-accurate for print text; deterministic output | $0 or ~$0.001/page vs $0.004+ per vision call |
| Object detection in video or photos | YOLO11, OpenCV | Runs locally in a few milliseconds per frame on a GPU; no API cost, no token budget | Near-zero vs $0.004+/image |
| Signature or checkbox detection | OpenCV + template matching | Deterministic, auditable, and 1000× faster than asking a vision model | Free vs $0.004+/image |
| Serial numbers off equipment | Dedicated OCR + regex | Serial numbers have known patterns; vision models invent digits under uncertainty | 5–20× cheaper |
A fintech team shipped a receipt-processing agent that sent every receipt to GPT-4o vision. At ~$0.004/receipt × 50,000 receipts/month, the bill was ~$200/month, with about 3 seconds of latency per receipt. They then added Tesseract OCR as a first-pass parser. Tesseract handled 92% of receipts almost instantly and at near-zero cost, so the vision model was called only for the 8% that Tesseract failed on (handwriting, torn corners, unusual layouts). Final blended cost: ~$18/month. Latency on the common path dropped from 3s to 80ms.
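The arithmetic behind that case study can be checked with a few lines. This is a minimal sketch using only the figures quoted above; the function name is illustrative. It counts vision-call spend only, which comes to $16/month on the fallback path. The text does not break down the rest of the ~$18 figure, so attributing the difference to running the OCR step is an assumption.

```python
def monthly_vision_cost(
    docs_per_month: int,
    usd_per_vision_call: float,
    vision_fraction: float = 1.0,
) -> float:
    """Monthly spend on vision calls when only `vision_fraction` of documents reach the model."""
    if not 0.0 <= vision_fraction <= 1.0:
        raise ValueError("vision_fraction must be between 0 and 1")
    return docs_per_month * vision_fraction * usd_per_vision_call


# Figures from the case study above.
baseline = monthly_vision_cost(50_000, 0.004)        # every receipt goes to the vision model
fallback = monthly_vision_cost(50_000, 0.004, 0.08)  # only the 8% the OCR pass misses

print(f"vision-only:  ${baseline:.2f}/month")
print(f"parser-first: ${fallback:.2f}/month in vision calls")
```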
The parser-first, vision-as-fallback pattern is the default architecture for high-volume document agents.
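A minimal sketch of that routing logic follows. The parser and the vision model are passed in as plain callables so the example stays independent of any particular OCR library or model SDK. The names `extract_with_fallback`, `ParseResult`, and the `min_chars` threshold are hypothetical; a production system would more likely gate on OCR confidence scores or field-level validation than on text length.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ParseResult:
    text: str
    source: str  # "parser" or "vision"


def extract_with_fallback(
    document: bytes,
    cheap_parser: Callable[[bytes], Optional[str]],
    vision_model: Callable[[bytes], str],
    min_chars: int = 20,  # hypothetical usability threshold
) -> ParseResult:
    """Try the cheap parser first; call the vision model only when it misses."""
    try:
        text = cheap_parser(document)
    except Exception:
        text = None  # treat a parser crash the same as an empty result

    if text is not None and len(text.strip()) >= min_chars:
        return ParseResult(text=text.strip(), source="parser")

    # Fallback path: the expensive call happens only for documents the parser missed.
    return ParseResult(text=vision_model(document), source="vision")


if __name__ == "__main__":
    # Stub callables standing in for an OCR engine and a vision-model client.
    def stub_ocr(doc: bytes) -> Optional[str]:
        return doc.decode("utf-8") if doc.startswith(b"PRINTED") else None

    def stub_vision(doc: bytes) -> str:
        return "text recovered by the vision model"

    print(extract_with_fallback(b"PRINTED receipt total 42.10 EUR", stub_ocr, stub_vision).source)
    print(extract_with_fallback(b"\x00handwritten scan", stub_ocr, stub_vision).source)
```

Recording `source` on each result makes it straightforward to measure the fallback rate over time, which is the number that drives the blended cost.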
Start with OCR+Summarize for tables · ColPali for speed · vision-at-query-time for layout-heavy corpora
Use vision when: the document layout encodes meaning (org charts, flowcharts, architecture diagrams), the content is handwritten or non-standard, format varies wildly across documents, or you need to reason about visual relationships between elements. These are cases where no parser configuration generalizes.