Advanced · 10 min
Multimodal Agents (Vision & Files)
Building agents that process images, PDFs, and files: vision model integration, document parsing tools, image generation as a tool, and multimodal state management.
Quick Reference
- Vision models (Claude, GPT-4V) accept images as input — pass screenshots, charts, or photos directly in the message content
- Use document parsing tools (Unstructured, PyMuPDF) to extract text and tables from PDFs, Word docs, and spreadsheets
- Image generation tools (DALL-E, Stable Diffusion) let agents create visuals as part of their workflow
- Multimodal state management: store file references (URLs, S3 keys) in state, not raw binary data
- Token cost for images is significant — resize and compress images before sending them to the vision model to control costs
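The state-management bullet above can be sketched with plain dataclasses: the agent's state holds lightweight file references (URIs, MIME types) rather than raw bytes, so it stays small and serializable. This is a minimal illustration, not a specific framework's state schema; the `FileRef`/`AgentState` names and the S3 key are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FileRef:
    """Lightweight pointer stored in agent state instead of raw binary data."""
    uri: str               # e.g. an S3 key or https URL
    mime: str              # "application/pdf", "image/png", ...
    description: str = ""  # optional summary the agent can reason over

@dataclass
class AgentState:
    messages: list = field(default_factory=list)
    files: dict[str, FileRef] = field(default_factory=dict)

state = AgentState()
state.files["q3_report"] = FileRef(
    uri="s3://example-bucket/reports/q3.pdf",  # hypothetical key
    mime="application/pdf",
    description="Q3 financial report PDF",
)
# Tools fetch the actual bytes only at the moment they need them,
# so checkpointing and replaying state never serializes large blobs.
```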
Vision Model Input
Agents that can see
Vision models (Claude, GPT-4V, Gemini) accept images as message content alongside text, so agents can read screenshots and charts, analyze photos, and extract information from visual input. Pass images either as URLs or as base64-encoded data.
Send images to a vision model via LangChain — URL and base64 approaches