Advanced · 10 min

Multimodal Agents (Vision & Files)

Building agents that process images, PDFs, and files: vision model integration, document parsing tools, image generation as a tool, and multimodal state management.

Quick Reference

  • Vision models (Claude, GPT-4V, Gemini) accept images as input: pass screenshots, charts, or photos directly in the message content
  • Use document parsing tools (Unstructured, PyMuPDF) to extract text and tables from PDFs, Word docs, and spreadsheets
  • Image generation tools (DALL-E, Stable Diffusion) let agents create visuals as part of their workflow
  • Multimodal state management: store file references (URLs, S3 keys) in state, not raw binary data
  • Images consume a significant number of tokens, so resize and compress them before sending them to the vision model to control costs
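
As a rough illustration of the state-management point above, the sketch below keeps only a small reference record per file in agent state and leaves the bytes in external storage. The names FileRef, AgentState, and register_file, the record fields, and the example S3 URI are all hypothetical, chosen for this sketch; they are not part of any particular framework.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileRef:
    """Pointer to a file stored elsewhere (hypothetical schema)."""
    uri: str          # e.g. an S3 URI or an https URL
    media_type: str   # e.g. "application/pdf"
    sha256: str       # content hash, useful for deduplication
    size_bytes: int


@dataclass
class AgentState:
    """Minimal agent state: messages plus named file references."""
    messages: list[dict] = field(default_factory=list)
    files: dict[str, FileRef] = field(default_factory=dict)


def register_file(
    state: AgentState, name: str, data: bytes, uri: str, media_type: str
) -> FileRef:
    """Record a reference to bytes that were already uploaded to `uri`.

    Only the hash and size are derived from `data`; the bytes themselves
    are not stored in the state.
    """
    ref = FileRef(
        uri=uri,
        media_type=media_type,
        sha256=hashlib.sha256(data).hexdigest(),
        size_bytes=len(data),
    )
    state.files[name] = ref
    return ref


# Usage: the upload itself is assumed to have happened elsewhere.
state = AgentState()
pdf_bytes = b"%PDF-1.7 placeholder"
report = register_file(
    state, "report", pdf_bytes,
    uri="s3://example-bucket/reports/report.pdf",  # placeholder location
    media_type="application/pdf",
)
```

Because the state holds only short strings and integers, it stays cheap to serialize and checkpoint; a tool that needs the actual file can fetch it from the stored URI when required.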

Vision Model Input

Agents that can see

Vision models (Claude, GPT-4V, Gemini) accept images as message content alongside text. Agents can see screenshots, read charts, analyze photos, and extract information from any visual input. Pass images as URLs or base64-encoded data.

Send images to a vision model via LangChain: URL and base64 approaches
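
A minimal sketch of both approaches follows. It builds the message content with the standard library only, using OpenAI-style `image_url` content blocks, which LangChain chat models commonly accept; the exact block format can vary by provider and LangChain version, so treat it as an assumption to verify. The helper names and the example URL are hypothetical, and the LangChain call is shown only as a comment.

```python
import base64
import mimetypes


def image_url_block(url: str) -> dict:
    """Content block that points the model at a hosted image."""
    return {"type": "image_url", "image_url": {"url": url}}


def image_base64_block(data: bytes, filename: str) -> dict:
    """Content block that embeds the image bytes as a base64 data URL."""
    # Guess the media type from the file extension; fall back to a generic type.
    media_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{encoded}"},
    }


def vision_content(prompt: str, *image_blocks: dict) -> list:
    """Combine a text prompt with any number of image blocks."""
    return [{"type": "text", "text": prompt}, *image_blocks]


# URL approach (placeholder URL).
url_content = vision_content(
    "Describe the trend in this chart.",
    image_url_block("https://example.com/chart.png"),
)

# Base64 approach; in practice the bytes would come from open(path, "rb").read().
fake_png = b"\x89PNG\r\n\x1a\n"
b64_content = vision_content(
    "What does this screenshot show?",
    image_base64_block(fake_png, "screenshot.png"),
)

# With LangChain installed, either list can serve as message content, e.g.:
#   from langchain_core.messages import HumanMessage
#   response = chat_model.invoke([HumanMessage(content=url_content)])
# where chat_model is a vision-capable chat model you have configured.
```

The URL form keeps requests small but requires the image to be reachable by the provider; the base64 form works for local or private files at the cost of a larger payload.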