Multimodal Models
How modern LLMs process images, audio, and video alongside text. Covers vision capabilities (image understanding, OCR, diagram analysis), audio features, current limitations, practical use cases, and working code examples for extracting structured data from images.
Quick Reference
- →Vision: GPT-5.4/GPT-5, Claude Sonnet 4.6, Gemini 2.5/3.1 all support image input with varying capabilities
- →Images are encoded as visual tokens -- a typical image uses 500-2000 tokens of context
- →OCR quality from vision models now rivals dedicated OCR engines for most document types
- →Audio: Gemini natively processes audio; OpenAI Whisper handles transcription separately
- →Video: Gemini models process video natively; others require frame extraction
- →Use cases: document processing, content moderation, accessibility, visual data extraction
Vision Capabilities
Vision-capable LLMs can analyze images alongside text, enabling tasks like document understanding, diagram analysis, chart reading, and visual question answering. Under the hood, images are tokenized into visual tokens by a vision encoder (typically a ViT variant); these visual tokens are then processed alongside text tokens in the transformer.
| Model | Max images | Image tokens | Resolution handling | Strengths |
|---|---|---|---|---|
| GPT-5.4 / GPT-5 | Multiple | ~85-1700 per image | Auto-selects detail level | Charts, diagrams, UI screenshots |
| Claude Sonnet 4.6 | Multiple | ~1000-1600 per image | Up to 1568x1568px | Dense document OCR, technical diagrams |
| Gemini 3.1 Pro | 3600+ images | ~250-3000 per image | Flexible | Large image sets, video frames, PDFs |
| Llama 4 Scout | Multiple | Varies | Flexible | Multimodal MoE, self-hostable, 10M context |
Images are expensive in tokens. A single high-resolution image can consume 1000-2000 tokens of your context window. At GPT-5.4 pricing, one image costs about $0.002-0.004 to process. If you are processing thousands of images, this adds up quickly. Consider whether you need high resolution or if lower detail settings suffice.
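The arithmetic above is worth making concrete. The per-image token counts and the per-token price below are illustrative assumptions, not published figures; substitute your provider's current numbers.

```python
# Rough cost estimate for image inputs. Token counts per detail level and
# the input-token price are ASSUMED values for illustration only --
# check your provider's pricing page before relying on them.

IMAGE_TOKENS = {"low": 85, "high": 1700}    # assumed per-image token counts
PRICE_PER_MILLION_INPUT_TOKENS = 2.00       # assumed USD price per 1M input tokens

def image_batch_cost(num_images: int, detail: str = "high") -> float:
    """Estimated USD cost of sending a batch of images as model input."""
    tokens = num_images * IMAGE_TOKENS[detail]
    return tokens * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

# 10,000 screenshots: high detail uses 17M tokens, low detail only 0.85M.
print(f"high: ${image_batch_cost(10_000, 'high'):.2f}")
print(f"low:  ${image_batch_cost(10_000, 'low'):.2f}")
```

The 20x gap between detail levels is exactly why the low-detail setting matters at scale.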
- ▸Chart and graph reading: models can extract data points, trends, and labels from charts with good accuracy
- ▸UI/screenshot analysis: useful for automated testing, accessibility auditing, and design review
- ▸Document OCR: extracts text from scanned documents, handwriting, and complex layouts
- ▸Diagram understanding: interprets flowcharts, architecture diagrams, and technical schematics
- ▸Photo analysis: describes scenes, identifies objects, reads signs and labels
Audio Capabilities
Audio processing in LLMs is evolving rapidly. Gemini models process audio natively, while OpenAI splits audio handling across Whisper (transcription) and the Realtime API (voice interaction). The approaches have different trade-offs.
| Approach | Provider | Capabilities | Latency |
|---|---|---|---|
| Native audio input | Gemini 3.1 Pro | Understands speech, music, sounds directly | Low (single model pass) |
| Whisper + LLM | OpenAI | Transcribe first, then process text | Medium (two-step) |
| Realtime API | OpenAI | Voice-to-voice conversation, low latency | Very low (~300ms) |
| Claude voice | Anthropic | Voice mode available in Claude consumer apps | Low |
For most non-real-time applications, transcribing audio with Whisper ($0.006/minute) and then processing the text with your preferred LLM gives better results than native audio processing. You get to choose your best text LLM, the transcript is reusable, and you can inspect/debug the intermediate text. Use native audio only when real-time interaction or acoustic understanding (tone, music) is required.
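A minimal sketch of the transcribe-first pipeline. The network calls are injected as plain callables so the pipeline itself is testable with stubs; the commented wrapper shows where a real Whisper client would plug in (names follow the OpenAI Python SDK, but verify against your SDK version).

```python
from typing import Callable

def transcribe_then_process(
    audio_path: str,
    transcribe: Callable[[str], str],   # speech -> text (e.g. a Whisper wrapper)
    process: Callable[[str], str],      # text -> text (your preferred LLM)
) -> tuple[str, str]:
    """Two-step audio pipeline. Returns (transcript, result) so the
    intermediate transcript can be logged, cached, and inspected."""
    transcript = transcribe(audio_path)
    result = process(transcript)
    return transcript, result

# In production the callables would wrap real clients, e.g. (illustrative):
#   transcribe = lambda p: client.audio.transcriptions.create(
#       model="whisper-1", file=open(p, "rb")).text
#   process = lambda t: summarize_with_your_llm(t)
```

Keeping the transcript as an explicit intermediate value is the whole point: it is reusable, debuggable, and lets you swap either stage independently.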
Current Limitations
Multimodal capabilities are impressive but have significant limitations that you must understand before building production systems around them.
- ▸Spatial reasoning: models struggle with precise spatial relationships ('is the red box above or below the blue circle?')
- ▸Counting: counting objects in images is unreliable beyond ~10 items
- ▸Text in images: while OCR quality is good, small text, unusual fonts, and low contrast reduce accuracy significantly
- ▸Hallucinated image content: models will confidently describe details that are not in the image, especially for ambiguous scenes
- ▸Video understanding: Gemini processes video natively; most others require frame extraction, losing temporal context
- ▸Real-time processing: vision adds 1-3 seconds of latency per image, making real-time video analysis impractical
- ▸Cost: image processing is 5-20x more expensive per 'unit of information' than text processing
Models will describe objects, text, and details in images that do not exist. This is particularly dangerous for document processing -- the model might read a number as $5,000 when it says $50,000. Always validate critical information extracted from images against other sources or with human review.
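One cheap way to implement that validation step is an internal-consistency check: numbers extracted from the same document should agree with each other. The schema below (`line_items`, `amount`, `total`) is illustrative; adapt the field names to your own extraction output.

```python
def validate_invoice(extracted: dict, tolerance: float = 0.01) -> list[str]:
    """Cross-check numeric fields extracted from an invoice image.
    Returns a list of error strings; empty means the checks passed.
    Field names here are illustrative, not a standard schema."""
    errors: list[str] = []
    total = extracted.get("total")
    if total is None:
        return ["missing total"]
    # A misread total (e.g. $5,000 vs $50,000) usually disagrees with
    # the sum of the line items, so check them against each other.
    line_sum = sum(item["amount"] for item in extracted.get("line_items", []))
    if abs(line_sum - total) > tolerance:
        errors.append(f"line items sum to {line_sum}, but total reads {total}")
    if total < 0:
        errors.append(f"negative total: {total}")
    return errors
```

Checks like this catch the exact failure mode described above, because a hallucinated digit rarely stays consistent across every field it affects.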
Production Use Cases
| Use case | Input type | Best model | Accuracy range |
|---|---|---|---|
| Invoice processing | Scanned PDFs | Claude Sonnet 4.6 or Gemini 3.1 Pro | 85-95% field extraction |
| Content moderation | User-uploaded images | GPT-5.4 or Gemini 3 Flash | 90-98% for obvious violations |
| Accessibility (alt text) | Web images | GPT-5.4 | Good quality, needs human review |
| Chart data extraction | Screenshots of charts | Claude Sonnet 4.6 | 70-90% depending on chart type |
| ID verification | Photos of documents | GPT-5.4 or specialized models | 85-95% but use specialized APIs for production |
| UI testing | App screenshots | GPT-5.4 or Claude Sonnet 4.6 | Good for layout, weak for pixel-perfect |
For production document processing, use vision models to augment traditional OCR, not replace it. Run traditional OCR (Tesseract, AWS Textract, Google Document AI) for text extraction, then use the LLM for understanding structure, extracting relationships, and handling edge cases that rule-based systems miss.
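A sketch of the augment-not-replace pattern: the OCR engine produces raw text, and the LLM only structures it. The prompt builder below is the testable core; the commented wiring shows where a real OCR engine (here `pytesseract`, which requires the Tesseract binary) would plug in.

```python
def build_structuring_prompt(ocr_text: str, fields: list[str]) -> str:
    """Ask the LLM to structure OCR output rather than read pixels itself.
    Explicitly telling the model not to guess reduces hallucinated fields."""
    return (
        "The following text was extracted by OCR from a scanned document.\n"
        f"Extract these fields as JSON: {', '.join(fields)}.\n"
        "Use null for any field you cannot find. Do not guess.\n\n"
        f"OCR TEXT:\n{ocr_text}"
    )

# Illustrative end-to-end wiring (not run here):
#   import pytesseract
#   from PIL import Image
#   text = pytesseract.image_to_string(Image.open("invoice.png"))
#   prompt = build_structuring_prompt(text, ["invoice_number", "total", "due_date"])
#   ... send `prompt` (optionally alongside the original image) to the LLM ...
```

Sending both the OCR text and the original image gives the model a cross-reference, which tends to handle the edge cases rule-based systems miss.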
Extracting Structured Data from Images
OpenAI's 'detail' parameter lets you choose between 'low' (~85 tokens, 512x512) and 'high' (~1700 tokens, up to 2048x2048). For simple tasks like classification or reading large text, 'low' is sufficient and 20x cheaper. Use 'high' only when you need to read small text or analyze fine visual details.
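Putting the pieces together, here is a hedged sketch of a structured-extraction request in the OpenAI-style content-parts format, combining the 'detail' parameter with JSON mode. The model name is a placeholder, and other providers use different payload shapes; treat this as a template, not a spec.

```python
import base64

def build_extraction_request(image_bytes: bytes, fields: list[str],
                             detail: str = "high") -> dict:
    """Request payload for structured extraction from an image, in the
    OpenAI-style chat format. Model name is a placeholder."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "YOUR_VISION_MODEL",                 # placeholder -- substitute yours
        "response_format": {"type": "json_object"},   # JSON mode for reliable parsing
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract these fields as JSON, using null when a "
                         f"field is not visible: {', '.join(fields)}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}",
                               "detail": detail}},   # 'low' or 'high'
            ],
        }],
    }
```

The payload would then be sent via your SDK's chat-completion call, and the JSON response validated before use, per the hallucination warning above.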
Best Practices
Do
- ✓Use vision models to augment, not replace, traditional OCR and image processing
- ✓Set image detail to 'low' when high resolution is not needed to save 10-20x on token cost
- ✓Validate critical data extracted from images -- vision hallucinations are common
- ✓Use structured output (JSON mode) when extracting data from images for reliable parsing
- ✓Consider the transcribe-first approach for audio: Whisper + text LLM often beats native audio processing
Don’t
- ✗Don't send high-resolution images when low resolution is sufficient for the task
- ✗Don't trust vision models for precise counting, spatial reasoning, or small text extraction without validation
- ✗Don't process video by sending every frame -- sample key frames or use Gemini's native video support
- ✗Don't assume all vision models have equal capabilities -- test with your specific image types
- ✗Don't build safety-critical systems (medical imaging, autonomous driving) on general-purpose vision LLMs
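The key-frame sampling advice above reduces to picking evenly spaced frame indices instead of sending every frame. A minimal helper (the frame decoding itself, e.g. via OpenCV's `VideoCapture`, is left out):

```python
def sample_frame_indices(total_frames: int, fps: float,
                         seconds_between_samples: float = 2.0) -> list[int]:
    """Evenly spaced frame indices to extract for vision-model input.
    At 30 fps, a 2-second interval keeps 1 frame in every 60."""
    step = max(1, round(fps * seconds_between_samples))
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps (300 frames) yields 5 frames to send
# instead of 300 -- a 60x reduction in image-token cost.
```

Two seconds between samples is an assumed starting point; tighten the interval for fast-changing content, or use scene-change detection when uniform sampling misses key moments.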
Key Takeaways
- ✓Vision-capable LLMs can analyze images, charts, documents, and screenshots with impressive but imperfect accuracy.
- ✓Images cost 500-2000 tokens each -- use low-detail mode when possible to save significant cost.
- ✓For audio processing, the transcribe-first approach (Whisper + text LLM) usually outperforms native audio input.
- ✓Vision hallucinations are worse than text hallucinations -- always validate extracted data for critical applications.
- ✓Use multimodal models to augment specialized tools (OCR, document AI), not replace them entirely.