Intermediate · 10 min

Multimodal Models

How modern LLMs process images, audio, and video alongside text. Covers vision capabilities (image understanding, OCR, diagram analysis), audio features, current limitations, practical use cases, and working code examples for extracting structured data from images.

Quick Reference

  • Vision: GPT-5.4/GPT-5, Claude Sonnet 4.6, Gemini 2.5/3.1 all support image input with varying capabilities
  • Images are encoded as visual tokens -- a typical image uses 500-2000 tokens of context
  • OCR quality from vision models now rivals dedicated OCR engines for most document types
  • Audio: Gemini natively processes audio; OpenAI Whisper handles transcription separately
  • Video: Gemini models process video natively; others require frame extraction
  • Use cases: document processing, content moderation, accessibility, visual data extraction

Vision Capabilities

Vision-capable LLMs can analyze images alongside text, enabling tasks like document understanding, diagram analysis, chart reading, and visual question answering. Under the hood, images are tokenized into visual tokens using a vision encoder (typically a ViT variant), then these visual tokens are processed alongside text tokens in the transformer.

| Model | Max images | Image tokens | Resolution handling | Strengths |
|-------|------------|--------------|---------------------|-----------|
| GPT-5.4 / GPT-5 | Multiple | ~85-1700 per image | Auto-selects detail level | Charts, diagrams, UI screenshots |
| Claude Sonnet 4.6 | Multiple | ~1000-1600 per image | Up to 1568x1568px | Dense document OCR, technical diagrams |
| Gemini 3.1 Pro | 3600+ images | ~250-3000 per image | Flexible | Large image sets, video frames, PDFs |
| Llama 4 Scout | Multiple | Varies | Flexible | Multimodal MoE, self-hostable, 10M context |
Image token cost

Images are expensive in tokens. A single high-resolution image can consume 1000-2000 tokens of your context window. At GPT-5.4 pricing, one image costs about $0.002-0.004 to process. If you are processing thousands of images, this adds up quickly. Consider whether you need high resolution or if lower detail settings suffice.
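To see how quickly that adds up, here is a back-of-envelope batch estimator using the per-image cost range quoted above (treat the figures as an assumption for planning, not published pricing):

```python
# Rough batch-cost estimator for image processing.
# The $0.002-$0.004 per-image range is the figure quoted in this
# article for GPT-5.4; it is an assumption, not an official price list.

LOW_COST_PER_IMAGE = 0.002   # dollars, low end of the quoted range
HIGH_COST_PER_IMAGE = 0.004  # dollars, high end of the quoted range

def batch_image_cost(num_images: int) -> tuple[float, float]:
    """Return a (low, high) dollar estimate for processing a batch of
    images, before any output-token cost."""
    return (
        round(num_images * LOW_COST_PER_IMAGE, 2),
        round(num_images * HIGH_COST_PER_IMAGE, 2),
    )

# 10,000 images lands somewhere around $20-$40 of input cost alone.
```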

  • Chart and graph reading: models can extract data points, trends, and labels from charts with good accuracy
  • UI/screenshot analysis: useful for automated testing, accessibility auditing, and design review
  • Document OCR: extracts text from scanned documents, handwriting, and complex layouts
  • Diagram understanding: interprets flowcharts, architecture diagrams, and technical schematics
  • Photo analysis: describes scenes, identifies objects, reads signs and labels

Audio Capabilities

Audio processing in LLMs is evolving rapidly. Gemini models process audio natively, while OpenAI splits audio across two products: Whisper for transcription and the Realtime API for voice interaction. The approaches have different trade-offs.

| Approach | Provider | Capabilities | Latency |
|----------|----------|--------------|---------|
| Native audio input | Gemini 3.1 Pro | Understands speech, music, sounds directly | Low (single model pass) |
| Whisper + LLM | OpenAI | Transcribe first, then process text | Medium (two-step) |
| Realtime API | OpenAI | Voice-to-voice conversation, low latency | Very low (~300ms) |
| Claude voice | Anthropic | Voice mode available in Claude consumer apps | Low |
The transcribe-first approach often wins

For most non-real-time applications, transcribing audio with Whisper ($0.006/minute) and then processing the text with your preferred LLM gives better results than native audio processing. You get to choose your best text LLM, the transcript is reusable, and you can inspect/debug the intermediate text. Use native audio only when real-time interaction or acoustic understanding (tone, music) is required.
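A minimal sketch of the transcribe-first pipeline, assuming the official OpenAI Python SDK; the chat model name `gpt-5.4` is the one used in this article, and the cost helper uses the $0.006/minute Whisper rate quoted above:

```python
# Transcribe-first audio pipeline: Whisper for speech-to-text, then any
# text LLM for the actual task. The SDK import is lazy so the cost
# helper below works without the openai package installed.

WHISPER_RATE_PER_MINUTE = 0.006  # Whisper pricing quoted in this article

def transcription_cost(minutes: float) -> float:
    """Estimated Whisper cost in dollars for an audio file."""
    return round(minutes * WHISPER_RATE_PER_MINUTE, 4)

def transcribe_then_process(audio_path: str, instruction: str) -> str:
    from openai import OpenAI  # lazy import, see note above
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Step 1: transcribe. The transcript is plain text you can log,
    # cache, inspect, and reuse with any model later.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # Step 2: process the transcript with your preferred text LLM.
    response = client.chat.completions.create(
        model="gpt-5.4",  # model name taken from this article
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

# A 30-minute call costs about $0.18 to transcribe before LLM processing.
```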

Current Limitations

Multimodal capabilities are impressive but have significant limitations that you must understand before building production systems around them.

  • Spatial reasoning: models struggle with precise spatial relationships ('is the red box above or below the blue circle?')
  • Counting: counting objects in images is unreliable beyond ~10 items
  • Text in images: while OCR quality is good, small text, unusual fonts, and low contrast reduce accuracy significantly
  • Hallucinated image content: models will confidently describe details that are not in the image, especially for ambiguous scenes
  • Video understanding: Gemini processes video natively; most others require frame extraction, losing temporal context
  • Real-time processing: vision adds 1-3 seconds of latency per image, making real-time video analysis impractical
  • Cost: image processing is 5-20x more expensive per 'unit of information' than text processing
Vision hallucinations are worse than text hallucinations

Models will describe objects, text, and details in images that do not exist. This is particularly dangerous for document processing -- the model might read a number as $5,000 when it says $50,000. Always validate critical information extracted from images against other sources or with human review.
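One cheap guard against misread numbers is arithmetic consistency checking: if extracted line items do not sum to the stated total, route the document to human review. A minimal sketch, with hypothetical field names:

```python
# Arithmetic consistency check for vision-extracted invoice data.
# The field names are hypothetical; adapt them to your extraction schema.

def totals_consistent(line_items: list[float], stated_total: float,
                      tolerance: float = 0.01) -> bool:
    """Return True if the extracted line items sum to the stated total.

    A mismatch is a strong signal the model misread a number somewhere
    and the document should go to human review."""
    return abs(sum(line_items) - stated_total) <= tolerance

# A model that reads $5,000 where the page says $50,000 usually trips
# this check, because the other extracted fields still sum correctly.
assert totals_consistent([19.99, 5.00], 24.99)
assert not totals_consistent([19.99, 5.00], 249.99)
```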

Production Use Cases

| Use case | Input type | Best model | Accuracy range |
|----------|------------|------------|----------------|
| Invoice processing | Scanned PDFs | Claude Sonnet 4.6 or Gemini 3.1 Pro | 85-95% field extraction |
| Content moderation | User-uploaded images | GPT-5.4 or Gemini 3 Flash | 90-98% for obvious violations |
| Accessibility (alt text) | Web images | GPT-5.4 | Good quality, needs human review |
| Chart data extraction | Screenshots of charts | Claude Sonnet 4.6 | 70-90% depending on chart type |
| ID verification | Photos of documents | GPT-5.4 or specialized models | 85-95%; use specialized APIs for production |
| UI testing | App screenshots | GPT-5.4 or Claude Sonnet 4.6 | Good for layout, weak for pixel-perfect |
Vision as augmentation, not replacement

For production document processing, use vision models to augment traditional OCR, not replace it. Run traditional OCR (Tesseract, AWS Textract, Google Document AI) for text extraction, then use the LLM for understanding structure, extracting relationships, and handling edge cases that rule-based systems miss.
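A hedged sketch of that division of labor: the OCR engine remains the source of truth for the characters on the page, and the LLM is asked only to structure text that was already extracted. The prompt wording and field names below are illustrative:

```python
# Combine a traditional OCR pass with an LLM structuring pass.
# The OCR text would come from Tesseract, AWS Textract, or Google
# Document AI; the prompt and field names below are illustrative.

def build_structuring_prompt(ocr_text: str) -> str:
    """Ask the LLM to structure already-extracted text rather than read
    the image itself, so character-level accuracy stays with the OCR
    engine and the LLM handles layout and relationships."""
    return (
        "The following text was extracted from an invoice by an OCR "
        "engine. Return a JSON object with vendor, date, line_items "
        "(description, amount), and total. Use only text that appears "
        "below; if a field is missing, use null.\n\n" + ocr_text
    )
```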

Extracting Structured Data from Images

Extract structured data from a receipt image using GPT-5.4
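A minimal sketch of this extraction, assuming the official OpenAI Python SDK; `gpt-5.4` is the model name used in this article, and the receipt schema in the prompt is illustrative:

```python
# Extract structured data from a receipt image via the Chat Completions
# API. The helper builds the base64 data-URL content part the API
# expects; the SDK import is lazy so the helper stays stdlib-only.
import base64
import json

def image_content_part(image_bytes: bytes, detail: str = "high") -> dict:
    """Encode an image as a base64 data-URL content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                      "detail": detail},
    }

def extract_receipt(image_path: str) -> dict:
    from openai import OpenAI  # lazy import, see note above
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as f:
        part = image_content_part(f.read())
    response = client.chat.completions.create(
        model="gpt-5.4",  # model name taken from this article
        response_format={"type": "json_object"},  # force parseable JSON
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract merchant, date, line items (name, "
                         "price), and total from this receipt as JSON."},
                part,
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```

Requesting JSON mode plus validating the result (see the consistency check earlier) covers the two most common failure modes: unparseable output and misread numbers.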
Process multiple images with Anthropic Claude
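A minimal sketch of a multi-image request, assuming the official Anthropic Python SDK; the model identifier `claude-sonnet-4-6` is an assumption based on the model name used in this article:

```python
# Send several images in one request via the Anthropic Messages API.
# The helper builds one image content block; the SDK import is lazy so
# the helper stays stdlib-only.
import base64

def claude_image_block(image_bytes: bytes,
                       media_type: str = "image/png") -> dict:
    """Build an image content block in the Messages API format."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        },
    }

def ask_about_images(image_paths: list[str], question: str) -> str:
    import anthropic  # lazy import, see note above
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    blocks = []
    for path in image_paths:
        with open(path, "rb") as f:
            blocks.append(claude_image_block(f.read()))
    blocks.append({"type": "text", "text": question})
    response = client.messages.create(
        model="claude-sonnet-4-6",  # assumed model identifier
        max_tokens=1024,
        messages=[{"role": "user", "content": blocks}],
    )
    return response.content[0].text
```

Interleaving all images before the question lets the model compare pages of a document set in a single pass, at the token cost per image shown in the table above.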
Image detail levels save cost

OpenAI's 'detail' parameter lets you choose between 'low' (~85 tokens, 512x512) and 'high' (~1700 tokens, up to 2048x2048). For simple tasks like classification or reading large text, 'low' is sufficient and 20x cheaper. Use 'high' only when you need to read small text or analyze fine visual details.
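The trade-off above reduces to a simple rule of thumb, sketched here with the token figures quoted in this section:

```python
# Token cost of OpenAI's image 'detail' setting, per the figures above.
DETAIL_TOKENS = {"low": 85, "high": 1700}

def detail_for_task(needs_fine_text: bool) -> str:
    """Pick 'low' for classification or reading large text; reserve
    'high' for small text and fine visual detail."""
    return "high" if needs_fine_text else "low"

# Classifying a photo or reading a dashboard headline: 'low' is ~20x
# cheaper than 'high'.
assert DETAIL_TOKENS["high"] // DETAIL_TOKENS["low"] == 20
```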

Best Practices

Do

  • Use vision models to augment, not replace, traditional OCR and image processing
  • Set image detail to 'low' when high resolution is not needed to save 10-20x on token cost
  • Validate critical data extracted from images -- vision hallucinations are common
  • Use structured output (JSON mode) when extracting data from images for reliable parsing
  • Consider the transcribe-first approach for audio: Whisper + text LLM often beats native audio processing

Don’t

  • Don't send high-resolution images when low resolution is sufficient for the task
  • Don't trust vision models for precise counting, spatial reasoning, or small text extraction without validation
  • Don't process video by sending every frame -- sample key frames or use Gemini's native video support
  • Don't assume all vision models have equal capabilities -- test with your specific image types
  • Don't build safety-critical systems (medical imaging, autonomous driving) on general-purpose vision LLMs

Key Takeaways

  • Vision-capable LLMs can analyze images, charts, documents, and screenshots with impressive but imperfect accuracy.
  • Images cost 500-2000 tokens each -- use low-detail mode when possible to save significant cost.
  • For audio processing, the transcribe-first approach (Whisper + text LLM) usually outperforms native audio input.
  • Vision hallucinations are worse than text hallucinations -- always validate extracted data for critical applications.
  • Use multimodal models to augment specialized tools (OCR, document AI), not replace them entirely.

Video on this topic

Using AI to read images: what works and what doesn't (TikTok)