Intermediate · 12 min

Multimodal Input

Pass images, audio, PDFs, and video to multimodal models using LangChain's standard content blocks. LangChain v1 introduced a provider-agnostic format that works across GPT-4o, Claude, and Gemini — this article covers both the old provider-native format and the new standard, a capability matrix across providers, and a production router that sends each modality to the right model. For deciding when multimodal is the right tool and what it costs, see Multimodal Models.

Quick Reference

  • Standard block (image URL): {'type': 'image', 'url': 'https://...'} — works across OpenAI, Anthropic, Gemini
  • Standard block (base64 image): {'type': 'image', 'base64': '...', 'mime_type': 'image/jpeg'}
  • Standard block (file/PDF): {'type': 'file', 'base64': '...', 'mime_type': 'application/pdf'}
  • Standard block (audio): {'type': 'audio', 'base64': '...', 'mime_type': 'audio/wav'}
  • Standard block (video): {'type': 'video', 'base64': '...', 'mime_type': 'video/mp4'} — Gemini only
  • HumanMessage(content_blocks=[text_block, image_block]) — use content_blocks= for standard format
  • Only Gemini 3.1 Pro accepts native video input; extract frames for other providers
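
The block shapes listed above can be built with a few small helpers. This is a minimal sketch using only the standard library; the helper names (`block_type_for`, `text_block`, `url_image_block`, `base64_block`, `file_to_block`) are hypothetical and not part of LangChain, and the `{'type': 'text', 'text': ...}` shape for the text block is an assumption about the standard text block, since the list above does not spell it out.

```python
import base64
import mimetypes
from pathlib import Path


def block_type_for(mime_type: str) -> str:
    """Map a MIME type to a standard block type (hypothetical helper)."""
    major = mime_type.split("/", 1)[0]
    # image/*, audio/*, video/* map to blocks of the same name;
    # everything else (e.g. application/pdf) is sent as a 'file' block.
    return major if major in {"image", "audio", "video"} else "file"


def text_block(text: str) -> dict:
    """Assumed shape of a standard text block."""
    return {"type": "text", "text": text}


def url_image_block(url: str) -> dict:
    """Standard image block that references a URL."""
    return {"type": "image", "url": url}


def base64_block(data: bytes, mime_type: str) -> dict:
    """Standard block carrying inline base64 data."""
    return {
        "type": block_type_for(mime_type),
        "base64": base64.b64encode(data).decode("ascii"),
        "mime_type": mime_type,
    }


def file_to_block(path: str) -> dict:
    """Read a local file and wrap it in the matching standard block."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"Cannot infer MIME type for {path!r}")
    return base64_block(Path(path).read_bytes(), mime_type)


# With langchain-core v1 installed, the blocks would be passed like this:
#   from langchain_core.messages import HumanMessage
#   message = HumanMessage(content_blocks=[
#       text_block("Describe this image."),
#       url_image_block("https://example.com/cat.jpg"),  # placeholder URL
#   ])
```

Keeping block construction in plain functions means the same message-building code can be reused whichever chat model receives it.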

Content Blocks: Provider-Native vs Standard

LangChain v1 introduced a standard content block format that normalizes multimodal input across providers. You can still use the old provider-native dict format — it works — but it locks you to one provider. The standard format is portable: the same block structure works with GPT-4o, Claude, and Gemini without modification.

Old format (provider-native, OpenAI-specific) — still works, but non-portable
New standard format — identical code works with GPT-4o, Claude, or Gemini
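
As a rough illustration of the two formats named above, the sketch below builds the same text-plus-image request both ways. The function names are hypothetical, and `to_standard` is an illustrative converter for OpenAI-style `image_url` blocks only, not a LangChain API.

```python
def openai_native_content(prompt: str, image_url: str) -> list[dict]:
    """Old provider-native format (OpenAI Chat Completions shape)."""
    return [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]


def standard_content(prompt: str, image_url: str) -> list[dict]:
    """New standard format: the same blocks for any supported provider."""
    return [
        {"type": "text", "text": prompt},
        {"type": "image", "url": image_url},
    ]


def to_standard(block: dict) -> dict:
    """Convert an OpenAI image_url block to a standard image block.

    Hypothetical helper; blocks of any other type are returned unchanged.
    """
    if block.get("type") != "image_url":
        return block
    url = block["image_url"]["url"]
    if url.startswith("data:"):
        # data:<mime_type>;base64,<payload>
        header, _, payload = url.partition(",")
        mime_type = header[len("data:"):].split(";", 1)[0]
        return {"type": "image", "base64": payload, "mime_type": mime_type}
    return {"type": "image", "url": url}


# With langchain-core v1 installed, the two variants would be sent as:
#   from langchain_core.messages import HumanMessage
#   old = HumanMessage(content=openai_native_content("What is this?", url))
#   new = HumanMessage(content_blocks=standard_content("What is this?", url))
```

Only the image block differs between the two lists, which is why migrating existing code is mostly a matter of rewriting that one dict.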
[Figure: LangChain v1 normalization. Provider-specific output types (Anthropic "thinking", OpenAI "reasoning", Google "citations") are normalized into .content_blocks (text · reasoning · citation · tool_call · image · …), the same API regardless of provider. .text returns concatenated text only; .content (legacy) still works, and a v1 environment variable can be set to replace it with blocks.]

Provider formats are normalized into standard content_blocks — .text and .content are views over the same data

Old provider-native type | New standard type | Fields
image_url (OpenAI) / image (Anthropic) | image | url or base64 + mime_type
input_audio (OpenAI) | audio | base64 + mime_type (audio/wav, audio/mp3)
document (Anthropic) | file | base64 + mime_type (application/pdf)
none (Gemini only) | video | base64 + mime_type (video/mp4)

When to use provider-native vs standard

Use standard blocks for all new code — they're portable and future-proof. Use the provider-native format only when you need a provider-specific feature that the standard format doesn't expose, such as OpenAI's 'detail' parameter on image_url (high/low/auto tile-based resolution). You can mix both in the same codebase.
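
One way to apply this rule is to default to the standard block and fall back to the OpenAI-native block only when a `detail` level is requested. The sketch below assumes that policy; both helper names are hypothetical, and the native dict follows OpenAI's `image_url` shape with its `detail` field.

```python
from typing import Optional

VALID_DETAIL = {"low", "high", "auto"}


def openai_image_with_detail(url: str, detail: str = "auto") -> dict:
    """Provider-native OpenAI image block carrying the 'detail' parameter."""
    if detail not in VALID_DETAIL:
        raise ValueError(f"detail must be one of {sorted(VALID_DETAIL)}")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def image_block(url: str, detail: Optional[str] = None) -> dict:
    """Return a portable standard block unless 'detail' is requested.

    Hypothetical policy helper: asking for a detail level ties the
    resulting block to OpenAI.
    """
    if detail is None:
        return {"type": "image", "url": url}
    return openai_image_with_detail(url, detail)


# With langchain-core v1 installed, native blocks would go through content=
# and standard blocks through content_blocks=, for example:
#   from langchain_core.messages import HumanMessage
#   native = HumanMessage(content=[image_block(url, detail="low")])
#   portable = HumanMessage(content_blocks=[image_block(url)])
```

Routing every image through one helper keeps the provider-specific case visible in a single place, so it is easy to find if the standard format later exposes an equivalent option.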