Multimodal Input
Pass images, audio, PDFs, and video to multimodal models using LangChain's standard content blocks. LangChain v1 introduced a provider-agnostic format that works across GPT-4o, Claude, and Gemini. This article covers the old provider-native format and the new standard, a capability matrix across providers, and a production router that sends each modality to the right model. To decide when multimodal is the right tool and what it costs, see Multimodal Models.
Quick Reference
- Standard block (image URL): {'type': 'image', 'url': 'https://...'} (works across OpenAI, Anthropic, Gemini)
- Standard block (base64 image): {'type': 'image', 'base64': '...', 'mime_type': 'image/jpeg'}
- Standard block (file/PDF): {'type': 'file', 'base64': '...', 'mime_type': 'application/pdf'}
- Standard block (audio): {'type': 'audio', 'base64': '...', 'mime_type': 'audio/wav'}
- Standard block (video): {'type': 'video', 'base64': '...', 'mime_type': 'video/mp4'} (Gemini only)
- HumanMessage(content_blocks=[text_block, image_block]): use content_blocks= for the standard format
- Only Gemini 3.1 Pro accepts native video input; extract frames for other providers
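The block shapes in the list above are plain dictionaries, so they can be built with the standard library alone. The following is a minimal sketch: the helper names (text_block, image_url_block, media_block) are hypothetical and not part of LangChain, and the text block shape {'type': 'text', 'text': ...} is assumed to be the standard text block.

```python
import base64

# Block types listed in the Quick Reference above.
_MEDIA_TYPES = {"image", "file", "audio", "video"}


def text_block(text: str) -> dict:
    """Build a text block (assumed shape for the standard text block)."""
    return {"type": "text", "text": text}


def image_url_block(url: str) -> dict:
    """Build a standard image block that points at a URL."""
    return {"type": "image", "url": url}


def media_block(kind: str, data: bytes, mime_type: str) -> dict:
    """Build a base64 standard block for an image, file, audio, or video payload."""
    if kind not in _MEDIA_TYPES:
        raise ValueError(f"unsupported block type: {kind!r}")
    return {
        "type": kind,
        "base64": base64.b64encode(data).decode("ascii"),
        "mime_type": mime_type,
    }


# Example: a prompt plus one image by URL and one inline PDF payload.
blocks = [
    text_block("Summarize the attached document and describe the image."),
    image_url_block("https://example.com/cat.jpg"),  # placeholder URL
    media_block("file", b"%PDF-1.4 placeholder bytes", "application/pdf"),
]
```

With LangChain installed, a list like blocks is what the HumanMessage(content_blocks=...) call from the list above would receive.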
Content Blocks: Provider-Native vs Standard
LangChain v1 introduced a standard content block format that normalizes multimodal input across providers. The old provider-native dict format still works, but it locks you to one provider. The standard format is portable: the same block structure works with GPT-4o, Claude, and Gemini without modification.
Provider formats are normalized into standard content_blocks — .text and .content are views over the same data
| Old provider-native type | New standard type | Fields |
|---|---|---|
| image_url (OpenAI) / image (Anthropic) | image | url or base64 + mime_type |
| input_audio (OpenAI) | audio | base64 + mime_type (audio/wav, audio/mp3) |
| document (Anthropic) | file | base64 + mime_type (application/pdf) |
| — (Gemini only) | video | base64 + mime_type (video/mp4) |
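
To make the mapping in the table concrete, here is a hand-written sketch that converts the provider-native dicts into the standard shapes. It is illustrative only and is not LangChain's own normalization code. The function name to_standard_block is hypothetical, the native shapes follow the OpenAI and Anthropic request formats, and data URLs are assumed to be base64-encoded.

```python
def to_standard_block(block: dict) -> dict:
    """Map a provider-native content dict to the standard block shape.

    Blocks that are not recognized (including blocks that are already
    in the standard format) are returned unchanged.
    """
    kind = block.get("type")

    if kind == "image_url":  # OpenAI image
        url = block["image_url"]["url"]
        if url.startswith("data:"):
            # Assumes a base64 data URL: data:<mime>;base64,<payload>
            header, _, payload = url.partition(",")
            mime_type = header[len("data:"):].split(";")[0]
            return {"type": "image", "base64": payload, "mime_type": mime_type}
        return {"type": "image", "url": url}

    if kind == "input_audio":  # OpenAI audio
        audio = block["input_audio"]
        return {
            "type": "audio",
            "base64": audio["data"],
            "mime_type": f"audio/{audio['format']}",
        }

    source = block.get("source")
    if kind in ("image", "document") and isinstance(source, dict):  # Anthropic
        std_type = "image" if kind == "image" else "file"
        if source.get("type") == "base64":
            return {
                "type": std_type,
                "base64": source["data"],
                "mime_type": source["media_type"],
            }
        if source.get("type") == "url" and kind == "image":
            return {"type": "image", "url": source["url"]}

    return block
```

For example, an Anthropic document block carrying a base64 PDF comes back as a file block with mime_type application/pdf, matching the third row of the table.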
Use standard blocks for all new code — they're portable and future-proof. Use the provider-native format only when you need a provider-specific feature that the standard format doesn't expose, such as OpenAI's 'detail' parameter on image_url (high/low/auto tile-based resolution). You can mix both in the same codebase.
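One way to apply that rule is to default to the standard block and fall back to the OpenAI-native shape only when 'detail' is requested. This is a sketch under that assumption; image_block is a hypothetical helper, not a LangChain function.

```python
from typing import Optional

# Values accepted by OpenAI's 'detail' parameter, as listed above.
_DETAIL_LEVELS = {"low", "high", "auto"}


def image_block(url: str, detail: Optional[str] = None) -> dict:
    """Return a portable standard block, or an OpenAI-native block when detail is set."""
    if detail is None:
        # Portable: the same block works across providers.
        return {"type": "image", "url": url}
    if detail not in _DETAIL_LEVELS:
        raise ValueError(f"detail must be one of {sorted(_DETAIL_LEVELS)}")
    # Provider-native: ties this block to OpenAI models.
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
```

Keeping the choice in one helper makes the provider-specific code paths easy to find if the model is swapped later.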