LangChain/Models
Intermediate · 7 min

Multimodal Input

Pass images, audio, and files to multimodal models using content blocks. Build mixed text-and-image messages with HumanMessage content arrays — no special model class required.

Quick Reference

  • Content blocks: {'type': 'text', 'text': '...'} and {'type': 'image_url', 'image_url': {'url': '...'}}
  • HumanMessage(content=[text_block, image_block]) for mixed input
  • Pass base64 images: 'data:image/jpeg;base64,...' as the url value
  • Use {'type': 'image_url', 'image_url': {'url': url, 'detail': 'high'}} for detail control (OpenAI)
  • Audio input: {'type': 'input_audio', 'input_audio': {'data': base64, 'format': 'wav'}}
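The bullets above can be combined into a single message. Below is a minimal sketch of building a mixed text-and-image content list from base64 data; the image bytes are a placeholder standing in for a real file, and the commented invocation assumes a chat model such as langchain_openai's ChatOpenAI:

```python
import base64

# Placeholder bytes standing in for a real image file's contents.
image_bytes = b"\x89PNG\r\n\x1a\nfake-png-data"
b64 = base64.b64encode(image_bytes).decode("ascii")

content = [
    {"type": "text", "text": "Describe this image."},
    {
        "type": "image_url",
        # base64 images go in a data URL; 'detail' is an OpenAI-specific option.
        "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"},
    },
]

# Passed to a multimodal chat model as, e.g.:
#   from langchain_core.messages import HumanMessage
#   model.invoke([HumanMessage(content=content)])
```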

Content Blocks

Multimodal models accept a list of content blocks instead of a plain string. Each block has a type and type-specific fields. LangChain normalizes these across providers — the same structure works with GPT-4o, Claude, and Gemini, with minor provider-specific options.
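Each block follows the same pattern of a type plus type-specific fields. As a sketch, an OpenAI-style audio input block sits alongside a text block in the same list (the audio bytes here are a placeholder, not a real WAV file):

```python
import base64

# Placeholder bytes standing in for a real WAV file's contents.
audio_bytes = b"RIFF....WAVEfake-audio-data"
b64 = base64.b64encode(audio_bytes).decode("ascii")

content = [
    {"type": "text", "text": "Transcribe this recording."},
    # 'input_audio' carries raw base64 data plus a format field,
    # rather than a URL like image blocks do.
    {"type": "input_audio", "input_audio": {"data": b64, "format": "wav"}},
]
```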

Block type     Fields                                               Providers
text           type, text                                           All
image_url      type, image_url.url, image_url.detail (optional)     OpenAI, Anthropic, Gemini
input_audio    type, input_audio.data (base64), input_audio.format  OpenAI
document       type, source.type, source.data, source.media_type    Anthropic (PDF support)
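The document row in the table maps to an Anthropic-style block for PDF input. A minimal sketch, with placeholder bytes standing in for a real PDF and 'base64' assumed as the source type:

```python
import base64

# Placeholder bytes standing in for a real PDF's contents.
pdf_bytes = b"%PDF-1.4 fake-pdf-data"
b64 = base64.b64encode(pdf_bytes).decode("ascii")

content = [
    {"type": "text", "text": "Summarize this PDF."},
    # Anthropic-style document block: nested 'source' with media type and data.
    {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": b64,
        },
    },
]
```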