Intermediate · 7 min
Multimodal Input
Pass images, audio, and files to multimodal models using content blocks. Build mixed text-and-image messages with HumanMessage content arrays — no special model class required.
Quick Reference
- Content blocks: {'type': 'text', 'text': '...'} and {'type': 'image_url', 'image_url': {'url': '...'}}
- HumanMessage(content=[text_block, image_block]) for mixed input
- Pass base64 images: 'data:image/jpeg;base64,...' as the url value
- Use {'type': 'image_url', 'image_url': {'url': url, 'detail': 'high'}} for detail control (OpenAI)
- Audio input: {'type': 'input_audio', 'input_audio': {'data': base64, 'format': 'wav'}}
Content Blocks
Multimodal models accept a list of content blocks instead of a plain string. Each block has a type and type-specific fields. LangChain normalizes these across providers — the same structure works with GPT-4o, Claude, and Gemini, with minor provider-specific options.
| Block type | Fields | Providers |
|---|---|---|
| text | type, text | All |
| image_url | type, image_url.url, image_url.detail (optional) | OpenAI, Anthropic, Gemini |
| input_audio | type, input_audio.data (base64), input_audio.format | OpenAI |
| document | type, source.type, source.data, source.media_type | Anthropic (PDF support) |
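For local files, the image_url block accepts a base64 data URL instead of a remote URL, as noted in the quick reference. Below is a sketch of building such a block from raw bytes; the helper name, the placeholder bytes, and the media type are illustrative, not part of any library API.

```python
import base64

def image_block_from_bytes(raw: bytes, media_type: str = "image/jpeg") -> dict:
    """Build an image_url content block with the bytes inlined as base64."""
    b64 = base64.b64encode(raw).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{b64}"},
    }

# Placeholder bytes stand in for a real file read, e.g. open(path, "rb").read().
block = image_block_from_bytes(b"\x89PNG\r\n", media_type="image/png")
```

The resulting block drops into HumanMessage(content=[...]) exactly like a remote-URL block, so the same code path handles hosted and local images.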