LangChain/Models
Intermediate · 7 min

Multimodal Input

Pass images, audio, and files to multimodal models using content blocks. Build mixed text-and-image messages with HumanMessage content arrays — no special model class required.

Quick Reference

  • Content blocks: {'type': 'text', 'text': '...'} and {'type': 'image_url', 'image_url': {'url': '...'}}
  • HumanMessage(content=[text_block, image_block]) for mixed input
  • Pass base64 images: 'data:image/jpeg;base64,...' as the url value
  • Use {'type': 'image_url', 'image_url': {'url': url, 'detail': 'high'}} for detail control (OpenAI)
  • Audio input: {'type': 'input_audio', 'input_audio': {'data': base64, 'format': 'wav'}}
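The bullets above can be combined into a single message. Below is a minimal sketch of building a mixed text-and-image content list from base64 data; the image bytes are a placeholder standing in for a real file, and the commented invocation assumes a chat model such as langchain_openai's ChatOpenAI:

```python
import base64

# Placeholder bytes standing in for a real image file's contents.
image_bytes = b"\x89PNG\r\n\x1a\nfake-png-data"
b64 = base64.b64encode(image_bytes).decode("ascii")

content = [
    {"type": "text", "text": "Describe this image."},
    {
        "type": "image_url",
        # base64 images go in a data URL; 'detail' is an OpenAI-specific option.
        "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"},
    },
]

# Passed to a multimodal chat model as, e.g.:
#   from langchain_core.messages import HumanMessage
#   model.invoke([HumanMessage(content=content)])
```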

Content Blocks

Multimodal models accept a list of content blocks instead of a plain string. Each block has a type and type-specific fields. LangChain normalizes these across providers — the same structure works with GPT-4o, Claude, and Gemini, with minor provider-specific options.
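Each block follows the same pattern of a type plus type-specific fields. As a sketch, an OpenAI-style audio input block sits alongside a text block in the same list (the audio bytes here are a placeholder, not a real WAV file):

```python
import base64

# Placeholder bytes standing in for a real WAV file's contents.
audio_bytes = b"RIFF....WAVEfake-audio-data"
b64 = base64.b64encode(audio_bytes).decode("ascii")

content = [
    {"type": "text", "text": "Transcribe this recording."},
    # 'input_audio' carries raw base64 data plus a format field,
    # rather than a URL like image blocks do.
    {"type": "input_audio", "input_audio": {"data": b64, "format": "wav"}},
]
```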

Block type     Fields                                               Providers
text           type, text                                           All
image_url      type, image_url.url, image_url.detail (optional)     OpenAI, Anthropic, Gemini
input_audio    type, input_audio.data (base64), input_audio.format  OpenAI
document       type, source.type, source.data, source.media_type    Anthropic (PDF support)
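The document row in the table maps to an Anthropic-style block for PDF input. A minimal sketch, with placeholder bytes standing in for a real PDF and 'base64' assumed as the source type:

```python
import base64

# Placeholder bytes standing in for a real PDF's contents.
pdf_bytes = b"%PDF-1.4 fake-pdf-data"
b64 = base64.b64encode(pdf_bytes).decode("ascii")

content = [
    {"type": "text", "text": "Summarize this PDF."},
    # Anthropic-style document block: nested 'source' with media type and data.
    {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": b64,
        },
    },
]
```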