Advanced · 14 min

Multimodal Pipelines

Multimodal pipelines add genuine value when layout, speaker identity, or visual content cannot be captured by text extraction alone — and add cost and hallucination risk when they can. This article covers the coordinator pattern, how to compute real costs, and how to defend against the specific failures that take these systems down in production.

Quick Reference

→GPT-5.4 image tokens: low detail = 85 tokens fixed (any resolution). High detail = ceil(W/32) × ceil(H/32) patches × 1.62 multiplier. A 768×768 image at high detail = 24 × 24 patches × 1.62 ≈ 933 tokens.
→Cost math: 1 min of video at 1 FPS = 60 frames. Low detail: 60 × 85 × $2.50/M = $0.013/min. High detail: 60 × 933 × $2.50/M = $0.14/min.
→Audio: Deepgram Nova-3 and Flux deliver sub-300ms streaming STT. AssemblyAI Universal-3 delivers ~300ms with speaker diarization built in. Whisper V4 adds native streaming and reaches ~3.2% WER.
→Coordinator pattern: raw input → modality router → parallel processors (asyncio.gather) → MultimodalContext → reasoning LLM. The LLM sees merged text, never raw media.
→Tiering: use GPT-5.4 Nano (~$0.21/1K images) as a cheap classifier; escalate only complex frames to GPT-5.4 (~$2.33/1K images at high detail). This cuts per-image cost roughly 10× on the simple scenes that never need escalation.
→When NOT to build a multimodal pipeline: if PDF parsing, OCR, or structured extraction can give the LLM what it needs, do that instead. In those cases, multimodal adds hallucination risk and cost without adding value.
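The coordinator pattern from the quick reference can be sketched roughly as follows. This is a minimal illustration, not a production design: the processor functions, the `PROCESSORS` registry, and the fields of `MultimodalContext` are hypothetical stand-ins, and real processors would call an STT or vision service instead of returning placeholder strings.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class MultimodalContext:
    """Merged, text-only view of the input that the reasoning LLM receives."""
    sections: dict[str, str] = field(default_factory=dict)

    def as_prompt(self) -> str:
        # Sort by modality name so the prompt is stable across runs.
        return "\n\n".join(
            f"[{name}]\n{text}" for name, text in sorted(self.sections.items())
        )


# Hypothetical processors: real ones would call an STT or vision service.
async def process_audio(payload: bytes) -> str:
    await asyncio.sleep(0)  # stands in for network I/O
    return f"transcript of {len(payload)} audio bytes"


async def process_frames(payload: bytes) -> str:
    await asyncio.sleep(0)
    return f"frame descriptions for {len(payload)} video bytes"


PROCESSORS = {"audio": process_audio, "video": process_frames}


async def coordinate(inputs: dict[str, bytes]) -> MultimodalContext:
    # Modality router: keep only modalities that have a registered processor.
    routed = {m: p for m, p in inputs.items() if m in PROCESSORS}
    # Run all processors concurrently. return_exceptions=True hands failures
    # back as values instead of raising the first one.
    results = await asyncio.gather(
        *(PROCESSORS[m](p) for m, p in routed.items()),
        return_exceptions=True,
    )
    context = MultimodalContext()
    for modality, result in zip(routed, results):
        if isinstance(result, Exception):
            context.sections[modality] = f"(processing failed: {result})"
        else:
            context.sections[modality] = result
    return context


context = asyncio.run(
    coordinate({"audio": b"\x00" * 4, "video": b"\x00" * 8, "lidar": b""})
)
print(context.as_prompt())
```

The unsupported `lidar` input is dropped by the router, and the reasoning model would only ever receive the string returned by `as_prompt()`, never the raw bytes.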

When to Build a Multimodal Pipeline

The most common mistake with multimodal AI is building it when you don't need it. If your input is a PDF, parse it to text. If it's a scanned form with known field positions, use OCR with a template. Multimodal pipelines are the right tool when the visual layout, speaker identity, or raw audio timing is itself the signal — not when it's just a delivery vehicle for text.

Start with OCR+Summarize for tables · ColPali for speed · vision-at-query-time for layout-heavy corpora

Text-first is almost always cheaper and more accurate

A vision model analyzing a structured invoice introduces hallucination risk on every field it reads. A PDF parser extracting the same invoice returns exact text deterministically at zero per-token cost. Only reach for multimodal when the document structure itself (charts, diagrams, handwriting, spatial layout) is the signal you need, not when text extraction is just harder to implement.

▸Use multimodal when: the content is a diagram, chart, or visual that cannot be meaningfully extracted as text — a flowchart, a screenshot, an architectural drawing.
▸Use multimodal when: speaker identity matters — a meeting transcript where you need 'what did the customer say about pricing?' instead of a flat wall of text.
▸Use multimodal when: the document contains handwriting, complex tables with merged cells, or visual layouts where position encodes meaning.
▸Use text extraction when: the document is a machine-generated PDF, a web page, or any content where `pdfplumber`, `pypdf`, or BeautifulSoup can give you clean text. Parsing (or OCR, for scans) is faster and cheaper than a vision model for this.
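The checklist above can be turned into a small routing function. The sketch below is one possible encoding, not a prescribed rule set: `DocumentProbe`, its fields, and the `MIN_CHARS_PER_PAGE` threshold are all assumptions made for illustration, and the probe values would come from whatever cheap inspection you already run (for example, character counts from a text parser).

```python
from dataclasses import dataclass


@dataclass
class DocumentProbe:
    """Cheap signals gathered before choosing a pipeline (hypothetical fields)."""
    extracted_chars_per_page: float      # e.g. from a text parser's output
    has_handwriting: bool = False
    layout_is_signal: bool = False       # charts, diagrams, merged-cell tables
    needs_speaker_identity: bool = False


MIN_CHARS_PER_PAGE = 200  # assumed threshold; tune it on your own corpus


def choose_pipeline(probe: DocumentProbe) -> str:
    if probe.needs_speaker_identity:
        return "audio+diarization"
    if probe.has_handwriting or probe.layout_is_signal:
        return "vision"
    if probe.extracted_chars_per_page >= MIN_CHARS_PER_PAGE:
        return "text-extraction"
    # Little or no embedded text: probably a scan, so try OCR before vision.
    return "ocr"


print(choose_pipeline(DocumentProbe(1500.0)))
print(choose_pipeline(DocumentProbe(1500.0, layout_is_signal=True)))
```

The order of the checks matters: the vision path is taken only when a multimodal signal is explicitly present, so the cheap text path remains the default.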

Video: Intelligent Frame Extraction

Sample frames, do not process every one

A 30 FPS video produces 1,800 frames per minute. At GPT-5.4 low-detail pricing (85 tokens/frame, $2.50/M), processing every frame costs $0.38/min before a single question is asked. Sample at 1–2 FPS for general understanding, and layer scene-change detection on top to catch abrupt transitions that fixed-rate sampling would miss.
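The arithmetic and the sampling strategy can be checked with a short sketch. It uses the token and price figures quoted in this article; rounding the high-detail token count to the nearest integer is an assumption, as are the function names. `select_frames` expects a precomputed per-frame difference score (for instance, mean absolute pixel difference from the previous frame), which is a simple stand-in for whatever scene-change detector you use.

```python
import math

PRICE_PER_MILLION_TOKENS = 2.50  # USD, the price quoted in this article
LOW_DETAIL_TOKENS = 85
HIGH_DETAIL_MULTIPLIER = 1.62
PATCH_SIZE = 32


def image_tokens(width: int, height: int, detail: str = "low") -> int:
    if detail == "low":
        return LOW_DETAIL_TOKENS  # fixed cost at any resolution
    patches = math.ceil(width / PATCH_SIZE) * math.ceil(height / PATCH_SIZE)
    return round(patches * HIGH_DETAIL_MULTIPLIER)  # rounding is an assumption


def cost_per_minute(fps: float, tokens_per_frame: int) -> float:
    frames = fps * 60
    return frames * tokens_per_frame * PRICE_PER_MILLION_TOKENS / 1_000_000


def select_frames(
    diff_scores: list[float],
    source_fps: int,
    sample_fps: float,
    scene_threshold: float,
) -> list[int]:
    """Fixed-rate sample indices plus any frame whose score marks a scene change."""
    step = max(1, round(source_fps / sample_fps))
    keep = set(range(0, len(diff_scores), step))
    keep.update(i for i, s in enumerate(diff_scores) if s > scene_threshold)
    return sorted(keep)


every_frame = cost_per_minute(30, image_tokens(1920, 1080, "low"))
sampled = cost_per_minute(1, image_tokens(1920, 1080, "low"))
print(f"every frame: ${every_frame:.4f}/min, 1 FPS: ${sampled:.4f}/min")

# Three seconds of 30 FPS video with one abrupt cut at frame 47.
scores = [0.0] * 90
scores[47] = 0.9
print(select_frames(scores, source_fps=30, sample_fps=1, scene_threshold=0.5))
```

With these inputs the fixed-rate sampler alone would keep frames 0, 30, and 60; the scene-change pass adds frame 47, which falls between two samples.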

Audio: Transcription and Diarization

Provider / Model	Latency / Speed	Diarization	Best For
Deepgram Nova-3	sub-300ms P50	Yes (built-in)	Live voice agents, real-time captioning
Deepgram Flux	sub-300ms P50	Yes (built-in)	Conversational STT with end-of-turn detection
AssemblyAI Universal-3	~300ms P50	Yes (built-in)	Accurate transcription + analytics in 16 languages
OpenAI gpt-4o-transcribe-diarize	~1s per 30s audio	Yes (built-in)	Diarized batch transcription without external tools
Whisper V4 (self-hosted)	~2–3× real-time on GPU	No (add pyannote 4.0)	Offline, air-gapped, or high-volume batch
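Whichever provider produces the diarized output, the downstream step is usually the same: collapse per-segment results into speaker turns so a question like "what did the customer say about pricing?" can be answered directly. The sketch below assumes a generic `Segment` shape; it is not any provider's actual response format, and the `max_gap` default is an arbitrary illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_turns(segments: list[Segment], max_gap: float = 1.0) -> list[Segment]:
    """Join consecutive segments from the same speaker into a single turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = turns[-1] if turns else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            last.end = max(last.end, seg.end)
            last.text = f"{last.text} {seg.text}"
        else:
            # Copy so the caller's segments are never mutated.
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


def said_by(turns: list[Segment], speaker: str, keyword: str) -> list[str]:
    """Turns by one speaker that mention the keyword (case-insensitive)."""
    return [
        t.text for t in turns
        if t.speaker == speaker and keyword.lower() in t.text.lower()
    ]


segments = [
    Segment("agent", 0.0, 2.0, "Thanks for joining."),
    Segment("customer", 2.4, 4.0, "The pricing feels high"),
    Segment("customer", 4.3, 6.0, "for a team our size."),
    Segment("agent", 6.5, 8.0, "Understood."),
]
turns = merge_turns(segments)
print(said_by(turns, "customer", "pricing"))
```

Merging before filtering keeps a sentence that the recognizer split across two segments intact, so the keyword match returns the full customer statement.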
