Advanced · 14 min

Multimodal Pipelines

Multimodal pipelines add genuine value when layout, speaker identity, or visual content cannot be captured by text extraction alone. When text extraction can capture what the LLM needs, a multimodal pipeline only adds cost and hallucination risk. This article covers the coordinator pattern, how to compute real costs, and how to defend against the specific failures that take these systems down in production.

Quick Reference

  • GPT-5.4 image tokens: low detail = 85 tokens fixed (any resolution). High detail = ceil(W/32) × ceil(H/32) patches × 1.62 multiplier. A 768×768px image at high detail = 24 × 24 = 576 patches × 1.62 ≈ 933 tokens.
  • Cost math: 1 min of video at 1 FPS = 60 frames. Low detail: 60 × 85 × $2.50/M = $0.013/min. High detail: 60 × 933 × $2.50/M = $0.14/min.
  • Audio: Deepgram Nova-3 and Flux deliver sub-300ms streaming STT. AssemblyAI Universal-3 delivers ~300ms with speaker diarization built in. Whisper V4 adds native streaming and ~3.2% WER.
  • Coordinator pattern: raw input → modality router → parallel processors (asyncio.gather) → MultimodalContext → reasoning LLM. The LLM sees merged text, never raw media.
  • Tiering: use GPT-5.4 Nano (~$0.21/1K images) as a cheap classifier; escalate only complex frames to GPT-5.4 (~$2.33/1K images at high detail). Frames that never escalate cost roughly 10× less.
  • When NOT to build a multimodal pipeline: if PDF parsing, OCR, or structured extraction can give the LLM what it needs, do that instead. In those cases a multimodal pipeline adds hallucination risk and cost without adding value.
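
The token and cost arithmetic in the bullets above can be checked with a few lines of Python. This is a minimal sketch: the function and constant names are hypothetical, the numbers are the ones quoted above, rounding the high-detail count to the nearest token is an assumption, and the tiering helper assumes every frame pays the classifier price while escalated frames additionally pay the full price.

```python
import math

# Figures quoted in the Quick Reference above.
PRICE_PER_M_TOKENS = 2.50        # USD per million input tokens
LOW_DETAIL_TOKENS = 85           # fixed, regardless of resolution
PATCH_SIZE = 32                  # pixels per patch side
HIGH_DETAIL_MULTIPLIER = 1.62


def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate tokens for one image (rounding to nearest is an assumption)."""
    if detail == "low":
        return LOW_DETAIL_TOKENS
    patches = math.ceil(width / PATCH_SIZE) * math.ceil(height / PATCH_SIZE)
    return round(patches * HIGH_DETAIL_MULTIPLIER)


def video_cost_per_minute(width: int, height: int, detail: str = "high",
                          fps: float = 1.0,
                          price_per_m: float = PRICE_PER_M_TOKENS) -> float:
    """USD cost of sending one minute of sampled frames to the vision model."""
    frames = round(60 * fps)
    return frames * image_tokens(width, height, detail) * price_per_m / 1_000_000


def tiered_cost_per_1k(escalation_rate: float,
                       classifier_cost: float = 0.21,
                       full_cost: float = 2.33) -> float:
    """USD per 1K images when a cheap classifier screens every frame and
    only `escalation_rate` (0.0 to 1.0) of frames go on to the full model."""
    return classifier_cost + escalation_rate * full_cost


if __name__ == "__main__":
    print(image_tokens(768, 768, "high"))
    print(video_cost_per_minute(768, 768, "low"))
    print(video_cost_per_minute(768, 768, "high"))
    print(tiered_cost_per_1k(0.10))
```

For a 768×768 frame this gives 933 tokens at high detail, $0.01275/min at low detail, and $0.13995/min at high detail, which round to the figures quoted above. The tiering helper makes the break-even visible: the saving shrinks quickly as the share of escalated frames grows.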

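The coordinator pattern named in the Quick Reference can be sketched with `asyncio.gather`. This is an illustrative sketch, not a reference implementation: the processor functions are stand-ins for real STT and vision calls, the `MultimodalContext` fields and the `coordinate` function are hypothetical, and using `return_exceptions=True` so that one failing processor does not sink the others is a design choice assumed here.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class MultimodalContext:
    """Merged text view of all modalities; the reasoning LLM sees only this."""
    segments: dict[str, str] = field(default_factory=dict)

    def to_prompt(self) -> str:
        # Stable ordering keeps prompts reproducible across runs.
        return "\n\n".join(f"[{m}]\n{t}" for m, t in sorted(self.segments.items()))


async def process_audio(payload: bytes) -> str:
    # Stand-in for a streaming STT call.
    await asyncio.sleep(0)
    return f"transcript of {len(payload)} audio bytes"


async def process_image(payload: bytes) -> str:
    # Stand-in for a vision-model description call.
    await asyncio.sleep(0)
    return f"description of {len(payload)} image bytes"


PROCESSORS = {"audio": process_audio, "image": process_image}


async def coordinate(inputs: dict[str, bytes]) -> MultimodalContext:
    # Modality router: pick a processor per input, skip unknown modalities.
    routed = [(m, PROCESSORS[m](p)) for m, p in inputs.items() if m in PROCESSORS]
    # Run all processors concurrently; exceptions come back as values.
    results = await asyncio.gather(*(coro for _, coro in routed),
                                   return_exceptions=True)
    ctx = MultimodalContext()
    for (modality, _), result in zip(routed, results):
        if isinstance(result, Exception):
            ctx.segments[modality] = f"(processing failed: {result})"
        else:
            ctx.segments[modality] = result
    return ctx


if __name__ == "__main__":
    ctx = asyncio.run(coordinate({"audio": b"abc", "image": b"12345"}))
    print(ctx.to_prompt())
```

Only the string returned by `to_prompt()` would be passed to the reasoning LLM, which keeps raw media out of the reasoning step.
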
When to Build a Multimodal Pipeline

The most common mistake with multimodal AI is building it when you don't need it. If your input is a PDF, parse it to text. If it's a scanned form with known field positions, use OCR with a template. Multimodal pipelines are the right tool when the visual layout, speaker identity, or raw audio timing is itself the signal — not when it's just a delivery vehicle for text.

Figure: Multimodal RAG decision flow

  • Exact number extraction? YES → OCR + Summarize (tables · exact numbers; Unstructured · ~51s/page). NO → next question.
  • > 50K pages, or speed critical? YES → ColPali / ColQwen (vision embeddings · no OCR; 0.39s/page · self-hosted). NO → next question.
  • Visual layout queries? YES → Vision at Query Time (images → LLM at retrieval; <10K pages · high quality). NO → ColPali / ColQwen (default · general use).

Start with OCR+Summarize for tables · ColPali for speed · vision-at-query-time for layout-heavy corpora
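
The decision flow above can be written as a small routing function. This is a sketch only: the function name, parameter names, and returned labels are hypothetical, and only the questions and the 50K-page threshold come from the figure.

```python
def choose_retrieval_strategy(*, needs_exact_numbers: bool, page_count: int,
                              speed_critical: bool, layout_queries: bool) -> str:
    """Walk the three questions of the decision flow in order."""
    if needs_exact_numbers:
        # Tables and exact figures: OCR the pages, then summarize.
        return "ocr_summarize"
    if page_count > 50_000 or speed_critical:
        # Large or latency-sensitive corpora: vision embeddings, no OCR.
        return "colpali"
    if layout_queries:
        # Layout-heavy questions on smaller corpora: send images to the
        # LLM at retrieval time (the figure annotates this as <10K pages).
        return "vision_at_query_time"
    # General-purpose default.
    return "colpali"


if __name__ == "__main__":
    print(choose_retrieval_strategy(needs_exact_numbers=False, page_count=5_000,
                                    speed_critical=False, layout_queries=True))
```

The order of the checks matters: exact-number extraction wins over everything else, and corpus size or latency wins over layout quality.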

Text-first is almost always cheaper and more accurate

A vision model analyzing a structured invoice introduces hallucination risk on every field it reads. A PDF parser extracting the same invoice returns exact text deterministically at zero per-token cost. Only reach for multimodal when the document structure itself (charts, diagrams, handwriting, spatial layout) is the signal you need, not when text extraction is just harder to implement.

  • Use multimodal when: the content is a diagram, chart, or visual that cannot be meaningfully extracted as text — a flowchart, a screenshot, an architectural drawing.
  • Use multimodal when: speaker identity matters — a meeting transcript where you need 'what did the customer say about pricing?' instead of a flat wall of text.
  • Use multimodal when: the document contains handwriting, complex tables with merged cells, or visual layouts where position encodes meaning.
  • Use text extraction when: the document is a machine-generated PDF, a web page, or any content where `pdfplumber`, `pypdf`, or BeautifulSoup can give you clean text. Direct parsing is faster and cheaper than a vision model here, and so is OCR for clean scans.
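
One way to apply the text-first rule is to try extraction and fall back to a multimodal path only when the text layer is too thin. The sketch below assumes `pypdf` is installed for the extraction step. The helper names and both thresholds are hypothetical and would need tuning on real documents, and the check only detects a missing text layer, not charts or diagrams embedded in otherwise textual pages.

```python
def text_layer_is_usable(page_texts: list[str],
                         min_chars_per_page: int = 200,
                         min_fraction: float = 0.8) -> bool:
    """Return True when most pages carry enough extracted text to skip vision.

    Both thresholds are illustrative defaults, not measured values.
    """
    if not page_texts:
        return False
    good_pages = sum(1 for text in page_texts
                     if len(text.strip()) >= min_chars_per_page)
    return good_pages / len(page_texts) >= min_fraction


def extract_page_texts(pdf_path: str) -> list[str]:
    """Extract per-page text with pypdf (third-party, imported lazily)."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def choose_path(pdf_path: str) -> str:
    """Prefer deterministic text extraction; escalate only when it fails."""
    texts = extract_page_texts(pdf_path)
    return "text_extraction" if text_layer_is_usable(texts) else "multimodal"
```

A scanned document with no text layer yields near-empty strings from `extract_text()`, so it falls through to the multimodal path, while a machine-generated PDF stays on the cheaper deterministic path.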