Multimodal Pipelines
Building agents that process video, images, and audio in real time. Frame extraction, vision models, speaker diarization, pipeline orchestration, and cost optimization for multimodal AI systems.
Quick Reference
- Video processing: extract frames at 1-2 FPS (not every frame), send to a vision model (GPT-5.4, Claude) for scene understanding. Most video is redundant — sample intelligently.
- Real-time image analysis: camera feeds for document scanning, visual QA, and environment understanding. Resize images to 512-1024px before sending to reduce cost and latency.
- Audio transcription: real-time STT for live streams, batch Whisper for recordings. Add speaker diarization to know who said what.
- Pipeline orchestration: use a coordinator that routes each modality to its processor, then merges results into a unified context for the LLM.
- Cost control: multimodal tokens are 10-50x more expensive than text. Cache results, sample frames, resize images, and set hard budget limits per request.
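The coordinator-plus-budget pattern from the last two bullets can be sketched as follows. This is a minimal illustration, not a definitive implementation: the processor functions, the per-item cost table, and the `MAX_COST_PER_REQUEST` cap are all hypothetical placeholders you would replace with real processors and your provider's actual pricing.

```python
from dataclasses import dataclass, field

# Hypothetical hard budget per request, in US dollars. Multimodal inputs
# are far pricier than text, so the coordinator refuses work up front
# instead of discovering the overrun on the invoice.
MAX_COST_PER_REQUEST = 0.50

# Illustrative cost per item (a frame, an image, a minute of audio) --
# real prices depend on the provider and model.
COST_PER_ITEM = {"video": 0.01, "image": 0.01, "audio": 0.006}

@dataclass
class PipelineResult:
    contexts: list = field(default_factory=list)  # merged text for the LLM
    estimated_cost: float = 0.0

# Stand-in processors: in a real pipeline these would call a vision
# model, an image analyzer, and an STT service respectively.
def process_video(items): return [f"[video] {i}" for i in items]
def process_image(items): return [f"[image] {i}" for i in items]
def process_audio(items): return [f"[audio] {i}" for i in items]

PROCESSORS = {"video": process_video, "image": process_image, "audio": process_audio}

def coordinate(inputs: dict) -> PipelineResult:
    """Route each modality to its processor, enforce the budget cap,
    and merge all outputs into one unified context for the LLM."""
    result = PipelineResult()
    for modality, items in inputs.items():
        cost = COST_PER_ITEM[modality] * len(items)
        if result.estimated_cost + cost > MAX_COST_PER_REQUEST:
            raise RuntimeError(f"budget exceeded at modality {modality!r}")
        result.estimated_cost += cost
        result.contexts.extend(PROCESSORS[modality](items))
    return result
```

Rejecting over-budget requests before any model call is the cheapest place to enforce the limit; caching and frame sampling then reduce how often you get near the cap at all.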
Video Processing with Vision Models
A 30 FPS video produces 1,800 frames per minute. Sending all of them to a vision model is prohibitively expensive and slow. Most consecutive frames are nearly identical. Sample at 1-2 FPS for general understanding, or use scene-change detection to extract only frames where something visually changed.
After extracting frames, send them to a vision model with a prompt that describes what you need. Batch frames together when possible — GPT-5.4 and Claude can process multiple images in a single request. Include the timestamp with each frame so the model can reference specific moments in its analysis.
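Batching frames with their timestamps might look like the sketch below, which assembles one user message in the content-block style used by OpenAI- and Claude-like vision APIs. The exact field names (`image_url`, data-URL encoding) follow one common convention and are an assumption here, not a spec; check your provider's request format.

```python
import base64

def build_frame_message(frames: list, prompt: str) -> dict:
    """Assemble one user message that interleaves a timestamp label
    with each sampled frame, so the model can cite specific moments.

    frames: list of (timestamp_seconds, raw_jpeg_bytes) tuples.
    """
    content = [{"type": "text", "text": prompt}]
    for ts, jpeg_bytes in frames:
        b64 = base64.b64encode(jpeg_bytes).decode("ascii")
        # Label each frame with its timestamp before the image block.
        content.append({"type": "text", "text": f"Frame at t={ts:.1f}s:"})
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"role": "user", "content": content}
```

With frames sampled at 1 FPS, a one-minute clip becomes a single request of 60 labeled images instead of 60 separate round trips, and the prompt can ask for answers keyed to the `t=...s` labels.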