Production & Scale/Inference Optimization
Advanced16 min

Model Routing

Route queries to cheaper models only when the math works — after measuring your traffic mix, enabling prompt caching, building real quality gates, and wiring drift detection. This article starts with whether you should route at all, walks through transparent cost math with April 2026 pricing, replaces naive quality checks with real gates, and ends with a 30-day rollout runbook.

Quick Reference

  • Routing only pays off when >40% of traffic is genuinely simple AND monthly LLM spend exceeds ~$1K — below that, prompt caching alone is simpler and often cheaper
  • April 2026 fast-tier picks: GPT-5.4 Nano ($0.20/MTok input) or Gemini 3.1 Flash Lite ($0.25/MTok) — o4-mini is retired as of Feb 2026
  • Router overhead must be <15ms (rules or classifier) — measure routing latency as a first-class metric, not an afterthought
  • Quality gate order: refusal detection (free, <1ms) → schema validation (free, <1ms) → LLM judge sampled at 10-20% of fast-tier responses (~$0.001/call)
  • Fallback rate target 5-15%: above 20% means the router is routing too aggressively to cheap models and cost savings evaporate
  • Monitor tier distribution daily; alert on Jensen-Shannon divergence >0.05 from 30-day baseline — catches category drift before quality degrades
  • Prompt caching (90% savings on cached input tokens, zero classification overhead) is always worth enabling before building a router

Should I Route at All?

Routing adds a permanent maintenance surface

A router means you now own: a classifier that needs training data and periodic retraining, a per-tier eval suite, a drift monitor, a fallback path, and cost accounting per tier. If the traffic volume or cost pressure doesn't justify that overhead, a single model with prompt caching is simpler and often gives you most of the savings anyway.

SituationRecommendationWhy
Monthly LLM spend < $500Skip routing; enable prompt cachingRouter implementation + ongoing maintenance costs more than it saves at low volume
< 40% of queries are genuinely simpleSkip routingMost traffic goes premium anyway — the router adds latency and complexity for marginal savings
Single task type (e.g., all code generation)Skip routingNo meaningful quality difference between tiers on a uniform task distribution
No per-tier eval harness existsBuild eval firstYou cannot verify routing doesn't degrade quality without per-tier measurement
> $1K/mo spend, > 50% simple traffic, eval existsRouteThe cost math works and you can measure the quality impact

Prompt caching cuts input costs by 90% on cached content with zero classification overhead — add a cache_control marker to your system prompt, and the provider handles the rest. If your system prompt is large (>1K tokens) and conversations are multi-turn, caching alone may save more than routing. The batch API (50% off, 24h turnaround) is another simpler lever for non-interactive workloads. Use these first. Routing and caching are complementary — but caching is always day one.

Real project

A team built a three-tier classifier router over two weeks before discovering that 80% of their cost was the 4K-token system prompt re-sent on every turn. Enabling prompt caching saved 45% with one afternoon of work. The router added another 15% on top — but they could have shipped the 45% saving immediately instead of delaying it by a sprint.

Learn this in → Cost Optimization & Caching