Model Selection Framework
How to choose the right model for your workload — and when not to add routing at all. Covers the 2026 model landscape, scorecard evaluation, real cost math, fallback chains, and a first-30-days runbook for moving from a single model to a production router.
Quick Reference
- Four decision gates before building a router: latency, compliance, single-task, and cost priority
- Budget tier (<$0.30/1M input): GPT-5.4 Nano ($0.20), Gemini 3.1 Flash ($0.10), DeepSeek V3.2 ($0.14)
- Production tier: GPT-5.4 ($2.50/$15), Gemini 3.1 Pro ($2/$12), Sonnet 4.6 ($3/$15)
- Frontier tier: Claude Opus 4.7 ($5/$25), for agentic tasks and complex reasoning only
- Router overhead: ~$0.44 per 1K requests (o4-mini classifier at 400 tokens avg)
- Cascade pattern: try cheap model first, escalate only when quality check fails
- Fallback chains: fix your model IDs now; claude-sonnet-4-20250514 retires June 15, 2026
- Re-baseline routing logic quarterly, since model updates silently shift quality boundaries
In this article
- 1. Should You Use Multi-Model Routing at All?
- 2. The 2026 Model Landscape: Tiers, Pricing, and Trade-offs
- 3. Building a Model Scorecard
- 4. Multi-Model Strategies with Real Cost Math
- 5. Building a Model Router
- 6. Fallback Chains and Provider Resilience
- 7. Monitoring Router Accuracy and Model Drift
- 8. First 30 Days: From Single Model to Production Router
- Best Practices
- Key Takeaways
Should You Use Multi-Model Routing at All?
Most teams add model routing before they should. Routing adds a classification latency hop (~200–400ms), a new failure mode, prompt compatibility surface area across providers, and ongoing maintenance. Before building a router, check four exit conditions. If any one is true, there's a simpler path.
A router only pays off when none of these simpler exits apply
| Exit condition | Simpler path | Routing still makes sense when... |
|---|---|---|
| Latency < 500ms required | Pick one fast model (o4-mini, Gemini Flash) — skip the classifier hop | Async processing or batch jobs where throughput matters more than per-call latency |
| PII / compliance / data residency | Self-host Llama 4 or DeepSeek V3.2 — all data stays on your infra | You need cloud scale AND compliance, so private-cloud routing is an option |
| Single task type all day | Hardcode one model per task — no classifier needed | Your product has 5+ distinct task types with different complexity distributions |
| Cost is top priority, quality is flexible | Cascade: try GPT-5.4 Nano first, escalate if a heuristic quality check fails | You need maximum quality on a slice of requests that you can't define in advance |
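A classifier call adds 200–400ms of latency and costs ~$0.44 per 1K requests (400-token classification prompt at o4-mini pricing, $1.10/1M input). That overhead is only worth it if routing saves more than $0.44 per 1K requests in model costs. The saving on each diverted token is the price gap between the two models, not the cheap model's price: moving input from GPT-5.4 ($2.50/1M) to GPT-5.4 Nano ($0.20/1M) saves $2.30 per 1M tokens, so about 191K diverted input tokens per 1K requests covers the classifier (output-token savings lower that bar further). Do that math before you build.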
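The overhead and break-even arithmetic can be sketched in a few lines. This is a minimal illustration, assuming the classifier is billed on input tokens only and that the saving per diverted token is the input-price gap between the expensive and the cheap model; the function names are hypothetical.

```python
def classifier_overhead_per_1k(prompt_tokens: int, input_price_per_1m: float) -> float:
    """Dollar cost of classifying 1,000 requests, counting input tokens only."""
    return 1_000 * prompt_tokens * input_price_per_1m / 1_000_000


def net_saving_per_1k(diverted_input_tokens: float,
                      expensive_price_per_1m: float,
                      cheap_price_per_1m: float,
                      overhead_per_1k: float) -> float:
    """Input-side saving from the diverted tokens, minus the classifier overhead."""
    gap = expensive_price_per_1m - cheap_price_per_1m
    return diverted_input_tokens * gap / 1_000_000 - overhead_per_1k


def break_even_tokens(expensive_price_per_1m: float,
                      cheap_price_per_1m: float,
                      overhead_per_1k: float) -> float:
    """Diverted input tokens (per 1K requests) at which net saving reaches zero."""
    gap = expensive_price_per_1m - cheap_price_per_1m
    if gap <= 0:
        raise ValueError("the cheap model must cost less than the expensive one")
    return overhead_per_1k * 1_000_000 / gap


# o4-mini classifier, 400-token prompt; GPT-5.4 vs GPT-5.4 Nano input prices.
overhead = classifier_overhead_per_1k(400, 1.10)
print(f"overhead per 1K requests: ${overhead:.2f}")
print(f"break-even diverted tokens: {break_even_tokens(2.50, 0.20, overhead):,.0f}")
```

Output-token savings are left out on purpose, which makes the break-even estimate conservative.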
A team building an internal Q&A tool routed every query through a complexity classifier. After 3 months: 94% of requests went to the same model anyway (GPT-5.4 Nano), 5% failed the quality heuristic and escalated, 1% hit edge cases the classifier got wrong. They removed the router, hardcoded GPT-5.4 Nano with a human-in-the-loop escalation button, and cut P99 latency by 380ms.
Model routing earns its complexity only when the traffic distribution justifies it.
The 2026 Model Landscape: Tiers, Pricing, and Trade-offs
As of April 2026, the model landscape has four meaningful tiers. The lines between tiers matter more than the specific models — models shift prices and capabilities frequently. Use this as a map, not a contract.
Input cost ($/1M tokens) vs capability tier — most production tasks live in Mid-Range
| Tier | Models (April 2026) | Input $/1M | Use when |
|---|---|---|---|
| Budget | GPT-5.4 Nano, Gemini 3.1 Flash, DeepSeek V3.2 | $0.10–$0.28 | Classification, extraction, summarization, high-volume simple tasks |
| Mid-range | GPT-5.4 Mini, o4-mini | $0.75–$1.10 | Code generation, multi-step analysis, tasks where budget tier misses consistently |
| Production | GPT-5.4, Gemini 3.1 Pro, Sonnet 4.6 | $2.00–$3.00 | Complex reasoning, nuanced instruction-following, agent planning |
| Frontier | Claude Opus 4.7, o3, GPT-5.4 Pro | $5.00–$30.00 | Agentic workflows, frontier coding, tasks where everything else fails |
Don't think in terms of 'best model' — think in terms of 'cheapest model that meets my quality bar.' Measure that bar against your actual data. A task where GPT-5.4 Nano scores 96% accuracy is not a candidate for Sonnet 4.6, regardless of what the benchmarks say about reasoning capability.
Privacy and data residency are a tier-selection constraint, not an afterthought. If your workload processes PII, healthcare data, or content subject to GDPR, your model choice is bounded before cost enters the picture. Self-hosted options (Llama 4, DeepSeek V3.2 via private cloud) sit at budget-tier pricing but require your team to own inference infrastructure.
Claude Opus 4.7 uses a tokenizer that can produce up to 35% more tokens for the same input text compared to Opus 4.6. If you migrated from Opus 4.6, your actual costs at identical pricing ($5/$25 per 1M) may be 0–35% higher depending on your content type. Measure your token counts before budgeting.
Building a Model Scorecard
A model scorecard is a repeatable evaluation of candidate models against your actual task. Never select a model based on public benchmarks alone, because they measure a different distribution than your production data. A hand-labeled eval set of 50–100 cases gives you actionable signal in under a day.
50–100 cases is enough to rank models reliably. Compose it as: 60% representative easy cases, 30% hard cases where you expect the cheap model to struggle, 10% edge cases from past production failures. This distribution reveals the accuracy cliff — the point where a cheaper model starts failing on your actual hard inputs.
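A scorecard harness for an eval set like this can be quite small. The sketch below is an assumption about how one might be structured, not a reference implementation: `EvalCase`, `exact_match`, and `run_scorecard` are hypothetical names, and each model is passed in as a plain callable so that provider SDK calls stay outside the harness.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str
    bucket: str  # "easy", "hard", or "edge"


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader; swap in a task-specific check."""
    return output.strip().lower() == expected.strip().lower()


def run_scorecard(models: dict[str, Callable[[str], str]],
                  cases: list[EvalCase],
                  grade: Callable[[str, str], bool] = exact_match,
                  ) -> dict[str, dict[str, float]]:
    """Return accuracy per model, overall and per bucket."""
    scorecard = {}
    for name, call_model in models.items():
        hits = defaultdict(int)
        totals = defaultdict(int)
        for case in cases:
            correct = grade(call_model(case.prompt), case.expected)
            for key in ("overall", case.bucket):
                totals[key] += 1
                hits[key] += int(correct)
        scorecard[name] = {key: hits[key] / totals[key] for key in totals}
    return scorecard


# Stub models stand in for real API calls in this example.
cases = [
    EvalCase("2+2", "4", "easy"),
    EvalCase("capital of France", "Paris", "easy"),
    EvalCase("17*23", "391", "hard"),
]
answers = {"2+2": "4", "capital of France": "paris", "17*23": "391"}
models = {
    "cheap-stub": lambda p: "" if p == "17*23" else answers[p],
    "strong-stub": lambda p: answers[p],
}
print(run_scorecard(models, cases))
```

Reporting accuracy per bucket rather than only overall is what exposes the accuracy cliff: a cheap model can look fine in aggregate while failing most of the hard bucket.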
Once you have scorecard results, the selection question becomes: at what accuracy threshold does the cost difference justify the upgrade? If GPT-5.4 Nano scores 94% and o4-mini scores 97%, and your task tolerates 6% errors, stay on GPT-5.4 Nano. If it doesn't, estimate the cost of an error (support ticket, failed transaction, bad output) and compare the error cost avoided per request against the incremental model cost per request.
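That comparison reduces to one inequality. The sketch below assumes errors are independent and each one costs the same amount; the function name and the dollar figures in the example are hypothetical.

```python
def upgrade_pays_off(cheap_accuracy: float,
                     better_accuracy: float,
                     cost_per_error: float,
                     extra_model_cost_per_request: float) -> bool:
    """True when the error cost avoided per request exceeds the added model cost."""
    errors_avoided = better_accuracy - cheap_accuracy  # fraction of requests
    return errors_avoided * cost_per_error > extra_model_cost_per_request


# 94% vs 97% accuracy, with an assumed $0.0015 extra model cost per request.
print(upgrade_pays_off(0.94, 0.97, cost_per_error=0.50,
                       extra_model_cost_per_request=0.0015))
print(upgrade_pays_off(0.94, 0.97, cost_per_error=0.01,
                       extra_model_cost_per_request=0.0015))
```

With a three-point accuracy gap, the upgrade is worth it only when a single error costs more than roughly 33 times the per-request price difference.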
Multi-Model Strategies with Real Cost Math
The right multi-model strategy depends on whether you can classify request complexity ahead of time. If you can, use a router. If you can't, or classification is too expensive, use a cascade. The savings are real, but they only pay off at scale.
Classifier cost: $440 · Nano tier: $287 · GPT-5.4 tier: $1,500 · Total: $2,227
The diagram above uses real April 2026 pricing. Scenario: 1M requests/month, 800 input tokens + 200 output tokens on average. Single model (GPT-5.4 for all requests): $5,000/month. Routed (70% GPT-5.4 Nano, 30% GPT-5.4, plus classifier overhead): $2,227/month, a 55% reduction. The classifier itself costs $440/month. Each request that moves from GPT-5.4 ($0.0050) to Nano (about $0.0004, from the $287 Nano-tier figure) saves roughly $0.0046, so the classifier pays for itself only once it routes more than about 96K requests a month (roughly 10% of traffic) to the cheaper model.
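The scenario totals can be reproduced with a small pricing table. This is a sketch under stated assumptions: the input prices and the GPT-5.4 output price are the ones quoted in this article, while Nano's $1.25/1M output price is not stated anywhere in it and is back-derived here from the $287 Nano-tier figure.

```python
# $ per 1M tokens as (input, output).
# ASSUMPTION: the Nano output price is back-derived from the $287 figure.
PRICING = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.4-nano": (0.20, 1.25),
}
CLASSIFIER_INPUT_PRICE = 1.10  # o4-mini, $ per 1M input tokens
CLASSIFIER_TOKENS = 400        # average classification prompt size


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request on the given model."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def routed_monthly_cost(requests: int, cheap_share: float,
                        input_tokens: int = 800,
                        output_tokens: int = 200) -> dict[str, float]:
    """Monthly cost split for a two-tier router with a classifier in front."""
    classifier = requests * CLASSIFIER_TOKENS * CLASSIFIER_INPUT_PRICE / 1_000_000
    cheap = requests * cheap_share * request_cost(
        "gpt-5.4-nano", input_tokens, output_tokens)
    strong = requests * (1 - cheap_share) * request_cost(
        "gpt-5.4", input_tokens, output_tokens)
    return {"classifier": classifier, "cheap": cheap, "strong": strong,
            "total": classifier + cheap + strong}


single = 1_000_000 * request_cost("gpt-5.4", 800, 200)
routed = routed_monthly_cost(1_000_000, cheap_share=0.70)
saving_pct = 100 * (1 - routed["total"] / single)
print(f"single ${single:,.0f} | routed ${routed['total']:,.0f} | saving {saving_pct:.0f}%")
```

Changing `cheap_share` shows how quickly the saving shrinks when less traffic qualifies for the cheap tier.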
| Pattern | How it works | When it pays off | Complexity |
|---|---|---|---|
| Router | Classifier model labels request complexity, routes to appropriate tier | Traffic has a measurable easy/hard split; classification is cheaper than the savings | Medium — classifier adds latency + failure mode |
| Cascade | Run cheap model, check quality heuristic, escalate on fail | You can't reliably classify upfront but quality is measurable after the fact | Low — sequential, no classifier needed |
| Task-specific hardcode | Different models wired to different task endpoints | Distinct task types with known complexity; no ambiguity at routing time | Very low — no runtime routing logic |
| Confidence-based | Cheap model generates + self-evaluates confidence, escalate below threshold | Model self-assessment correlates with actual quality on your task | High — requires prompt engineering + validation loop |
Before building a classifier, try cascade: run GPT-5.4 Nano or Gemini 3.1 Flash first, then check the output against a lightweight quality heuristic (length, format validation, entity presence). Escalate only on failure. This adds one sequential call but eliminates the classifier entirely. Many teams find the cascade handles 80%+ of traffic and the implementation takes a day, not a week.
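The cascade described above can be sketched as follows. This is a minimal illustration that assumes the task returns a JSON object, so format validation and entity presence can both be checked cheaply; `passes_quality_check` and `cascade` are hypothetical names, and the two models are passed in as plain callables.

```python
import json
from typing import Callable


def passes_quality_check(output: str, required_keys: tuple[str, ...],
                         min_chars: int = 20) -> bool:
    """Lightweight heuristic: length gate, JSON format validation, key presence."""
    if len(output) < min_chars:
        return False
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


def cascade(prompt: str,
            cheap_model: Callable[[str], str],
            strong_model: Callable[[str], str],
            required_keys: tuple[str, ...]) -> tuple[str, bool]:
    """Try the cheap model first; escalate only when the heuristic fails.

    Returns (output, escalated).
    """
    draft = cheap_model(prompt)
    if passes_quality_check(draft, required_keys):
        return draft, False
    return strong_model(prompt), True


# Stub models stand in for real API calls in this example.
good = '{"name": "Ada", "email": "ada@example.com"}'
output, escalated = cascade("extract contact", lambda p: "sorry", lambda p: good,
                            required_keys=("name", "email"))
print(escalated, output)
```

Returning the `escalated` flag alongside the output makes it straightforward to measure what share of traffic the cheap model handles on its own.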
Building a Model Router
A router uses a cheap model to classify incoming requests and dispatches them to the appropriate tier. The classifier call must be cheaper than the savings it enables — that constraint determines which tasks are worth routing.
When the classifier returns an unexpected value, falling back to the most expensive model looks safe but makes your costs unbounded in error cases. Default to the mid tier — it's capable enough to handle almost anything, and the cost exposure is controlled. Log parse failures; if they exceed 2–3% of traffic, your classifier prompt needs rework.
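One way to implement that default is sketched below. The `Router` class, its method names, and the tier labels are illustrative assumptions; the classifier and the per-tier handlers are injected as callables, and the 1,000-character cap follows this article's recommendation for classifier input.

```python
from typing import Callable

TIERS = ("budget", "mid", "production", "frontier")
DEFAULT_TIER = "mid"          # bounded cost exposure on parse failures
MAX_CLASSIFIER_CHARS = 1_000  # classify complexity, don't re-read the prompt


class Router:
    def __init__(self, classify: Callable[[str], str],
                 handlers: dict[str, Callable[[str], str]]):
        self.classify = classify
        self.handlers = handlers
        self.requests = 0
        self.parse_failures = 0

    def pick_tier(self, prompt: str) -> str:
        """Ask the classifier for a tier; fall back to mid on anything unexpected."""
        label = self.classify(prompt[:MAX_CLASSIFIER_CHARS]).strip().lower()
        if label in TIERS:
            return label
        self.parse_failures += 1
        return DEFAULT_TIER

    def handle(self, prompt: str) -> tuple[str, str]:
        """Route one request and return (tier, output)."""
        self.requests += 1
        tier = self.pick_tier(prompt)
        return tier, self.handlers[tier](prompt)

    @property
    def parse_failure_rate(self) -> float:
        return self.parse_failures / self.requests if self.requests else 0.0
```

Tracking `parse_failure_rate` on the router itself gives a direct reading for the 2–3% rework threshold. A production version would also need to treat classifier exceptions and timeouts as parse failures.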
A team's classifier started routing 40% of requests to the PRODUCTION tier after three months, up from 12% at launch. Investigation: GPT-5.4's system prompt had been updated to include a complex multi-paragraph context that the classifier now consistently scored as 'complex reasoning.' The model hadn't changed — the prompt had. Router accuracy drifted because the input distribution drifted. Fix: add production tier rate to the daily monitoring dashboard with a >20% alert threshold.
Router drift comes from prompt changes, not just model changes.
Fallback Chains and Provider Resilience
API providers go down, rate-limit burst traffic, and deprecate model IDs without automatic migration. A fallback chain retries across providers so your application stays operational. The chain adds complexity — design it once and maintain it explicitly.
Anthropic announced on April 14, 2026 that claude-sonnet-4-20250514 and claude-opus-4-20250514 retire on June 15, 2026. API calls to these IDs will return errors after that date — there is no automatic failover. Migrate to claude-sonnet-4-6 and claude-opus-4-7 respectively. Production prompts migrate without changes; the 4.x API is compatible across generations.
When your fallback goes from Claude to OpenAI: (1) structured output schemas differ — Claude uses tool_use blocks, OpenAI uses json_schema; (2) system prompt behavior differs — Claude processes system as a top-level field, OpenAI embeds it in messages; (3) token counts diverge by 10–30% for the same text. Test every model in your chain against your full prompt suite. A fallback that produces malformed output is worse than an error.
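A fallback chain that rejects malformed output instead of passing it through might look like the sketch below. It assumes each entry in the chain is a provider-specific adapter (a hypothetical callable that already handles that provider's system prompt placement and output schema), and that `is_valid` is your own output check; none of these names come from a real SDK.

```python
import logging
from typing import Callable

logger = logging.getLogger("fallback")


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain errored or returned bad output."""


def call_with_fallback(prompt: str,
                       chain: list[tuple[str, Callable[[str], str]]],
                       is_valid: Callable[[str], bool]) -> tuple[str, str]:
    """Try providers in order; skip any that raise or return malformed output.

    Returns (provider name, output).
    """
    errors = []
    for position, (name, call) in enumerate(chain):
        try:
            output = call(prompt)
        except Exception as exc:  # outage, rate limit, retired model ID
            errors.append(f"{name}: {exc!r}")
            continue
        if not is_valid(output):
            errors.append(f"{name}: malformed output")
            continue
        if position > 0:
            logger.warning("fallback used: %s after %s", name, errors)
        return name, output
    raise AllProvidersFailed("; ".join(errors))
```

Validating the output at every step is the design choice that matters here: a provider that answers with the wrong structure is treated the same as one that is down, so malformed output never reaches the caller.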
Monitoring Router Accuracy and Model Drift
A router that worked at launch will drift. Three forces cause it: (1) your prompt changes, shifting the input distribution the classifier sees; (2) model updates change quality boundaries between tiers; (3) your product evolves and introduces request types the classifier has never seen. None of these are visible without explicit monitoring.
| Signal | Alert threshold | Likely cause | Action |
|---|---|---|---|
| Production tier % | > 20% of traffic | Prompt became more complex, classifier drift, model update | Re-run classifier eval on a sample of recent traffic |
| Budget tier % | < 50% of traffic | Over-routing, prompt jailbreak, distribution shift | Sample budget-tier failures, check quality heuristic |
| Fallback usage | > 1% of requests | Primary provider degraded, rate-limit creep | Check provider status page, adjust retry backoff |
| Quality failure rate | > 5% on any tier | Model update changed output quality, prompt regression | Re-run full scorecard against current model versions |
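The four signals in the table above can be tracked with a small in-memory counter. The sketch below is one possible shape for such a monitor: only the thresholds come from the table, while the class layout, method names, and tier labels are assumptions. A real deployment would export these counts to a metrics system and reset them per time window.

```python
from collections import Counter


class RouterMonitor:
    """Counts routing events and checks them against the four alert thresholds."""

    def __init__(self) -> None:
        self.tier_counts = Counter()
        self.quality_failures = Counter()
        self.fallbacks = 0

    def record(self, tier: str, used_fallback: bool = False,
               quality_ok: bool = True) -> None:
        self.tier_counts[tier] += 1
        self.fallbacks += int(used_fallback)
        self.quality_failures[tier] += int(not quality_ok)

    def alerts(self) -> list[str]:
        total = sum(self.tier_counts.values())
        if total == 0:
            return []
        found = []
        if self.tier_counts["production"] / total > 0.20:
            found.append("production tier above 20% of traffic")
        if self.tier_counts["budget"] / total < 0.50:
            found.append("budget tier below 50% of traffic")
        if self.fallbacks / total > 0.01:
            found.append("fallback usage above 1% of requests")
        for tier, count in self.tier_counts.items():
            if self.quality_failures[tier] / count > 0.05:
                found.append(f"quality failure rate above 5% on {tier} tier")
        return found


monitor = RouterMonitor()
for _ in range(70):
    monitor.record("budget")
for _ in range(30):
    monitor.record("production")
print(monitor.alerts())
```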
Model providers push silent updates. GPT-5.4 in January is not GPT-5.4 in October. Set a recurring calendar reminder to re-run your scorecard and spot-check routing distributions every quarter. A task where GPT-5.4 Nano used to score 89% may score 94% after a model update — meaning you can move it down a tier and save cost.
First 30 Days: From Single Model to Production Router
Each step below unlocks the next. Don't skip to step 4 — the eval set from step 2 is what tells you whether your router is working in step 5.
Day 1–2: Fix your model IDs
Audit every hardcoded model string in your codebase. Replace claude-sonnet-4-20250514 and claude-opus-4-20250514 (retiring June 15, 2026) with claude-sonnet-4-6 and claude-opus-4-7. While you're there, add a PRICING dict constant (see scorecard section) so costs are visible in one place.
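The audit can be scripted. The sketch below scans a source tree for the two retiring IDs named above; the function name and the list of file suffixes are assumptions, so adjust the suffixes to match your repository.

```python
from pathlib import Path

# Retiring IDs and their replacements, as listed in this article.
REPLACEMENTS = {
    "claude-sonnet-4-20250514": "claude-sonnet-4-6",
    "claude-opus-4-20250514": "claude-opus-4-7",
}
# ASSUMPTION: file types likely to hold model strings; extend as needed.
SCANNED_SUFFIXES = {".py", ".ts", ".js", ".json", ".yaml", ".yml", ".toml"}


def find_retiring_ids(root: str) -> list[tuple[str, int, str]]:
    """Return (path, line number, retiring ID) for every hit under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SCANNED_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_number, line in enumerate(text.splitlines(), start=1):
            for old_id in REPLACEMENTS:
                if old_id in line:
                    hits.append((str(path), line_number, old_id))
    return hits
```

Calling `find_retiring_ids(".")` from the repository root lists every file and line to update, and `REPLACEMENTS[old_id]` gives the ID to switch to. Model IDs stored in environment variables or a secrets manager will not show up in a file scan and need a separate check.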
Day 3–5: Build your eval set
Hand-label 50–100 production requests with expected outputs. Aim for 60% easy / 30% hard / 10% edge cases. This is the foundation everything else builds on — skip it and you're flying blind.
Day 6–8: Run the scorecard
Run the scorecard harness against at least 3 models: your current model, the cheapest model in the tier below, and GPT-5.4 Nano or Gemini 3.1 Flash. If the cheaper model scores within 3 percentage points of your current model's accuracy, switch immediately; no routing is needed.
Day 9–12: Try cascade before router
If the cheap model falls short, implement a cascade: run it first, apply a lightweight quality heuristic (format check, length gate, entity validation), and escalate only on failure. Measure what percentage of requests escalate. If it's < 15%, the cascade is your production strategy.
Day 13–18: Build the router only if cascade escalates > 15%
Implement the classifier-based router from this article. Run it against your eval set with the cheap model as budget tier and your current model as production tier. Measure classification accuracy on your eval set — target >90% before shipping.
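Measuring classification accuracy needs tier labels on the eval set in addition to expected outputs. The sketch below assumes each eval case has been hand-labeled with the tier it should route to; `classifier_report` is a hypothetical helper, and the classifier is passed in as a callable.

```python
from collections import Counter
from typing import Callable


def classifier_report(labeled: list[tuple[str, str]],
                      classify: Callable[[str], str]) -> tuple[float, Counter]:
    """Return accuracy and a Counter of (expected, predicted) misroutes."""
    if not labeled:
        raise ValueError("need at least one labeled case")
    misroutes = Counter()
    correct = 0
    for prompt, expected_tier in labeled:
        predicted = classify(prompt)
        if predicted == expected_tier:
            correct += 1
        else:
            misroutes[(expected_tier, predicted)] += 1
    return correct / len(labeled), misroutes


# A stub classifier stands in for the real classifier call in this example.
labeled = [("short lookup", "budget")] * 9 + [("plan a migration", "production")]
accuracy, misroutes = classifier_report(labeled, lambda prompt: "budget")
ready_to_ship = accuracy > 0.90
print(accuracy, misroutes.most_common(), ready_to_ship)
```

The misroute counts matter as much as the headline accuracy: sending a production-tier request to the budget tier hurts quality, while the reverse only costs money.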
Day 19–22: Add fallback chain and fix prompt compatibility
Test every prompt in your chain against every model in the fallback. Pay special attention to structured output schemas and system prompt behavior. Log fallback events. Set a >1% fallback rate alert.
Day 23–30: Add monitoring and set re-baseline reminder
Add the RouterMonitor from this article. Set up the four alert thresholds. Put a quarterly re-baseline reminder in your team calendar. You're done.
Best Practices
Do
- Check all four exit conditions (latency, compliance, single-task, cost) before building a router; simpler alternatives exist for each
- Build a task-specific eval set of 50–100 labeled cases before comparing models
- Start with cascade (cheap model first, escalate on quality fail) before building a classifier
- Record actual input and output token counts from the API; pricing estimates based on word count are consistently wrong
- Test your prompt against every model in the fallback chain before deploying, because structured output schemas and system prompt handling differ between providers
- Log every fallback event and set a >1% fallback rate alert; creeping fallback usage signals provider degradation
- Add tier distribution to your daily dashboard with a >20% production-tier alert threshold
- Re-run your scorecard quarterly; silent model updates can move your quality thresholds up or down
- Cap the classifier input to ~1000 characters; you're classifying complexity, not re-reading the full prompt
- Default to mid-tier on classifier parse failures, never frontier; the cost exposure from unbounded escalation is worse than occasional over-capability
Don’t
- Don't build a router if 90%+ of your traffic routes to the same tier anyway; measure first
- Don't use benchmark scores as your quality bar; measure accuracy on your actual data
- Don't hardcode a single model without a fallback strategy; providers have outages
- Don't use deprecated model IDs (claude-sonnet-4-20250514 retires June 15, 2026); calls return errors after that date with no automatic failover
- Don't treat all four tiers as interchangeable via a fallback chain; test prompt compatibility explicitly
- Don't report cost savings without computing the classifier overhead; the break-even point is real
- Don't skip the monitoring step; routers drift silently from prompt changes and model updates
- Don't assume Opus 4.7 costs the same as Opus 4.6 for the same prompts; the new tokenizer can produce 0–35% more tokens
- Don't route to a reasoning model (o3) for tasks that don't need extended thinking; hidden reasoning tokens are billed as output tokens and can make a request 10–100× more expensive than a standard completion
- Don't mix routing tiers with fallback tiers in the same chain; escalation-on-failure is not the same as fallback-on-outage
Key Takeaways
- Check four exit conditions before building a router: latency, compliance, single-task, and cost-priority each have simpler solutions.
- The 2026 budget tier ($0.10–$0.28/1M input) is GPT-5.4 Nano, Gemini 3.1 Flash, and DeepSeek V3.2. Start there and move up only when quality demands it.
- Cascade (cheap model first, escalate on quality failure) is usually faster to ship than a router and handles most mixed-traffic cases.
- A classifier call costs ~$0.44 per 1K requests at current o4-mini pricing. It only pays off if it diverts enough traffic to cover that overhead.
- claude-sonnet-4-20250514 and claude-opus-4-20250514 retire June 15, 2026. API calls return errors with no automatic failover after that date.
- Monitor tier distribution and re-baseline quarterly, because routers drift silently from prompt changes and silent model updates.