
Model Selection Framework

A systematic framework for choosing the right LLM for your use case across four dimensions: capability, cost, latency, and privacy. Includes model scorecards, multi-model strategies, fallback chains, and a working model router implementation.

Quick Reference

  • Four dimensions: capability (quality), cost (per token), latency (TTFT + generation speed), privacy (data handling)
  • Build a model scorecard: test each candidate on your actual data, not benchmarks
  • Multi-model strategy: cheap model for routing/simple tasks, expensive model for complex tasks
  • Fallback chains: primary -> secondary -> fallback with automatic failover
  • Always start with the cheapest model and move up only when quality is insufficient
  • Re-evaluate model selection quarterly as new models are released and pricing changes

The Four Dimensions of Model Selection

Every model selection decision involves trade-offs across four dimensions. No model wins on all four, so your task is to determine which dimensions matter most for your specific use case.

| Dimension | What to measure | Key metric | Example constraint |
|---|---|---|---|
| Capability | How well does the model perform your task? | Task-specific accuracy | Must achieve >95% extraction accuracy |
| Cost | What is the per-request and monthly cost? | $/1M tokens and $/request | Budget of $5K/month for 10M requests |
| Latency | How fast does the response arrive? | TTFT + tokens/second | Must respond within 2 seconds for UX |
| Privacy | Where does your data go? | Data residency, retention policy | PII must never leave our infrastructure |
The capability-cost frontier

Models sit on a capability-cost curve. o4-mini and Gemini Flash occupy the 'good enough for most tasks at very low cost' position. GPT-5.4 and Claude Sonnet 4.6 are the 'high quality at moderate cost' sweet spot. o3 and Claude Opus 4.6 sit at 'maximum quality at premium cost.' Most applications should start at the cheap end and move up only when quality demands it.

Building a Model Scorecard

A model scorecard is a systematic evaluation of candidate models against your actual use case. Never select a model based on public benchmarks alone -- always test with your data.

Simple model scorecard evaluation
Minimum viable evaluation set

Start with 50-100 hand-labeled test cases that represent your production distribution. Include easy cases, hard cases, and edge cases. This small investment gives you a repeatable way to compare models and catch regressions when you change prompts or switch models.

Multi-Model Strategies

The most cost-effective production systems use multiple models. A cheap model handles 80% of requests (simple tasks), while an expensive model handles the 20% that require higher capability. This can reduce costs by 60-80% compared to using the expensive model for everything.
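The arithmetic behind that claim can be sketched directly. The per-request prices below are placeholders, not real rates:

```python
# Back-of-envelope blended cost for an 80/20 split between a cheap and an
# expensive model. Both per-request prices are assumed placeholder values.
CHEAP_COST = 0.0005    # $ per request on the cheap model (assumed)
EXPENSIVE_COST = 0.01  # $ per request on the expensive model (assumed)

def blended_cost(share_cheap: float) -> float:
    """Average cost per request when `share_cheap` of traffic is routed cheap."""
    return share_cheap * CHEAP_COST + (1 - share_cheap) * EXPENSIVE_COST

all_expensive = blended_cost(0.0)  # every request on the expensive model
routed = blended_cost(0.8)         # 80% of requests routed to the cheap model
savings = 1 - routed / all_expensive
print(f"savings: {savings:.0%}")   # -> savings: 76% with these placeholder prices
```

With these assumed prices the 80/20 split lands at roughly 76% savings, consistent with the 60-80% range above; your actual savings depend on your traffic mix and provider pricing.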

| Pattern | How it works | Cost savings | Complexity |
|---|---|---|---|
| Router model | Small model classifies task complexity, routes to appropriate model | 50-70% | Medium |
| Cascade | Try cheap model first, escalate to expensive model if quality is low | 30-60% | Medium |
| Task-specific | Different models hardcoded for different task types | 40-60% | Low |
| Confidence-based | Cheap model generates, expensive model re-evaluates low-confidence outputs | 20-50% | High |
The 80/20 rule of model routing

In most applications, 80% of requests are simple enough for o4-mini or Gemini Flash (classification, simple extraction, straightforward Q&A). Only 20% genuinely need GPT-5.4 or Claude Sonnet 4.6 (complex reasoning, nuanced analysis, ambiguous instructions). Route accordingly.

Fallback Chains

API providers have outages, rate limits, and degraded performance. A fallback chain automatically redirects to an alternative model when the primary is unavailable, ensuring your application stays operational.

Model fallback chain with automatic failover
Prompt compatibility across models

When using fallback chains, ensure your prompts work well with all models in the chain. Different models respond differently to the same prompt. Test your prompt with every model in the fallback chain and adjust if needed. System prompt behavior, in particular, varies significantly between OpenAI and Anthropic.

Building a Model Router

A model router examines each incoming request and decides which model should handle it. The routing logic can be rule-based, classifier-based, or even LLM-based.

Simple complexity-based model router
Router cost overhead

The classification call adds ~$0.00002 per request with o4-mini. If it saves you from using GPT-5.4 on 80% of requests, the savings are enormous. A classification call that costs $0.00002 and saves $0.01 per routed request pays for itself 500x over.

Best Practices

Do

  • Build a task-specific evaluation set before comparing models -- benchmarks are not your use case
  • Start with the cheapest model and move up only when quality is demonstrably insufficient
  • Implement fallback chains for production resilience across multiple providers
  • Use model routing to match request complexity with model capability and cost
  • Track actual cost, latency, and quality per model in production to inform optimization

Don’t

  • Don't select models based on public benchmarks alone -- always test with your data
  • Don't use the most expensive model for everything -- most requests are simple enough for cheaper models
  • Don't hardcode a single model without a fallback strategy
  • Don't assume model performance is static -- re-evaluate as new models and versions are released
  • Don't forget that prompt engineering can close the gap between a cheap and expensive model

Key Takeaways

  • Model selection has four dimensions: capability, cost, latency, and privacy -- optimize for your specific constraints.
  • Always build a task-specific evaluation set to compare models, not benchmarks.
  • Multi-model strategies (routing, cascading) can reduce costs 50-80% without sacrificing quality.
  • Fallback chains across providers ensure resilience against outages and rate limits.
  • Start cheap, measure quality, and upgrade only the requests that need it.

Video on this topic

How to pick the right LLM for your app (TikTok)