Model Families Compared
The hard part of model selection isn't capability — all frontier models are good. It's knowing which model to pick for your task, budget, and compliance constraints. This guide maps every major family to the decisions that actually matter in production.
Quick Reference
- →GPT-5.4 ($2.50/$15 per 1M): broad production tasks; GPT-5.4 Mini ($0.75/$4.50) covers ~70% of those at 30% of the price
- →Claude Opus 4.7 ($5/$25 per 1M, 1M ctx): best for complex agentic coding and long-context analysis
- →Claude Sonnet 4.6 ($3/$15 per 1M): balanced coding + instruction following; Haiku 4.5 ($1/$5) for speed
- →Gemini 3.1 Pro ($2/$12 per 1M, 1M ctx): strong multimodal; Flash-Lite ($0.25/$1.50) for volume pipelines
- →DeepSeek V3.2 ($0.28/$0.42 per 1M, MIT): best cost-per-quality ratio among API models
- →Llama 4 Maverick (128 experts, 17B active, 1M ctx): best open-weight model for self-hosted deployments
- →For privacy/compliance: self-host Llama 4 or DeepSeek V3.2 when data cannot leave your infrastructure; no cloud API satisfies that requirement
- →Start expensive, optimize down: validate quality with a flagship model before switching to a cheaper one
The Selection Problem
In 2024, model selection was mostly a capability question — only a few models could do the job well. In 2026, that's no longer the case. GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and DeepSeek V3.2 are all production-capable for most tasks. The selection problem has shifted from 'which model is good enough' to 'which model is right for my cost, latency, compliance, and context requirements.' Getting this wrong doesn't mean building something that doesn't work — it means paying 5–10x more than necessary, or hitting a compliance wall six months after launch.
Build and validate your feature with GPT-5.4 Standard or Claude Sonnet 4.6. Once it works and you have eval data, try GPT-5.4 Mini or DeepSeek V3.2 for the same tasks. You will be surprised how often the cheaper model is good enough — and the quality bar is now defined by data, not intuition.
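One way to make "defined by data" concrete is a small harness that scores both models on the same held-out examples. The sketch below assumes each model is wrapped as a plain callable and that exact-match scoring suits the task. `ModelFn`, `pass_rate`, `cheaper_model_is_good_enough`, and the 0.02 tolerance are illustrative names and values, not part of any provider SDK.

```python
from typing import Callable, Sequence

# Hypothetical type: any function that takes a prompt and returns model text.
ModelFn = Callable[[str], str]


def pass_rate(model: ModelFn, examples: Sequence[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) pairs the model answers exactly."""
    if not examples:
        raise ValueError("eval set is empty")
    hits = sum(
        1 for prompt, expected in examples
        if model(prompt).strip() == expected.strip()
    )
    return hits / len(examples)


def cheaper_model_is_good_enough(
    flagship: ModelFn,
    candidate: ModelFn,
    examples: Sequence[tuple[str, str]],
    max_drop: float = 0.02,  # assumed tolerance; derive it from your own quality bar
) -> bool:
    """True when the candidate scores within max_drop of the flagship."""
    return pass_rate(candidate, examples) >= pass_rate(flagship, examples) - max_drop
```

Exact match is only a stand-in. For generation tasks you would likely swap in a rubric or a judge model, but the comparison logic stays the same.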
Four Decision Axes
Before picking a model, answer these four questions. They narrow the field faster than any benchmark table.
- ▸Budget: What is your input cost ceiling per 1M tokens? If under $1, you are in Mini/Flash-Lite/DeepSeek territory. If under $3, production tier. If cost is secondary, consider flagship or Opus.
- ▸Task complexity: Does the task require multi-step reasoning, long chains of logic, or precise code generation? If yes, use a reasoning model (o3, Opus 4.7, extended thinking on Sonnet). If no, a mid-range model will match quality at a fraction of the cost.
- ▸Compliance / privacy: Must your data stay in a specific country, or is there a hard requirement for EU data residency or self-hosting? If yes, you are limited to Vertex AI (Gemini), EU-region APIs, Mistral (France), or self-hosted open models. No cloud-only API covers every compliance case.
- ▸Context window: Do you need to process documents longer than 200K tokens? GPT-5.4, Gemini 3.1 Pro, and Opus 4.7 all offer 1M-token windows. Llama 4 Scout offers 10M. But long context alone does not guarantee retrieval quality — see the Gemini section.
Pick your quadrant, then tune down to a cheaper model once quality is verified
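The four questions can be read as filters over a model catalog. The following is a minimal sketch under that reading: prices and context sizes are copied from the tables in this guide, while the `Model` dataclass, the `reasoning` flag, and the `shortlist` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    input_per_m: float  # USD per 1M input tokens (hosted API price)
    context: int        # maximum context window in tokens
    open_weights: bool  # True when the model can be self-hosted
    reasoning: bool     # True for the reasoning options named in the axes


# A subset of the models covered in this guide.
CATALOG = [
    Model("GPT-5.4 Mini", 0.75, 1_000_000, False, False),
    Model("GPT-5.4", 2.50, 1_000_000, False, False),
    Model("o3", 2.00, 200_000, False, True),
    Model("Claude Opus 4.7", 5.00, 1_000_000, False, True),
    Model("Claude Sonnet 4.6", 3.00, 200_000, False, True),
    Model("Gemini 3.1 Pro", 2.00, 1_000_000, False, False),
    Model("Gemini 3.1 Flash-Lite", 0.25, 1_000_000, False, False),
    Model("DeepSeek V3.2", 0.28, 128_000, True, False),
]


def shortlist(
    max_input_per_m: float,
    needs_reasoning: bool,
    must_self_host: bool,
    min_context: int,
) -> list[str]:
    """Names of catalog models that satisfy all four axes."""
    return [
        m.name
        for m in CATALOG
        if m.input_per_m <= max_input_per_m
        and (m.reasoning or not needs_reasoning)
        and (m.open_weights or not must_self_host)
        and m.context >= min_context
    ]
```

A call such as `shortlist(1.0, False, False, 200_000)` narrows the field to the sub-$1 models with a large enough window. The result is a list of candidates to evaluate, not a final choice.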
GPT Family (OpenAI)
OpenAI's lineup spans from sub-cent-per-call Nano to the reasoning-first o3. The key shift in 2026 is that GPT-5.4 Standard is rarely the right default — GPT-5.4 Mini matches its quality on most general tasks at one-third the output cost.
| Model | Context | Input $/1M | Output $/1M | Best for |
|---|---|---|---|---|
| GPT-5.4 Nano | 1M | $0.20 | $1.25 | Classification, routing, simple extraction — highest-volume pipelines |
| GPT-5.4 Mini | 1M | $0.75 | $4.50 | General production tasks — replaces GPT-5.4 Standard for ~70% of use cases |
| GPT-5.4 | 1M | $2.50 | $15.00 | Complex generation, tool use, multi-turn conversations |
| GPT-5 | 1M | $1.25 | $10.00 | Earlier GPT-5 release; unified reasoning + vision + tool use |
| o3 | 200K | $2.00 | $8.00 | Multi-step math, science, and code reasoning |
| o4-mini | 200K | $1.10 | $4.40 | Reasoning tasks on a tighter budget; adjustable reasoning effort |
GPT-5.4 input pricing doubles from $2.50 to $5.00 per 1M tokens once your prompt exceeds 272K tokens. If you are stuffing long documents into context, budget for this jump. For most long-context use cases, Gemini 3.1 Pro ($2/$12 flat up to 1M) is cheaper above 200K tokens.
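To see what the jump does to a single request, the arithmetic can be sketched as below. It assumes the higher rate applies to the entire prompt once the 272K threshold is crossed, which this guide does not spell out, so confirm the billing rule on the provider's pricing page. The function names are illustrative.

```python
def gpt54_input_cost(prompt_tokens: int) -> float:
    """USD input cost for one GPT-5.4 request under the tiered pricing above.

    Assumption: above 272K tokens the $5.00 rate applies to the whole
    prompt, not only to the tokens past the threshold.
    """
    rate = 5.00 if prompt_tokens > 272_000 else 2.50
    return prompt_tokens / 1_000_000 * rate


def gemini31_pro_input_cost(prompt_tokens: int) -> float:
    """USD input cost at the flat $2.00 per 1M rate quoted above."""
    return prompt_tokens / 1_000_000 * 2.00


for tokens in (200_000, 300_000):
    print(tokens, gpt54_input_cost(tokens), gemini31_pro_input_cost(tokens))
```

Under that assumption, a 300K-token prompt costs about $1.50 in input on GPT-5.4 versus about $0.60 on Gemini 3.1 Pro, while a 200K-token prompt costs about $0.50 versus $0.40.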
Use o3 when the task requires a chain of verifiable reasoning steps — proofs, algorithm design, debugging subtle logic errors. Use GPT-5.4 for everything else. o3 is slower by design; GPT-5.4 Mini is faster and cheaper for tasks that do not require deep reasoning.
Avoid GPT models when (1) your data residency requirements are stricter than OpenAI's US or Azure regions can satisfy, (2) you need to self-host for compliance or cost, since the GPT-5.4 family has no open-weight release, (3) you need 200K+ context at a flat rate, because the 272K pricing cliff makes it uncompetitive for very long documents, or (4) you are doing code-first agentic tasks where Claude Opus 4.7 measurably outperforms.
Claude Family (Anthropic)
Anthropic's Claude 4.x series leads on instruction following, long-context fidelity, and agentic coding. Claude Opus 4.7, released April 2026, delivers a 13% improvement on complex coding benchmarks over Opus 4.6 and is now the recommended choice for coding agents. The tokenizer was updated in Opus 4.7, so the same input maps to 1.0–1.35x as many tokens as it did on Opus 4.6. Budget for up to 35% higher token counts when estimating costs.
| Model | Context | Input $/1M | Output $/1M | Best for |
|---|---|---|---|---|
| Claude Opus 4.7 | 1M | $5.00 | $25.00 | Complex agentic coding, multi-tool orchestration, long-context analysis |
| Claude Sonnet 4.6 | 200K | $3.00 | $15.00 | Production coding, complex instructions, document analysis |
| Claude Haiku 4.5 | 200K | $1.00 | $5.00 | Fast, cost-effective tasks where Sonnet quality is not required |
Claude models support extended thinking: internal reasoning tokens for complex tasks, similar to o3. Enable it when the task requires multi-step planning or verification. Separately, two pricing features reduce cost. Prompt caching cuts the cost of repeated system prompts by up to 90%, and batch processing cuts costs by up to 50% for work that does not need an immediate response. If your system prompt is large and reused across requests, caching should always be enabled.
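A rough estimate of what caching is worth for a given prompt shape can be sketched as follows. It takes the "up to 90%" figure as a best-case discount on the cached system prompt and deliberately ignores any surcharge for writing the cache and any cache expiry. The function name and parameters are illustrative, not part of the Claude API.

```python
def input_cost_per_request(
    system_tokens: int,
    user_tokens: int,
    input_per_m: float = 3.00,      # Claude Sonnet 4.6 input price from the table
    cache_hit: bool = False,
    cached_discount: float = 0.90,  # assumed best case ("up to 90%")
) -> float:
    """USD input cost for one request, with the system prompt optionally cached.

    Simplification: cache-write surcharges and cache expiry are not modeled,
    so real bills will be somewhat higher than this estimate.
    """
    system_rate = input_per_m * (1 - cached_discount) if cache_hit else input_per_m
    return (system_tokens * system_rate + user_tokens * input_per_m) / 1_000_000


uncached = input_cost_per_request(20_000, 500)
cached = input_cost_per_request(20_000, 500, cache_hit=True)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

With a 20,000-token system prompt and a 500-token user message, the per-request input cost falls from about $0.0615 to about $0.0075 under these assumptions. The saving shrinks as the uncached user portion grows relative to the system prompt.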
Avoid Claude when (1) you need to fine-tune the model — fine-tuning is not yet available on the Claude API, (2) your task is pure structured data extraction at high volume — Haiku is good, but GPT-5.4 Nano at $0.20/1M input is cheaper for simple extraction, (3) you need a self-hosted model — Anthropic has no open weights, or (4) you are hitting Claude's refusal boundaries frequently on your task — consider GPT-5.4 or an open model with fewer restrictions.
Gemini Family (Google)
Google's Gemini lineup spans from Flash-Lite (one of the cheapest capable models at $0.25/$1.50) to Gemini 3.1 Pro (strong multimodal, 1M context). The Gemini 1.5 series is shut down; Gemini 2.0 Flash is deprecated and will be retired June 2026. Migrate to the 2.5+ or 3.x models.
| Model | Context | Input $/1M | Output $/1M | Best for |
|---|---|---|---|---|
| Gemini 3.1 Pro | 1M | $2.00 | $12.00 | Latest generation: strong multimodal, complex reasoning, 1M context |
| Gemini 3 Flash | 1M | $0.50 | $3.00 | Balanced cost and quality for general-purpose production tasks |
| Gemini 3.1 Flash-Lite | 1M | $0.25 | $1.50 | High-volume pipelines where cost per call is the primary constraint |
| Gemini 2.5 Pro | 1M | — | — | Previous generation, still supported — check Google AI docs for current pricing |
Gemini supports 1M tokens, but retrieval accuracy degrades noticeably beyond 50K–100K tokens in practice. For most tasks, retrieve the relevant 5K–10K tokens with RAG rather than stuffing 500K of context. The 1M window is genuinely valuable for tasks that require understanding the entire document holistically — codebase analysis, contract comparison, book-length summarization.
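The retrieve-then-prompt approach can be sketched in a few lines. This is a toy version: it ranks chunks by plain word overlap where a real pipeline would use an embedding search, and it approximates token counts by whitespace word counts. `retrieve` and `token_budget` are hypothetical names.

```python
def retrieve(chunks: list[str], query: str, token_budget: int = 8_000) -> list[str]:
    """Pick the chunks that best match the query until the budget is spent."""
    query_words = set(query.lower().split())

    def overlap(chunk: str) -> int:
        # Number of distinct query words that also appear in the chunk.
        return len(query_words & set(chunk.lower().split()))

    selected: list[str] = []
    used = 0
    # sorted() is stable, so chunks with equal scores keep document order.
    for chunk in sorted(chunks, key=overlap, reverse=True):
        if overlap(chunk) == 0:
            break  # every remaining chunk shares no words with the query
        cost = len(chunk.split())  # crude token estimate
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected
```

Only the selected chunks go into the prompt, which keeps the request in the range where recall is most reliable and avoids paying for context the model does not need.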
Avoid Gemini when (1) you need strict instruction adherence across multi-turn conversations — Claude and GPT-5.4 are more reliable here, (2) your pipeline requires fine-grained JSON/structured output fidelity — test carefully before committing, (3) you are outside Google Cloud and want to avoid vendor lock-in — Gemini's best capabilities are tied to Vertex AI and Google Workspace, or (4) the task is primarily coding — Gemini 3.1 Pro is competitive but Claude and GPT-5.4 still lead on complex code generation.
Open Models (Llama, DeepSeek, Qwen)
Open-weight models have closed the gap with proprietary APIs significantly. DeepSeek V3.2 at $0.28/$0.42 per 1M tokens is competitive with GPT-5.4 on many tasks at one-ninth the input cost. Llama 4's MoE architecture means only 17B parameters are active per inference pass, making it GPU-efficient despite its total parameter count.
| Model | Architecture | Context | License | Strengths |
|---|---|---|---|---|
| Llama 4 Maverick | 128 experts, 17B active | 1M | Llama 4 Community | Best open-weight general model; matches GPT-4o-level benchmarks |
| Llama 4 Scout | 16 experts, 17B active | 10M | Llama 4 Community | Largest context window of any open model; fits on a single H100 |
| DeepSeek V3.2 | 671B MoE | 128K | MIT | Best cost-per-quality on hosted APIs ($0.28/$0.42); cache hits $0.028/1M |
| Qwen 3.5 (397B-A17B) | 397B total, 17B active MoE | 262K | Apache 2.0 | Strong multilingual (201 languages), excellent coding, 8x cheaper than non-MoE equivalent |
| Mistral Large | 123B | 128K | Research + Commercial | EU-hosted option; strong multilingual, good for GDPR-constrained workloads |
Running Llama 4 Maverick (128 experts) requires substantial GPU infrastructure, not a single consumer GPU. A well-optimized Maverick deployment needs multiple H100s. If your API spend is below roughly $10K–$30K/month, managed APIs (Together AI, Fireworks, Groq) are almost always cheaper than self-hosting. Self-hosting makes sense when: (a) compliance requires it, (b) you need fine-tuning control, or (c) your API costs consistently exceed $30K/month.
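The cost side of that decision is a simple break-even comparison. The sketch below shows the shape of the calculation only: the GPU count, hourly rate, and operations overhead are placeholder inputs you must replace with your own quotes, and the function names are illustrative.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def monthly_self_host_cost(
    gpu_count: int,
    gpu_hourly_rate: float,
    monthly_ops_overhead: float = 0.0,
) -> float:
    """USD per month to keep the GPUs running around the clock."""
    return gpu_count * gpu_hourly_rate * HOURS_PER_MONTH + monthly_ops_overhead


def self_hosting_is_cheaper(
    monthly_api_spend: float,
    gpu_count: int,
    gpu_hourly_rate: float,
    monthly_ops_overhead: float = 0.0,
) -> bool:
    """True when the self-hosted bill undercuts the current API bill."""
    cost = monthly_self_host_cost(gpu_count, gpu_hourly_rate, monthly_ops_overhead)
    return cost < monthly_api_spend


# Placeholder inputs, not real quotes: 8 GPUs at $3.00/hour plus $10,000/month of ops.
print(monthly_self_host_cost(8, 3.00, 10_000.0))
```

With those placeholder inputs the self-hosted bill comes to $27,520 per month, so an API bill of $20,000 would not justify the move on cost alone, while one of $35,000 would. Compliance and fine-tuning needs are separate arguments that this calculation does not capture.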
- ▸Qwen 3.5-397B-A17B activates only 17B parameters per token — inference cost is similar to a 20B dense model despite 397B total parameters
- ▸DeepSeek V3.2 with prompt caching drops to $0.028/1M input on cache hits, roughly 100x cheaper than Claude Sonnet 4.6's standard $3/1M input rate
- ▸Hosted open model APIs (Together AI, Fireworks, Groq) give you open model quality with API convenience — no infra required
- ▸Apache 2.0 (Qwen) and MIT (DeepSeek) licenses have the fewest commercial restrictions; Llama 4 Community License is permissive but has usage caps at scale
Head-to-Head: Decision Guide
Input cost ($/1M tokens) vs capability tier — most production tasks live in Mid-Range
| Decision Factor | Use GPT-5.4 / Mini | Use Claude Sonnet / Opus | Use Gemini 3.1 Pro | Use Open (Llama 4 / DeepSeek) |
|---|---|---|---|---|
| Code generation | Strong (Mini sufficient for most) | First choice — Opus 4.7 leads on agentic coding | Competitive but not best-in-class | DeepSeek V3.2 competitive; Llama 4 Maverick solid |
| Context > 200K | GPT-5.4 / GPT-5 (1M, watch 272K cliff) | Opus 4.7 (1M); Sonnet 4.6 (200K) | First choice — 1M flat pricing | Llama 4 Scout (10M); Maverick (1M) |
| Multimodal (image/audio/video) | GPT-5.4 + vision supported | Vision supported; no native audio/video | First choice — native multimodal model | Llama 4 natively multimodal |
| Strict instruction following | Very good | First choice — best multi-turn adherence | Good | Varies — test your prompt |
| Privacy / self-host | Not available | Not available | Vertex AI only; no self-host | First choice — full control |
| Cost-sensitive volume (>1M calls/day) | GPT-5.4 Nano ($0.20/1M) or Mini | Haiku 4.5 ($1/1M) | Flash-Lite ($0.25/1M) | DeepSeek V3.2 ($0.28/1M) or hosted Llama |
| Fine-tuning required | Supported | Not yet available | Supported | Full control — any architecture |
GPT-4 input cost $30 per 1M tokens in March 2023. DeepSeek V3.2 is $0.28 today: a roughly 100x drop in input cost in about three years, across different providers. Design your cost models to be flexible. Re-evaluate pricing quarterly; what was cost-prohibitive last quarter may be affordable today.
Multi-Model Architecture
The right answer to 'which model should I use' is almost always 'multiple models.' A routing layer classifies the incoming request, then dispatches to the appropriate model tier. Cheap models handle simple tasks; expensive models handle complex ones; open models handle sensitive data. This pattern reduces costs dramatically without sacrificing quality where it matters.
| Layer | Model | Role | Why |
|---|---|---|---|
| Routing / classification | GPT-5.4 Nano or Gemini Flash-Lite | Classify request type and complexity | $0.25/1M input or less, a negligible cost for a routing call |
| General tasks | GPT-5.4 Mini or DeepSeek V3.2 | Handle 60–70% of requests | Production quality at mid-range price |
| Complex / agentic | Claude Opus 4.7 or o3 | Multi-step reasoning, coding agents | Only invoked when routing determines it is needed |
| Sensitive data | Self-hosted Llama 4 or DeepSeek V3.2 | Requests that cannot leave your infra | Compliance gate — not a quality fallback |
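The layers in the table can be sketched as a two-step dispatch. In this minimal version the classifier is a rule-based stand-in for the cheap model call, and the tier labels are placeholder strings, not real API model identifiers. `TIERS`, `classify`, `route`, and the request fields are all hypothetical.

```python
# Tier -> model label. Labels are illustrative placeholders, not API model IDs.
TIERS = {
    "simple": "gpt-5.4-mini",
    "complex": "claude-opus-4.7",
    "sensitive": "self-hosted-llama-4",
}


def classify(request: dict) -> str:
    """Stand-in for the routing call a Nano or Flash-Lite model would make."""
    if request.get("contains_pii"):
        # The compliance gate runs first so sensitive data never reaches a cloud API.
        return "sensitive"
    if request.get("needs_multi_step"):
        return "complex"
    return "simple"


def route(request: dict) -> str:
    """Return the model label that should handle this request."""
    return TIERS[classify(request)]


print(route({"text": "Extract the invoice number"}))
```

Checking the sensitive-data condition before complexity reflects the table's note that the self-hosted tier is a compliance gate, not a quality fallback.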
A document processing pipeline initially used Claude Sonnet 4.6 for all requests — OCR cleanup, entity extraction, and summarization. After adding a routing classifier (GPT-5.4 Nano), 65% of requests were classified as 'simple extraction' and dispatched to GPT-5.4 Mini. Monthly API cost dropped by 58% with no measurable change in output quality, measured by a held-out eval set of 400 document pairs.
This is a common pattern: routing pays for itself in 2–3 days of traffic.
Abstract your LLM calls behind a thin provider layer. When GPT-5.4 pricing changes, or when Claude Opus 4.8 ships, you want to swap models with a config change, not a refactor. The models will change faster than your business logic.
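A thin provider layer might look like the sketch below. It assumes a single `complete(model, prompt)` method is enough for your use case. `ChatProvider`, `EchoProvider`, `CONFIG`, and `run` are hypothetical names, and `EchoProvider` is a fake used so the example runs without network access; a real adapter would wrap a vendor SDK behind the same method.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatProvider(Protocol):
    """The only surface business logic is allowed to depend on."""

    def complete(self, model: str, prompt: str) -> str: ...


@dataclass
class EchoProvider:
    """Fake provider for local runs; replace with an adapter around a vendor SDK."""

    prefix: str

    def complete(self, model: str, prompt: str) -> str:
        return f"[{self.prefix}:{model}] {prompt}"


# Task -> (provider key, model label). Swapping models means editing this mapping.
CONFIG = {
    "summarize": ("openai", "gpt-5.4-mini"),
    "code_agent": ("anthropic", "claude-opus-4.7"),
}

PROVIDERS: dict[str, ChatProvider] = {
    "openai": EchoProvider("openai"),
    "anthropic": EchoProvider("anthropic"),
}


def run(task: str, prompt: str, config: dict = CONFIG, providers: dict = PROVIDERS) -> str:
    """Look up the configured provider and model for a task, then call it."""
    provider_key, model = config[task]
    return providers[provider_key].complete(model, prompt)


print(run("summarize", "Summarize this ticket"))
```

Because call sites only know task names, moving `summarize` to a different provider is a one-line change to `CONFIG` and touches no business logic.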
Best Practices
Do
- ✓Start with GPT-5.4 Standard or Claude Sonnet 4.6 to validate quality, then optimize down to Mini / Haiku / Flash-Lite
- ✓Build a routing classifier that dispatches cheap models for simple tasks before reaching the expensive model
- ✓Enable prompt caching on Claude — up to 90% cost reduction for repeated system prompts with no quality change
- ✓Re-evaluate model pricing quarterly — costs drop 2–4x per year; what was unaffordable in Q1 may be affordable in Q3
- ✓Use self-hosted open models (Llama 4, DeepSeek V3.2) for any data that cannot leave your infrastructure
- ✓Test with your actual task data, not just benchmarks — MMLU scores don't predict performance on your specific domain
- ✓Set per-task quality gates before optimizing for cost — you need a baseline eval before you can know if the cheaper model passes
- ✓Use o3 or Opus 4.7 extended thinking for tasks with verifiable correct answers — math, code, structured reasoning
Don’t
- ✗Don't default to GPT-5.4 Standard when GPT-5.4 Mini exists — test Mini first for every new task
- ✗Don't assume the most expensive model is always the best for your task — DeepSeek V3.2 outperforms GPT-5.4 on some benchmarks at one-ninth the input cost
- ✗Don't commit to a single provider without a migration plan — model deprecations happen quarterly
- ✗Don't ignore open models for production — Llama 4 Maverick and DeepSeek V3.2 are production-viable
- ✗Don't use benchmarks as your only selection signal — run evals on your own data before committing
- ✗Don't stuff 500K tokens of context into Gemini and expect perfect recall — long context and long-context retrieval accuracy are different properties
- ✗Don't pick a model before answering the four axes: budget, complexity, compliance, and context requirements
- ✗Don't ignore the GPT-5.4 context pricing cliff at 272K tokens — it can double your per-request cost unexpectedly
Key Takeaways
- ✓GPT-5.4 Mini ($0.75/$4.50) covers ~70% of production use cases that GPT-5.4 Standard handles — always test Mini first.
- ✓Claude Opus 4.7 (April 2026) leads on complex agentic coding with a 13% improvement on coding benchmarks over Opus 4.6; same pricing ($5/$25).
- ✓Gemini 3.1 Pro offers 1M-token context at flat pricing ($2/$12) — the most cost-predictable large-context option above 200K tokens.
- ✓DeepSeek V3.2 ($0.28/$0.42, MIT) delivers GPT-5.4-competitive quality at one-ninth the input cost — the default open-model API choice.
- ✓Qwen 3.5 is a 397B-total / 17B-active MoE model (Apache 2.0), not a 72B dense model — inference costs match a ~20B model.
- ✓Multi-model routing (cheap classifier → tiered models) reduces API costs 50–70% on mixed workloads with no measurable quality loss.