Model Families Compared

The hard part of model selection isn't capability — all frontier models are good. It's knowing which model to pick for your task, budget, and compliance constraints. This guide maps every major family to the decisions that actually matter in production.

Quick Reference

→GPT-5.4 ($2.50/$15 per 1M): broad production tasks; GPT-5.4 Mini ($0.75/$4.50) covers ~70% of those at 3x lower cost
→Claude Opus 4.7 ($5/$25 per 1M, 1M ctx): best for complex agentic coding and long-context analysis
→Claude Sonnet 4.6 ($3/$15 per 1M): balanced coding + instruction following; Haiku 4.5 ($1/$5) for speed
→Gemini 3.1 Pro ($2/$12 per 1M, 1M ctx): strong multimodal; Flash-Lite ($0.25/$1.50) for volume pipelines
→DeepSeek V3.2 ($0.28/$0.42 per 1M, MIT): best cost-per-quality ratio among API models
→Llama 4 Maverick (128 experts, 17B active, 1M ctx): best open-weight model for self-hosted deployments
→For data that cannot leave your infrastructure: self-host Llama 4 or DeepSeek V3.2; no cloud-only API qualifies
→Start expensive, optimize down: validate quality with a flagship model before switching to a cheaper one

In this article

1.The Selection Problem
2.Four Decision Axes
3.GPT Family (OpenAI)
4.Claude Family (Anthropic)
5.Gemini Family (Google)
6.Open Models (Llama, DeepSeek, Qwen)
7.Head-to-Head: Decision Guide
8.Multi-Model Architecture
★Best Practices
✓Key Takeaways

The Selection Problem

In 2024, model selection was mostly a capability question — only a few models could do the job well. In 2026, that's no longer the case. GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and DeepSeek V3.2 are all production-capable for most tasks. The selection problem has shifted from 'which model is good enough' to 'which model is right for my cost, latency, compliance, and context requirements.' Getting this wrong doesn't mean building something that doesn't work — it means paying 5–10x more than necessary, or hitting a compliance wall six months after launch.

Default strategy: start with a flagship, optimize down

Build and validate your feature with GPT-5.4 Standard or Claude Sonnet 4.6. Once it works and you have eval data, try GPT-5.4 Mini or DeepSeek V3.2 for the same tasks. You will be surprised how often the cheaper model is good enough — and the quality bar is now defined by data, not intuition.
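One way to make "the quality bar is defined by data" concrete is a small gate that compares per-example eval scores from the flagship and the cheaper candidate. The sketch below is illustrative only: the function name, the 0.02 tolerance, and the score lists are assumptions, not values from a real eval.

```python
from statistics import mean


def passes_quality_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Return True if the candidate's mean score is within max_drop of the baseline.

    Both lists must hold scores for the same eval examples, in the same order.
    The 0.02 default tolerance is an arbitrary illustrative choice.
    """
    if not baseline_scores or len(baseline_scores) != len(candidate_scores):
        raise ValueError("need paired, non-empty score lists")
    return mean(baseline_scores) - mean(candidate_scores) <= max_drop


# Hypothetical per-example scores (1.0 = correct, 0.0 = wrong).
flagship = [1.0, 1.0, 0.0, 1.0, 1.0]
cheaper = [1.0, 1.0, 0.0, 1.0, 0.0]

print(passes_quality_gate(flagship, cheaper))   # False: mean drops from 0.8 to 0.6
print(passes_quality_gate(flagship, flagship))  # True: no drop at all
```

In practice the tolerance would be set per task before any cost optimization starts, so the cheaper model is judged against a fixed bar.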

Four Decision Axes

Before picking a model, answer these four questions. They narrow the field faster than any benchmark table.

▸Budget: What is your input cost ceiling per 1M tokens? If under $1, you are in Mini/Flash-Lite/DeepSeek territory. If under $3, production tier. If cost is secondary, consider flagship or Opus.
▸Task complexity: Does the task require multi-step reasoning, long chains of logic, or precise code generation? If yes, use a reasoning model (o3, Opus 4.7, extended thinking on Sonnet). If no, a mid-range model will match quality at a fraction of the cost.
▸Compliance / privacy: Is your data barred from leaving the country, or is there a hard requirement for EU data residency or self-hosting? If yes, you are limited to Vertex AI (Gemini), EU-region APIs, Mistral (France), or self-hosted open models. No cloud-only API covers every compliance case.
▸Context window: Do you need to process documents longer than 200K tokens? GPT-5.4, Gemini 3.1 Pro, and Opus 4.7 all offer 1M-token windows. Llama 4 Scout offers 10M. But long context alone does not guarantee retrieval quality — see the Gemini section.
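The budget, compliance, and context axes can be applied mechanically as filters; task complexity still needs an eval, so it is left out here. The sketch below uses prices and context sizes quoted in this guide. The `Model` type, the `shortlist` function, and the `self_hostable` flags are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    input_usd_per_1m: float
    context_tokens: int
    self_hostable: bool  # assumption: True only where open weights are available


CATALOG = [
    Model("GPT-5.4 Mini", 0.75, 1_000_000, False),
    Model("GPT-5.4", 2.50, 1_000_000, False),
    Model("Claude Sonnet 4.6", 3.00, 200_000, False),
    Model("Claude Opus 4.7", 5.00, 1_000_000, False),
    Model("Gemini 3.1 Pro", 2.00, 1_000_000, False),
    Model("Gemini 3.1 Flash-Lite", 0.25, 1_000_000, False),
    Model("DeepSeek V3.2", 0.28, 128_000, True),
]


def shortlist(max_input_usd, min_context_tokens, must_self_host=False):
    """Return model names that satisfy all three filters, cheapest first."""
    matches = [
        m for m in CATALOG
        if m.input_usd_per_1m <= max_input_usd
        and m.context_tokens >= min_context_tokens
        and (m.self_hostable or not must_self_host)
    ]
    return [m.name for m in sorted(matches, key=lambda m: m.input_usd_per_1m)]


# Under $1 per 1M input tokens, with at least a 200K context window.
print(shortlist(max_input_usd=1.00, min_context_tokens=200_000))
# Self-hosting required, no price or context constraint.
print(shortlist(max_input_usd=100.0, min_context_tokens=0, must_self_host=True))
```

The shortlist only narrows the field; the surviving candidates still have to pass an eval on your own task data.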

Pick your quadrant, then tune down to a cheaper model once quality is verified

GPT Family (OpenAI)

OpenAI's lineup spans from sub-cent-per-call Nano to the reasoning-first o3. The key shift in 2026 is that GPT-5.4 Standard is rarely the right default — GPT-5.4 Mini matches its quality on most general tasks at one-third the output cost.

Model	Context	Input $/1M	Output $/1M	Best for
GPT-5.4 Nano	1M	$0.20	$1.25	Classification, routing, simple extraction — highest-volume pipelines
GPT-5.4 Mini	1M	$0.75	$4.50	General production tasks — replaces GPT-5.4 Standard for ~70% of use cases
GPT-5.4	1M	$2.50	$15.00	Complex generation, tool use, multi-turn conversations
GPT-5	1M	$1.25	$10.00	Earlier GPT-5 release; unified reasoning + vision + tool use
o3	200K	$2.00	$8.00	Multi-step math, science, and code reasoning
o4-mini	200K	$1.10	$4.40	Reasoning tasks on a tighter budget; adjustable reasoning effort

The GPT-5.4 context pricing cliff

GPT-5.4 input pricing doubles from $2.50 to $5.00 per 1M tokens once your prompt exceeds 272K tokens. If you are stuffing long documents into context, budget for this jump. Gemini 3.1 Pro ($2/$12 flat up to 1M) is already cheaper on input at any prompt length, and past 272K tokens it costs less than half as much.
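The cliff is easiest to see as arithmetic. The sketch below assumes the higher rate applies to the whole prompt once it crosses 272K tokens (the text above does not say whether only the excess is billed at $5.00), and the function names are made up for this example.

```python
GPT54_CLIFF_TOKENS = 272_000


def gpt54_input_cost_usd(prompt_tokens):
    """Input cost for one request, assuming the doubled rate covers the whole prompt."""
    rate = 5.00 if prompt_tokens > GPT54_CLIFF_TOKENS else 2.50
    return prompt_tokens / 1_000_000 * rate


def gemini31_pro_input_cost_usd(prompt_tokens):
    """Input cost for one request at the flat $2.00 per 1M rate."""
    return prompt_tokens / 1_000_000 * 2.00


for tokens in (272_000, 273_000, 500_000):
    print(
        f"{tokens:>7} tokens: "
        f"GPT-5.4 ${gpt54_input_cost_usd(tokens):.3f}, "
        f"Gemini 3.1 Pro ${gemini31_pro_input_cost_usd(tokens):.3f}"
    )
```

Under that assumption, adding 1,000 tokens to a 272K prompt roughly doubles the GPT-5.4 input bill for the request, which is why trimming a prompt to stay just under the threshold can be worth the effort.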

When to use o3 vs GPT-5.4

Use o3 when the task requires a chain of verifiable reasoning steps — proofs, algorithm design, debugging subtle logic errors. Use GPT-5.4 for everything else. o3 is slower by design; GPT-5.4 Mini is faster and cheaper for tasks that do not require deep reasoning.

When NOT to use the GPT family

Avoid GPT models when (1) your data residency requirements rule out the US/Azure regions where these models are served, (2) you need to self-host for compliance or cost, since OpenAI has no open-weight option, (3) you need 272K+ context at a flat rate, where the pricing cliff makes GPT-5.4 uncompetitive for very long documents, or (4) you are doing code-first agentic tasks where Claude Opus 4.7 measurably outperforms.

Claude Family (Anthropic)

Anthropic's Claude 4.x series leads on instruction following, long-context fidelity, and agentic coding. Claude Opus 4.7, released April 2026, delivers a 13% improvement on complex coding benchmarks over Opus 4.6 and is now the recommended choice for coding agents. The tokenizer was updated in Opus 4.7, so the same input maps to 1.0–1.35x as many tokens as it did under Opus 4.6; budget for this when estimating costs.

Model	Context	Input $/1M	Output $/1M	Best for
Claude Opus 4.7	1M	$5.00	$25.00	Complex agentic coding, multi-tool orchestration, long-context analysis
Claude Sonnet 4.6	200K	$3.00	$15.00	Production coding, complex instructions, document analysis
Claude Haiku 4.5	200K	$1.00	$5.00	Fast, cost-effective tasks where Sonnet quality is not required

Extended thinking and prompt caching

Claude models support extended thinking: internal reasoning tokens for complex tasks, similar to o3. Enable it when the task requires multi-step planning or verification. Separately, prompt caching cuts input costs by up to 90% for repeated system prompts, and batch processing is discounted by up to 50%. If your system prompt is large and reused across requests, caching should always be enabled.
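To budget for both effects mentioned in this section, a rough estimator can scale token counts by the tokenizer ratio and discount the cached system prompt. This is a sketch under simplifying assumptions: the full 90% discount applies to every cached token, any surcharge for writing to the cache is ignored, and the function name and token counts are invented for the example.

```python
def claude_input_cost_usd(system_tokens, user_tokens, usd_per_1m,
                          cache_discount=0.0, tokenizer_ratio=1.0):
    """Estimate input cost for one request.

    cache_discount: fraction removed from the system-prompt cost on a cache hit.
    tokenizer_ratio: multiplier on token counts measured with an older tokenizer
    (1.0 to 1.35 for Opus 4.7 relative to Opus 4.6, per the text above).
    """
    billed_system = system_tokens * tokenizer_ratio * (1.0 - cache_discount)
    billed_user = user_tokens * tokenizer_ratio
    return (billed_system + billed_user) / 1_000_000 * usd_per_1m


# Sonnet 4.6 at $3 per 1M: 20K-token system prompt, 500-token user message.
uncached = claude_input_cost_usd(20_000, 500, 3.00)
cached = claude_input_cost_usd(20_000, 500, 3.00, cache_discount=0.90)
print(f"Sonnet uncached ${uncached:.4f}, cache hit ${cached:.4f}")

# Opus 4.7 at $5 per 1M, worst-case tokenizer growth, no caching.
worst_case = claude_input_cost_usd(20_000, 500, 5.00, tokenizer_ratio=1.35)
print(f"Opus 4.7 worst case ${worst_case:.4f}")
```

The larger the reused system prompt is relative to the per-request user text, the closer the overall saving gets to the headline discount.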

When NOT to use the Claude family

Avoid Claude when (1) you need to fine-tune the model — fine-tuning is not yet available on the Claude API, (2) your task is pure structured data extraction at high volume — Haiku is good, but GPT-5.4 Nano at $0.20/1M input is cheaper for simple extraction, (3) you need a self-hosted model — Anthropic has no open weights, or (4) you are hitting Claude's refusal boundaries frequently on your task — consider GPT-5.4 or an open model with fewer restrictions.

Gemini Family (Google)

Google's Gemini lineup spans from Flash-Lite (one of the cheapest capable models at $0.25/$1.50) to Gemini 3.1 Pro (strong multimodal, 1M context). The Gemini 1.5 series is shut down; Gemini 2.0 Flash is deprecated and will be retired June 2026. Migrate to the 2.5+ or 3.x models.

Model	Context	Input $/1M	Output $/1M	Best for
Gemini 3.1 Pro	1M	$2.00	$12.00	Latest generation: strong multimodal, complex reasoning, 1M context
Gemini 3 Flash	1M	$0.50	$3.00	Balanced cost and quality for general-purpose production tasks
Gemini 3.1 Flash-Lite	1M	$0.25	$1.50	High-volume pipelines where cost per call is the primary constraint
Gemini 2.5 Pro	1M	—	—	Previous generation, still supported — check Google AI docs for current pricing

Long context quality vs. quantity

Gemini supports 1M tokens, but retrieval accuracy degrades noticeably beyond 50K–100K tokens in practice. For most tasks, retrieve the relevant 5K–10K tokens with RAG rather than stuffing 500K of context. The 1M window is genuinely valuable for tasks that require understanding the entire document holistically — codebase analysis, contract comparison, book-length summarization.
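The "retrieve a few thousand relevant tokens" approach can be sketched as ranking chunks and packing them into a fixed token budget. The example below is deliberately crude: word overlap stands in for an embedding-based retriever, one word is treated as one token, and the function name and sample chunks are hypothetical.

```python
def select_chunks(query, chunks, token_budget):
    """Greedily keep the most query-relevant chunks that fit in token_budget.

    Relevance is the number of distinct query words a chunk contains, a
    stand-in for a real retriever. Chunks with no overlap are never selected.
    """
    query_words = set(query.lower().split())

    def overlap(chunk):
        return len(query_words & set(chunk.lower().split()))

    picked, used = [], 0
    for chunk in sorted(chunks, key=overlap, reverse=True):
        cost = len(chunk.split())  # crude proxy: one word counts as one token
        if overlap(chunk) > 0 and used + cost <= token_budget:
            picked.append(chunk)
            used += cost
    return picked


chunks = [
    "termination clause requires 30 days notice",
    "the office is in Berlin",
    "notice must be in writing",
]
print(select_chunks("termination notice period", chunks, token_budget=11))
```

Only the selected chunks go into the prompt, so the model sees a short, relevant context instead of the whole document.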

When NOT to use the Gemini family

Avoid Gemini when (1) you need strict instruction adherence across multi-turn conversations — Claude and GPT-5.4 are more reliable here, (2) your pipeline requires fine-grained JSON/structured output fidelity — test carefully before committing, (3) you are outside Google Cloud and want to avoid vendor lock-in — Gemini's best capabilities are tied to Vertex AI and Google Workspace, or (4) the task is primarily coding — Gemini 3.1 Pro is competitive but Claude and GPT-5.4 still lead on complex code generation.

Open Models (Llama, DeepSeek, Qwen)

Open-weight models have closed the gap with proprietary APIs significantly. DeepSeek V3.2 at $0.28/$0.42 per 1M tokens is competitive with GPT-5.4 on many tasks at one-ninth the input cost. Llama 4's MoE architecture means only 17B parameters are active per inference pass, making it GPU-efficient despite its total parameter count.

Model	Architecture	Context	License	Strengths
Llama 4 Maverick	128 experts, 17B active	1M	Llama 4 Community	Best open-weight general model; matches GPT-4o-level benchmarks
Llama 4 Scout	16 experts, 17B active	10M	Llama 4 Community	Largest context window of any open model; fits on a single H100
DeepSeek V3.2	671B MoE	128K	MIT	Best cost-per-quality on hosted APIs ($0.28/$0.42); cache hits $0.028/1M
Qwen 3.5 (397B-A17B)	397B total, 17B active MoE	262K	Apache 2.0	Strong multilingual (201 languages), excellent coding, 8x cheaper than non-MoE equivalent
Mistral Large	123B	128K	Research + Commercial	EU-hosted option; strong multilingual, good for GDPR-constrained workloads

Self-hosting is not free

Running Llama 4 Maverick (128 experts) requires substantial GPU infrastructure, not a single consumer GPU. A well-optimized Maverick deployment needs multiple H100s. If your API spend is below roughly $10K–$30K/month, managed APIs (Together AI, Fireworks, Groq) are almost always cheaper than self-hosting. Self-hosting makes sense when: (a) compliance requires it, (b) you need fine-tuning control, or (c) your API costs consistently exceed $30K/month.
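The three conditions above reduce to a small check. In this sketch, "consistently" is interpreted as every recent month exceeding the threshold, which is an assumption; the function name and sample figures are illustrative.

```python
from typing import Sequence


def should_self_host(recent_monthly_api_usd: Sequence[float],
                     compliance_requires: bool = False,
                     needs_finetune_control: bool = False,
                     threshold_usd: float = 30_000.0) -> bool:
    """Apply the three self-hosting conditions from the text above."""
    if compliance_requires or needs_finetune_control:
        return True
    # "Consistently exceed" is read here as: every recent month is over the threshold.
    return bool(recent_monthly_api_usd) and all(
        cost > threshold_usd for cost in recent_monthly_api_usd
    )


print(should_self_host([12_000, 18_000, 9_000]))           # False
print(should_self_host([31_000, 45_000, 38_000]))          # True
print(should_self_host([5_000], compliance_requires=True)) # True
```

A single expensive month does not flip the decision, which matches the idea that self-hosting is a response to sustained spend rather than a spike.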

▸Qwen 3.5-397B-A17B activates only 17B parameters per token — inference cost is similar to a 20B dense model despite 397B total parameters
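▸DeepSeek V3.2 with prompt caching drops to $0.028/1M input on cache hits: roughly 100x below Claude Sonnet's $3/1M uncached input price, and still about 10x below Sonnet's input price with a 90% caching discount applied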
▸Hosted open model APIs (Together AI, Fireworks, Groq) give you open model quality with API convenience — no infra required
▸Apache 2.0 (Qwen) and MIT (DeepSeek) licenses have the fewest commercial restrictions; the Llama 4 Community License is permissive but requires a separate license from Meta for companies above 700 million monthly active users

Head-to-Head: Decision Guide

Input cost ($/1M tokens) vs capability tier — most production tasks live in Mid-Range

Decision Factor	Use GPT-5.4 / Mini	Use Claude Sonnet / Opus	Use Gemini 3.1 Pro	Use Open (Llama 4 / DeepSeek)
Code generation	Strong (Mini sufficient for most)	First choice — Opus 4.7 leads on agentic coding	Competitive but not best-in-class	DeepSeek V3.2 competitive; Llama 4 Maverick solid
Context > 200K	GPT-5.4 / GPT-5 (1M, watch 272K cliff)	Opus 4.7 (1M); Sonnet 4.6 (200K)	First choice — 1M flat pricing	Llama 4 Scout (10M); Maverick (1M)
Multimodal (image/audio/video)	GPT-5.4 + vision supported	Vision supported; no native audio/video	First choice — native multimodal model	Llama 4 natively multimodal
Strict instruction following	Very good	First choice — best multi-turn adherence	Good	Varies — test your prompt
Privacy / self-host	Not available	Not available	Vertex AI only; no self-host	First choice — full control
Cost-sensitive volume (>1M calls/day)	GPT-5.4 Nano ($0.20/1M) or Mini	Haiku 4.5 ($1/1M)	Flash-Lite ($0.25/1M)	DeepSeek V3.2 ($0.28/1M) or hosted Llama
Fine-tuning required	Supported	Not yet available	Supported	Full control — any architecture

LLM pricing drops fast — budget for it

GPT-4 input cost $30 per 1M tokens in March 2023. DeepSeek V3.2 is $0.28 today: a roughly 100x drop in input cost in about three years, across different providers. Design your cost models to be flexible. Re-evaluate pricing quarterly; what was cost-prohibitive last quarter may be affordable today.

Multi-Model Architecture

The right answer to 'which model should I use' is almost always 'multiple models.' A routing layer classifies the incoming request, then dispatches to the appropriate model tier. Cheap models handle simple tasks; expensive models handle complex ones; open models handle sensitive data. This pattern reduces costs dramatically without sacrificing quality where it matters.

Layer	Model	Role	Why
Routing / classification	GPT-5.4 Nano or Gemini Flash-Lite	Classify request type and complexity	$0.25/1M input or less; negligible cost for a routing call
General tasks	GPT-5.4 Mini or DeepSeek V3.2	Handle 60–70% of requests	Production quality at mid-range price
Complex / agentic	Claude Opus 4.7 or o3	Multi-step reasoning, coding agents	Only invoked when routing determines it is needed
Sensitive data	Self-hosted Llama 4 or DeepSeek V3.2	Requests that cannot leave your infra	Compliance gate — not a quality fallback
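The table above maps directly onto a small dispatch function. In the sketch below, `classify` is a rule-based stand-in for the call to the cheap routing model, and the model labels, request fields, and task names are all hypothetical, not real API identifiers.

```python
from enum import Enum


class Tier(Enum):
    GENERAL = "general"
    COMPLEX = "complex"
    SENSITIVE = "sensitive"


# Tier -> model label, following the table above. These are plain strings
# for illustration, not real API model identifiers.
TIER_MODELS = {
    Tier.GENERAL: "gpt-5.4-mini",
    Tier.COMPLEX: "claude-opus-4.7",
    Tier.SENSITIVE: "self-hosted-llama-4",
}


def classify(request: dict) -> Tier:
    """Hypothetical stand-in for the routing-model call."""
    # The compliance gate is checked first: sensitive data never leaves
    # your infrastructure, however complex the task is.
    if request.get("contains_sensitive_data", False):
        return Tier.SENSITIVE
    if request.get("task") in {"agentic_coding", "multi_step_reasoning"}:
        return Tier.COMPLEX
    return Tier.GENERAL


def route(request: dict) -> str:
    """Return the model label that should handle this request."""
    return TIER_MODELS[classify(request)]


print(route({"task": "entity_extraction"}))
print(route({"task": "agentic_coding"}))
print(route({"task": "agentic_coding", "contains_sensitive_data": True}))
```

Checking the sensitive-data flag before complexity keeps the compliance tier a hard gate rather than a fallback, as the last row of the table describes.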

Real project

A document processing pipeline initially used Claude Sonnet 4.6 for all requests — OCR cleanup, entity extraction, and summarization. After adding a routing classifier (GPT-5.4 Nano), 65% of requests were classified as 'simple extraction' and dispatched to GPT-5.4 Mini. Monthly API cost dropped by 58% with no measurable change in output quality, measured by a held-out eval set of 400 document pairs.

This is a common pattern: routing pays for itself in 2–3 days of traffic.

Build for model portability from day one

Abstract your LLM calls behind a thin provider layer. When GPT-5.4 pricing changes, or when Claude Opus 4.8 ships, you want to swap models with a config change, not a refactor. The models will change faster than your business logic.
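A thin provider layer can be as small as a mapping from task names to interchangeable callables. The sketch below uses fake providers so it runs without any SDK; `LLMClient`, the provider names, and the task names are assumptions for illustration, and a real implementation would wrap each vendor's client behind the same one-argument signature.

```python
from typing import Callable, Dict

# Every provider is reduced to the same shape: prompt in, text out.
Provider = Callable[[str], str]


def fake_flagship(prompt: str) -> str:
    """Placeholder for a wrapper around a flagship-model SDK call."""
    return f"[flagship] {prompt}"


def fake_mini(prompt: str) -> str:
    """Placeholder for a wrapper around a cheaper-model SDK call."""
    return f"[mini] {prompt}"


class LLMClient:
    """Business logic names a task; configuration decides which provider runs it."""

    def __init__(self, providers: Dict[str, Provider], task_to_provider: Dict[str, str]):
        self._providers = providers
        self._task_to_provider = task_to_provider

    def complete(self, task: str, prompt: str) -> str:
        provider_name = self._task_to_provider[task]
        return self._providers[provider_name](prompt)


providers = {"flagship": fake_flagship, "mini": fake_mini}

client = LLMClient(providers, {"summarize": "flagship"})
print(client.complete("summarize", "Q3 report"))

# Swapping models is a config change; the calling code stays the same.
client = LLMClient(providers, {"summarize": "mini"})
print(client.complete("summarize", "Q3 report"))
```

Because callers only ever pass a task name, the task-to-provider mapping can live in a config file and change without touching business logic.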

Best Practices

✓Start with GPT-5.4 Standard or Claude Sonnet 4.6 to validate quality, then optimize down to Mini / Haiku / Flash-Lite
✓Build a routing classifier that dispatches cheap models for simple tasks before reaching the expensive model
✓Enable prompt caching on Claude — up to 90% cost reduction for repeated system prompts with no quality change
✓Re-evaluate model pricing quarterly — costs drop 2–4x per year; what was unaffordable in Q1 may be affordable in Q3
✓Use self-hosted open models (Llama 4, DeepSeek V3.2) for any data that cannot leave your infrastructure
✓Test with your actual task data, not just benchmarks — MMLU scores don't predict performance on your specific domain
✓Set per-task quality gates before optimizing for cost — you need a baseline eval before you can know if the cheaper model passes
✓Use o3 or Opus 4.7 extended thinking for tasks with verifiable correct answers — math, code, structured reasoning

Don’t

✗Don't default to GPT-5.4 Standard when GPT-5.4 Mini exists — test Mini first for every new task
✗Don't assume the most expensive model is always the best for your task — DeepSeek V3.2 outperforms GPT-5.4 on some benchmarks at one-ninth the input cost
✗Don't commit to a single provider without a migration plan — model deprecations happen quarterly
✗Don't ignore open models for production — Llama 4 Maverick and DeepSeek V3.2 are production-viable
✗Don't use benchmarks as your only selection signal — run evals on your own data before committing
✗Don't stuff 500K tokens of context into Gemini and expect perfect recall — long context and long-context retrieval accuracy are different properties
✗Don't pick a model before answering the four axes: budget, complexity, compliance, and context requirements
✗Don't ignore the GPT-5.4 context pricing cliff at 272K tokens — it can double your per-request cost unexpectedly

Key Takeaways

✓GPT-5.4 Mini ($0.75/$4.50) covers ~70% of production use cases that GPT-5.4 Standard handles — always test Mini first.
✓Claude Opus 4.7 (April 2026) leads on complex agentic coding with a 13% improvement on coding benchmarks over Opus 4.6; same pricing ($5/$25).
✓Gemini 3.1 Pro offers 1M-token context at flat pricing ($2/$12) — the most cost-predictable large-context option above 200K tokens.
✓DeepSeek V3.2 ($0.28/$0.42, MIT) delivers GPT-5.4-competitive quality at one-ninth the input cost — the default open-model API choice.
✓Qwen 3.5 is a 397B-total / 17B-active MoE model (Apache 2.0), not a 72B dense model — inference costs match a ~20B model.
✓Multi-model routing (cheap classifier → tiered models) can reduce API costs by 50–70% on mixed workloads; verify against a held-out eval set that quality does not measurably drop.
