Model Families Compared
A comprehensive comparison of the major LLM families: GPT (OpenAI), Claude (Anthropic), Gemini (Google), and leading open models (Llama, Mistral, Qwen). Pricing, capabilities, context windows, and when to use each.
Quick Reference
- GPT-5.4: recommended production model, strong tool use, 1M context, $2/$8 per 1M tokens
- Claude Sonnet 4.6: best for long-context, coding, and careful instruction following; 200K context, $3/$15 per 1M tokens
- Gemini 3.1 Pro: large context window (1M tokens), competitive quality, strong multimodal, $1/$10 per 1M tokens
- Llama 4 Maverick (400B MoE): best open model, competitive with GPT-5, free for most commercial use
- o3/o4-mini (OpenAI reasoning models): best for math, science, and complex reasoning
- Choose based on: task requirements, budget, latency needs, and privacy constraints
GPT Family (OpenAI)
OpenAI's GPT family is the most widely adopted LLM family, with the broadest ecosystem of tools, tutorials, and integrations. The lineup now spans from GPT-5 (the latest flagship) down to the cost-effective o4-mini reasoning model.
| Model | Context | Input $/1M | Output $/1M | Best for |
|---|---|---|---|---|
| GPT-5 | 1M | $1.25 | $10.00 | Latest flagship: unified reasoning, vision, and tool use |
| GPT-5.4 | 1M | $2.00 | $8.00 | Recommended production model, replaced GPT-4o |
| GPT-4o | 128K | $2.50 | $10.00 | Still available but superseded by GPT-5.4 and GPT-5 |
| o3 | 200K | $2.00 | $8.00 | Best reasoning model, 87% cheaper than older reasoning models |
| o4-mini | 200K | $1.10 | $4.40 | Cost-effective reasoning, tunable effort levels |
- GPT-5.4 is the default recommendation for most production applications -- it replaced GPT-4o as the recommended model
- GPT-5 is the latest flagship model with up to 1M context and unified reasoning capabilities
- o3 and o4-mini are the current reasoning models -- o1 is now legacy
- Strong function calling and tool use support across the lineup
- First-party support for JSON mode, structured outputs (response_format), and vision
Use o3 when the task requires multi-step reasoning, mathematical proof, or complex analysis. Use GPT-5.4 for everything else -- it is faster, cheaper, and better at straightforward tasks like summarization, extraction, and conversation. The reasoning models are slower by design but o3 is now much more affordable than older reasoning models.
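Structured outputs constrain the model to a JSON schema rather than hoping free-form output parses. A minimal sketch of the request payload, assuming the `response_format` shape from OpenAI's structured-outputs feature (the invoice schema and field names here are illustrative, not from this article):

```python
# Sketch: Chat Completions payload using structured outputs.
# The "invoice" schema is an illustrative assumption.

def build_extraction_request(model: str, text: str) -> dict:
    """Build a chat request that forces JSON output matching a schema."""
    schema = {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
        },
        "required": ["vendor", "total"],
        "additionalProperties": False,
    }
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Extract invoice fields as JSON."},
            {"role": "user", "content": text},
        ],
        # json_schema mode rejects any output that does not match `schema`.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "invoice", "strict": True, "schema": schema},
        },
    }

payload = build_extraction_request("gpt-5.4", "Acme Corp, total $1,200")
```

Sending this payload to the API is unchanged from a normal chat call; only the `response_format` field is added.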
Claude Family (Anthropic)
Anthropic's Claude family emphasizes safety, long-context performance, and careful instruction following. The lineup has evolved significantly — Claude 3.5 Sonnet and Claude 3 Opus are now legacy models, replaced by the Claude 4.x series.
| Model | Context | Input $/1M | Output $/1M | Best for |
|---|---|---|---|---|
| Claude Opus 4.6 | 1M | $5.00 | $25.00 | Most capable: complex reasoning, coding, analysis, agentic tasks |
| Claude Sonnet 4.6 | 200K | $3.00 | $15.00 | Balanced: coding, analysis, long documents, complex instructions |
| Claude Haiku 4.5 | 200K | $1.00 | $5.00 | Fast, cost-effective tasks with good quality |
- Claude Opus 4.6 offers a 1M token context window -- the largest in the Claude lineup
- Claude Sonnet 4.6 is the recommended model for most production use cases, balancing quality and cost
- Excellent at following complex, multi-constraint instructions without 'forgetting' requirements
- Prompt caching reduces costs by up to 90% for repeated system prompts; batch processing offers 50% savings
- Strong refusal behavior -- Claude will decline harmful requests more consistently than competitors
- Extended thinking: Claude can use internal reasoning tokens for complex tasks, similar to o3
Claude treats system prompts as a separate, privileged input with stronger adherence than user messages. This makes Claude particularly good at maintaining personas, following output format requirements, and respecting constraints throughout long conversations.
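Prompt caching works by marking a stable prefix, typically the long system prompt, as cacheable so repeated calls reuse it at a reduced input rate. A sketch of the Messages API payload shape, assuming Anthropic's `cache_control` block format (model id and prompt text are illustrative):

```python
# Sketch: Anthropic Messages payload with prompt caching.
# The cache_control marker tells the API to cache the large,
# unchanging system block across requests.

LONG_SYSTEM_PROMPT = "You are a contract-review assistant. " * 50  # stands in for a large prompt

def build_cached_request(user_msg: str) -> dict:
    return {
        "model": "claude-sonnet-4-6",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache this block
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }

req = build_cached_request("Summarize clause 4.2.")
```

Only the parts before the cache marker are reused; the per-turn user messages are billed normally.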
Gemini Family (Google)
Google's Gemini models have evolved rapidly. Gemini 1.5 Pro and Flash are now shut down, and Gemini 2.0 Flash is deprecated (shutdown June 2026). The current lineup spans from Gemini 2.5 to 3.1, with strong multimodal capabilities and competitive pricing.
| Model | Context | Input $/1M | Output $/1M | Best for |
|---|---|---|---|---|
| Gemini 3.1 Pro | 1M | $1.00 | $10.00 | Newest generation, most capable; strong reasoning and multimodal |
| Gemini 3 Flash | 1M | $0.30 | $2.50 | Balanced cost and quality, good general-purpose |
| Gemini 3 Flash-Lite | 1M | $0.10 | $0.40 | High-volume tasks, extremely cost-effective |
- 1M token context window across the lineup -- enables processing entire codebases or book-length documents
- Gemini 3 Flash-Lite is one of the cheapest capable models available -- excellent for high-volume pipelines
- Native multimodal: processes images, audio, and video in a single model (not separate vision/audio models)
- Strong integration with Google Cloud, Vertex AI, and Google Workspace
- Gemini 3.1 series introduces the latest generation with improved reasoning and agentic capabilities
- Note: Gemini 1.5 models are shut down and Gemini 2.0 Flash is deprecated -- migrate to 2.5+ models
While Gemini supports 1M tokens, retrieval accuracy still degrades with very long contexts. For most tasks, you are better off using retrieval to find the relevant 5K-10K tokens than stuffing 500K tokens of context. The 1M window is most useful for tasks that genuinely require understanding the entire document, like codebase analysis or book summarization.
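The retrieve-then-prompt pattern described above can be sketched with a trivial keyword-overlap scorer; a real system would use embeddings, but the shape is the same (chunking and scoring here are deliberately minimal assumptions):

```python
# Sketch: select the top-k most relevant chunks instead of sending the
# whole document. Word-overlap scoring stands in for embedding similarity.

def top_k_chunks(document: str, query: str, k: int = 2, chunk_size: int = 50) -> list[str]:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    q_terms = set(query.lower().split())
    # Score each chunk by how many query terms it contains.
    scored = sorted(chunks, key=lambda c: len(q_terms & set(c.lower().split())), reverse=True)
    return scored[:k]

doc = "billing uses stripe webhooks " * 30 + "auth uses oauth tokens " * 30
relevant = top_k_chunks(doc, "how does auth work with oauth", k=1)
```

The selected chunks then go into a normal prompt, keeping the request to a few thousand tokens instead of hundreds of thousands.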
Open Models (Llama, Mistral, Qwen)
Open-weight models have closed the gap with proprietary models dramatically. Llama 4 Maverick is competitive with GPT-5 on many benchmarks, and the open-source ecosystem now includes strong MoE architectures that offer excellent quality-per-dollar.
| Model | Parameters | Context | License | Strengths |
|---|---|---|---|---|
| Llama 4 Maverick | 400B (128 experts, 17B active) | 1M | Llama 4 Community | MoE architecture, competitive with GPT-5 |
| Llama 4 Scout | 109B (16 experts, 17B active) | 10M | Llama 4 Community | Largest context window of any open model (10M tokens) |
| Qwen 3.5 (72B) | 72B | 128K | Apache 2.0 | Latest generation, excellent multilingual, strong coding |
| DeepSeek V3.2 | 671B (MoE) | 128K | MIT | Unified model replacing V3 and R1, very competitive pricing ($0.28/$0.42 per 1M tokens) |
| Mistral Large | 123B | 128K | Research + Commercial | Strong European model, excellent multilingual |
Self-hosting a 70B model requires significant GPU infrastructure (2x A100 80GB minimum). Factor in hardware, ops, monitoring, and scaling before choosing self-hosting over API providers. For most startups, API providers are cheaper until you hit ~$10K-50K/month in API costs.
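A rough way to sanity-check that break-even point: compare rented-GPU cost against per-token API cost. All dollar figures below are illustrative assumptions, not vendor quotes:

```python
# Sketch: self-hosting vs. per-token API cost. GPU rental rate, ops
# overhead, and the $2/1M API price are rough assumptions.

def self_host_monthly_cost(gpu_hourly: float = 4.0, gpus: int = 2, ops_overhead: float = 3000.0) -> float:
    """2x A100-class GPUs rented around the clock, plus ops/monitoring."""
    return gpu_hourly * gpus * 24 * 30 + ops_overhead

def api_monthly_cost(million_tokens: float, price_per_million: float = 2.0) -> float:
    return million_tokens * price_per_million

hosting = self_host_monthly_cost()   # ~$8,760/month under these assumptions
break_even_mtok = hosting / 2.0      # million tokens/month where API spend matches
```

Under these numbers the crossover sits in the thousands of millions of tokens per month, consistent with the rule of thumb that APIs stay cheaper until monthly spend reaches five figures.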
- Llama 4 uses MoE (Mixture of Experts) architecture -- only 17B parameters active per inference, making it very efficient
- Apache 2.0 models (Qwen 3.5) and MIT-licensed models (DeepSeek V3.2) have the fewest restrictions for commercial use
- DeepSeek V3.2 unified the V3 base model and R1 reasoning model into a single model
- Hosted open model APIs (Together, Fireworks, Groq) offer a middle ground: open model quality with API convenience
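One reason the hosted-API middle ground is attractive: these providers generally expose OpenAI-compatible chat endpoints, so switching is mostly a base-URL change. A sketch (the URLs below are from each provider's public docs at the time of writing; verify before use):

```python
# Sketch: hosted open-model providers with OpenAI-compatible endpoints.
# Base URLs are assumptions to verify against each provider's docs.

PROVIDERS = {
    "together": "https://api.together.xyz/v1",
    "fireworks": "https://api.fireworks.ai/inference/v1",
    "groq": "https://api.groq.com/openai/v1",
}

def chat_endpoint(provider: str) -> str:
    """Return the OpenAI-style chat completions URL for a provider."""
    return f"{PROVIDERS[provider]}/chat/completions"
```

With an OpenAI-style client, pointing `base_url` at one of these values is typically all the migration required.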
Head-to-Head Comparison
| Dimension | GPT-5.4 / GPT-5 | Claude Sonnet 4.6 | Gemini 3.1 Pro | Llama 4 Maverick |
|---|---|---|---|---|
| General quality | Excellent | Excellent | Excellent | Very good |
| Coding | Very good | Excellent | Good | Good |
| Long context | Excellent (1M) | Very good (200K) | Excellent (1M) | Excellent (1M) |
| Multimodal | Very good | Good (vision) | Excellent (native) | Good |
| Instruction following | Very good | Excellent | Very good | Good |
| Speed (TTFT) | Fast | Fast | Fast | Depends on hosting |
| Cost (per 1M in/out) | $2/$8 (GPT-5.4) | $3/$15 | $1/$10 | Self-hosted or API |
| Privacy | Cloud only | Cloud only | Cloud only | Self-host option |
| Fine-tuning | Supported | Not yet | Supported | Full control |
Production systems often use multiple models. A common pattern: Gemini 3 Flash-Lite or o4-mini for routing and classification (cheap, fast), Claude Sonnet 4.6 or GPT-5.4 for complex tasks (high quality), and a self-hosted Llama 4 model for privacy-sensitive data. Build your architecture to swap models easily.
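A minimal sketch of that routing pattern; the task taxonomy and model ids are illustrative, and a production router would also consider latency budgets and fallbacks:

```python
# Sketch: route requests to a cheap model for simple tasks and a
# strong model for complex ones. Model ids are illustrative.

ROUTES = {
    "classify": "gemini-3-flash-lite",  # cheap, fast
    "extract":  "gemini-3-flash-lite",
    "code":     "claude-sonnet-4-6",    # high quality
    "analyze":  "gpt-5.4",
    "pii":      "llama-4-maverick",     # self-hosted for privacy
}

def pick_model(task: str, contains_pii: bool = False) -> str:
    """Pick a model for a task; the privacy constraint overrides everything."""
    if contains_pii:
        return ROUTES["pii"]
    return ROUTES.get(task, "gpt-5.4")  # sensible default for unknown tasks
```

Keeping this mapping in one place is what makes the "swap models easily" advice practical: re-pointing a task at a new model is a one-line change.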
LLM pricing drops 2-3x per year. GPT-4 cost $30/$60 per 1M tokens in March 2023. GPT-5.4 costs $2/$8 in early 2026 -- a 15x reduction in input price (7.5x for output) in under three years. Design your cost models to be flexible, and re-evaluate pricing quarterly.
Best Practices
Do
- Start with GPT-5.4 or Claude Sonnet 4.6 -- they cover the widest range of tasks well
- Use the cheapest model that meets quality requirements -- test smaller models first
- Build model-agnostic architectures that let you swap providers without rewriting code
- Re-evaluate model selection quarterly -- the landscape changes rapidly
- Consider multi-model strategies: cheap models for simple tasks, expensive models for complex ones
Don’t
- Don't assume the most expensive model is always the best for your task
- Don't commit to a single provider without a migration plan
- Don't ignore open models -- they are viable for many production use cases
- Don't choose based on benchmarks alone -- test with your actual data and tasks
- Don't forget to account for rate limits, availability, and support quality in your selection
Key Takeaways
- GPT-5.4 is the recommended production model, with GPT-5 as the latest flagship for complex tasks.
- Claude Sonnet 4.6 excels at coding, long-context tasks, and strict instruction following; Opus 4.6 offers 1M context.
- Gemini 3.1 Pro and the new 3.1 series offer strong multimodal support -- note that 1.5 and 2.0 models are deprecated.
- Open models (Llama 4, DeepSeek V3.2, Qwen 3.5) are competitive with proprietary models and offer privacy/customization benefits.
- Production systems should use multiple models for different tasks -- build for model portability from day one.