LLM Foundations/The Model Landscape
Overview · Beginner · 12 min

Model Families Compared

A comprehensive comparison of the major LLM families: GPT (OpenAI), Claude (Anthropic), Gemini (Google), and leading open models (Llama, Mistral, Qwen). Pricing, capabilities, context windows, and when to use each.

Quick Reference

  • GPT-5.4: recommended production model, strong tool use, 1M context, $2/$8 per 1M tokens
  • Claude Sonnet 4.6: best for long-context, coding, and careful instruction following, 200K context, $3/$15 per 1M tokens
  • Gemini 3.1 Pro: large context window (1M tokens), competitive quality, strong multimodal, $1/$10 per 1M tokens
  • Llama 4 Maverick (400B MoE): best open model, competitive with GPT-5, free for most commercial use
  • o3/o4-mini (OpenAI reasoning models): best for math, science, and complex reasoning
  • Choose based on: task requirements, budget, latency needs, and privacy constraints

GPT Family (OpenAI)

OpenAI's GPT family is the most widely adopted LLM family, with the broadest ecosystem of tools, tutorials, and integrations. The lineup now spans from GPT-5 (the latest flagship) down to the cost-effective o4-mini reasoning model.

| Model | Context | Input $/1M | Output $/1M | Best for |
| --- | --- | --- | --- | --- |
| GPT-5 | 1M | $1.25 | $10.00 | Latest flagship: unified reasoning, vision, and tool use |
| GPT-5.4 | 1M | $2.00 | $8.00 | Recommended production model; replaced GPT-4o |
| GPT-4o | 128K | $2.50 | $10.00 | Still available but superseded by GPT-5.4 and GPT-5 |
| o3 | 200K | $2.00 | $8.00 | Best reasoning model; 87% cheaper than older reasoning models |
| o4-mini | 200K | $1.10 | $4.40 | Cost-effective reasoning with tunable effort levels |
  • GPT-5.4 is the default recommendation for most production applications -- it replaced GPT-4o in that role
  • GPT-5 is the latest flagship model with up to 1M context and unified reasoning capabilities
  • o3 and o4-mini are the current reasoning models -- o1 is now legacy
  • Strong function calling and tool use support across the lineup
  • First-party support for JSON mode, structured outputs (response_format), and vision
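Structured outputs constrain the model to a JSON schema you supply. The sketch below builds a request body in the OpenAI Chat Completions style; the `response_format` / `json_schema` shape follows OpenAI's structured-outputs API, while the model name and the invoice schema are illustrative assumptions from this article, not a definitive implementation.

```python
# Sketch of a structured-output request body (OpenAI Chat Completions style).
# Model name "gpt-5.4" and the invoice schema are assumptions for illustration.
import json

def build_extraction_request(text: str) -> dict:
    """Build a request that forces the model to return schema-validated JSON."""
    return {
        "model": "gpt-5.4",  # assumed name, taken from this article
        "messages": [
            {"role": "system", "content": "Extract the invoice fields as JSON."},
            {"role": "user", "content": text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "invoice",
                "strict": True,  # reject outputs that do not match the schema
                "schema": {
                    "type": "object",
                    "properties": {
                        "vendor": {"type": "string"},
                        "total": {"type": "number"},
                    },
                    "required": ["vendor", "total"],
                    "additionalProperties": False,
                },
            },
        },
    }

req = build_extraction_request("ACME Corp -- total due: $1,240.00")
print(json.dumps(req["response_format"]["type"]))  # "json_schema"
```

With `strict` mode, malformed outputs are rejected at the API layer rather than surfacing as parse errors in your application code.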
When to use o3 vs GPT-5.4

Use o3 when the task requires multi-step reasoning, mathematical proof, or complex analysis. Use GPT-5.4 for everything else -- it is faster, cheaper, and better at straightforward tasks like summarization, extraction, and conversation. Reasoning models are slower by design, but o3 is now far more affordable than its predecessors.

Claude Family (Anthropic)

Anthropic's Claude family emphasizes safety, long-context performance, and careful instruction following. The lineup has evolved significantly — Claude 3.5 Sonnet and Claude 3 Opus are now legacy models, replaced by the Claude 4.x series.

| Model | Context | Input $/1M | Output $/1M | Best for |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 1M | $5.00 | $25.00 | Most capable: complex reasoning, coding, analysis, agentic tasks |
| Claude Sonnet 4.6 | 200K | $3.00 | $15.00 | Balanced: coding, analysis, long documents, complex instructions |
| Claude Haiku 4.5 | 200K | $1.00 | $5.00 | Fast, cost-effective tasks with good quality |
  • Claude Opus 4.6 offers a 1M token context window -- the largest in the Claude lineup
  • Claude Sonnet 4.6 is the recommended model for most production use cases, balancing quality and cost
  • Excellent at following complex, multi-constraint instructions without 'forgetting' requirements
  • Prompt caching reduces costs by up to 90% for repeated system prompts; batch processing offers 50% savings
  • Strong refusal behavior -- Claude will decline harmful requests more consistently than competitors
  • Extended thinking: Claude can use internal reasoning tokens for complex tasks, similar to o3
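The prompt-caching savings above come from marking a large, repeated prefix as cacheable. A minimal sketch of the request body, assuming the Anthropic Messages API shape (the `cache_control` block follows Anthropic's prompt-caching convention; the model name is taken from this article):

```python
# Sketch of an Anthropic Messages API request with prompt caching.
# The cache_control shape follows Anthropic's prompt-caching API;
# "claude-sonnet-4.6" is an assumed model name from this article.

def build_cached_request(long_system_prompt: str, user_msg: str) -> dict:
    """Mark the large, repeated system prompt as cacheable so subsequent
    requests with the same prefix reuse it at a reduced input price."""
    return {
        "model": "claude-sonnet-4.6",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": long_system_prompt,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }

req = build_cached_request(
    "You are a contract-review assistant. [several thousand tokens of policy]",
    "Review clause 4 for termination risks.",
)
print(req["system"][0]["cache_control"]["type"])  # ephemeral
```

Only the user message varies between calls here, so the expensive system prefix is paid for once and then served from cache.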
Claude's system prompt handling

Claude treats system prompts as a separate, privileged input with stronger adherence than user messages. This makes Claude particularly good at maintaining personas, following output format requirements, and respecting constraints throughout long conversations.

Gemini Family (Google)

Google's Gemini models have evolved rapidly. Gemini 1.5 Pro and Flash are now shut down, and Gemini 2.0 Flash is deprecated (shutdown June 2026). The current lineup spans from Gemini 2.5 to 3.1, with strong multimodal capabilities and competitive pricing.

| Model | Context | Input $/1M | Output $/1M | Best for |
| --- | --- | --- | --- | --- |
| Gemini 3.1 Pro | 1M | $1.00 | $10.00 | Newest generation; strong reasoning and multimodal |
| Gemini 3 Flash | 1M | $0.30 | $2.50 | Balanced cost and quality, good general-purpose |
| Gemini 3 Flash-Lite | 1M | $0.10 | $0.40 | High-volume tasks, extremely cost-effective |
  • 1M token context window across the lineup -- enables processing entire codebases or book-length documents
  • Gemini 3 Flash-Lite is one of the cheapest capable models available -- excellent for high-volume pipelines
  • Native multimodal: processes images, audio, and video in a single model (not separate vision/audio models)
  • Strong integration with Google Cloud, Vertex AI, and Google Workspace
  • Gemini 3.1 series introduces the latest generation with improved reasoning and agentic capabilities
  • Note: Gemini 1.5 models are shut down and Gemini 2.0 Flash is deprecated -- migrate to 2.5+ models
Long context quality vs quantity

While Gemini supports 1M tokens, retrieval accuracy still degrades with very long contexts. For most tasks, you are better off using retrieval to find the relevant 5K-10K tokens than stuffing 500K tokens of context. The 1M window is most useful for tasks that genuinely require understanding the entire document, like codebase analysis or book summarization.

Open Models (Llama, Mistral, Qwen)

Open-weight models have closed the gap with proprietary models dramatically. Llama 4 Maverick is competitive with GPT-5 on many benchmarks, and the open-source ecosystem now includes strong MoE architectures that offer excellent quality-per-dollar.

| Model | Parameters | Context | License | Strengths |
| --- | --- | --- | --- | --- |
| Llama 4 Maverick | 400B (128 experts, 17B active) | 1M | Llama 4 Community | MoE architecture, competitive with GPT-5 |
| Llama 4 Scout | 109B (16 experts, 17B active) | 10M | Llama 4 Community | Largest context window of any open model (10M tokens) |
| Qwen 3.5 (72B) | 72B | 128K | Apache 2.0 | Latest generation, excellent multilingual, strong coding |
| DeepSeek V3.2 | 671B (MoE) | 128K | MIT | Unified model replacing V3 and R1, very competitive pricing ($0.28/$0.42 per 1M tokens) |
| Mistral Large | 123B | 128K | Research + Commercial | Strong European model, excellent multilingual |
Open does not mean free

Self-hosting a 70B model requires significant GPU infrastructure (2x A100 80GB minimum). Factor in hardware, ops, monitoring, and scaling before choosing self-hosting over API providers. For most startups, API providers are cheaper until you hit ~$10K-50K/month in API costs.
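As a worked example of the warning above, a quick back-of-envelope model. All figures below are illustrative assumptions (rented GPU rates and ops overhead vary widely), not quotes:

```python
# Back-of-envelope break-even for self-hosting vs. API usage.
# GPU rate and ops figures are illustrative assumptions, not real quotes.

def monthly_self_host_cost(gpu_hourly: float, n_gpus: int, ops_monthly: float) -> float:
    """Estimated monthly cost of a dedicated GPU deployment running 24/7."""
    return gpu_hourly * n_gpus * 24 * 30 + ops_monthly

# e.g. 2x A100 80GB at an assumed ~$2.50/GPU-hour, plus ~$4,000/month of
# engineering/ops time for monitoring, upgrades, and on-call
cost = monthly_self_host_cost(2.50, 2, 4000)
print(round(cost))  # 7600 -- per month, before you serve a single token
```

Against a fixed floor like this, API spend below roughly that figure favors providers, which is consistent with the ~$10K-50K/month break-even range cited above once scaling and redundancy are included.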

  • Llama 4 uses MoE (Mixture of Experts) architecture -- only 17B parameters active per inference, making it very efficient
  • Apache 2.0 models (Qwen 3.5) and MIT-licensed models (DeepSeek V3.2) have the fewest restrictions for commercial use
  • DeepSeek V3.2 unified the V3 base model and R1 reasoning model into a single model
  • Hosted open model APIs (Together, Fireworks, Groq) offer a middle ground: open model quality with API convenience

Head-to-Head Comparison

| Dimension | GPT-5.4 / GPT-5 | Claude Sonnet 4.6 | Gemini 3.1 Pro | Llama 4 Maverick |
| --- | --- | --- | --- | --- |
| General quality | Excellent | Excellent | Excellent | Very good |
| Coding | Very good | Excellent | Good | Good |
| Long context | Excellent (1M) | Very good (200K) | Excellent (1M) | Excellent (1M) |
| Multimodal | Very good | Good (vision) | Excellent (native) | Good |
| Instruction following | Very good | Excellent | Very good | Good |
| Speed (TTFT) | Fast | Fast | Fast | Depends on hosting |
| Cost (per 1M in/out) | $2/$8 (GPT-5.4) | $3/$15 | $1/$10 | Self-hosted or API |
| Privacy | Cloud only | Cloud only | Cloud only | Self-host option |
| Fine-tuning | Supported | Not yet | Supported | Full control |
Don't pick one -- use multiple

Production systems often use multiple models. A common pattern: Gemini 3 Flash-Lite or o4-mini for routing and classification (cheap, fast), Claude Sonnet 4.6 or GPT-5.4 for complex tasks (high quality), and a self-hosted Llama 4 model for privacy-sensitive data. Build your architecture to swap models easily.
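The multi-model pattern above reduces to a small routing function. The sketch below is a placeholder heuristic, not a recommendation: model names come from this article, and the task categories and PII flag are assumptions for illustration.

```python
# Illustrative multi-model router: a cheap model for routing/classification,
# a frontier model for complex work, a self-hosted model for private data.
# Model names are from this article; routing rules are placeholder heuristics.

def pick_model(task_type: str, contains_pii: bool) -> str:
    """Return the model to call for a given request."""
    if contains_pii:
        # Privacy-sensitive data never leaves our infrastructure.
        return "llama-4-maverick-self-hosted"
    if task_type in {"route", "classify", "simple_extract"}:
        # High-volume, low-difficulty work goes to the cheapest capable model.
        return "gemini-3-flash-lite"
    # Everything else gets a frontier model.
    return "claude-sonnet-4.6"

print(pick_model("classify", contains_pii=False))        # gemini-3-flash-lite
print(pick_model("contract_review", contains_pii=True))  # llama-4-maverick-self-hosted
```

Keeping this decision in one function (or a config file) is what makes the "swap models easily" advice practical: changing providers touches one place, not every call site.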

Pricing changes frequently

LLM pricing drops 2-3x per year. GPT-4 cost $30/$60 per 1M tokens in March 2023; GPT-5.4 costs $2/$8 in early 2026 -- a 15x drop in input price in under three years. Design your cost models to be flexible, and re-evaluate pricing quarterly.
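A flexible cost model can be as simple as a price table kept out of the call path, so a price change is a config edit rather than a code change. A minimal sketch using this article's prices (which will drift):

```python
# Per-request cost model. Prices are this article's figures in $ per 1M
# tokens (input, output) and will drift -- keep them in config, not code.

PRICES = {
    "gpt-5.4": (2.00, 8.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gemini-3.1-pro": (1.00, 10.00),
}

def request_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of one request at the configured per-1M-token prices."""
    p_in, p_out = PRICES[model]
    return (in_tokens * p_in + out_tokens * p_out) / 1_000_000

# 10K input / 1K output on GPT-5.4: 10_000*2/1e6 + 1_000*8/1e6
print(round(request_cost("gpt-5.4", 10_000, 1_000), 4))  # 0.028
```

Multiplying per-request cost by projected volume gives the monthly figure to compare against the self-hosting break-even discussed earlier.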

Best Practices

Do

  • Start with GPT-5.4 or Claude Sonnet 4.6 -- they cover the widest range of tasks well
  • Use the cheapest model that meets quality requirements -- test smaller models first
  • Build model-agnostic architectures that let you swap providers without rewriting code
  • Re-evaluate model selection quarterly -- the landscape changes rapidly
  • Consider multi-model strategies: cheap models for simple tasks, expensive models for complex ones

Don’t

  • Don't assume the most expensive model is always the best for your task
  • Don't commit to a single provider without a migration plan
  • Don't ignore open models -- they are viable for many production use cases
  • Don't choose based on benchmarks alone -- test with your actual data and tasks
  • Don't forget to account for rate limits, availability, and support quality in your selection

Key Takeaways

  • GPT-5.4 is the recommended production model, with GPT-5 as the latest flagship for complex tasks.
  • Claude Sonnet 4.6 excels at coding, long-context tasks, and strict instruction following; Opus 4.6 offers 1M context.
  • Gemini 3.1 Pro and the new 3.1 series offer strong multimodal support — note that 1.5 and 2.0 models are deprecated.
  • Open models (Llama 4, DeepSeek V3.2, Qwen 3.5) are competitive with proprietary models and offer privacy/customization benefits.
  • Production systems should use multiple models for different tasks -- build for model portability from day one.

Video on this topic

GPT vs Claude vs Gemini vs Llama: which should you use?
