LLM Foundations/The Model Landscape
Intermediate14 min

Open vs Closed Models

The false binary of 'pick one side' has been replaced by three deployment tiers in 2026: closed API, hosted open model API, and self-hosted. The quality gap has nearly closed on most tasks. The decision is now mostly about ops budget, volume, and licensing risk — not capability.

Quick Reference

  • The quality gap between top open and top closed models is now single-digit percentage points on most benchmarks
  • Three deployment tiers exist: closed API, hosted open model API (3–10× cheaper, zero ops), and self-hosted
  • Self-hosting beats closed APIs at ~40M tokens/day; beats hosted open APIs only at ~1B tokens/day once ops overhead is included
  • Llama 4 Community License bans EU-domiciled companies from using any Llama 4 model (all are multimodal)
  • Apache 2.0 models with frontier-adjacent quality in 2026: Gemma 4, Qwen 3.5, DeepSeek V3.2, Mistral Small 4
  • SGLang delivers ~29% higher throughput than vLLM on H100s — inference engine choice matters as much as model choice
  • INT8 quantization (AWQ/GPTQ) halves GPU memory requirements with <2% quality loss on most standard tasks

Should You Even Choose?

The 'open vs closed' framing implies a binary decision. In practice, most production AI systems use 2–4 models. A frontier closed model handles the hardest reasoning. A mid-tier model (often closed or hosted open) handles the majority of traffic. A small specialized model handles high-volume classification or embeddings. The right question is not which side to pick — it is which model tier matches which workload.

  • Closed models: highest quality ceiling, zero infrastructure, highest cost per token, no data privacy guarantees
  • Hosted open model APIs (Together, Fireworks, Groq): open model quality at 3–10× lower cost than closed APIs, zero self-hosting ops — the most overlooked tier
  • Self-hosted open models: full control, data stays on your infrastructure, lowest cost at scale, but requires an ML infra team
The middle tier most teams skip

Hosted open model APIs give you open model pricing without a single GPU to manage. If you are paying more than $500/month in closed model API fees and do not have strict on-premises data requirements, you have almost certainly not evaluated this tier seriously enough.

The Quality Gap in 2026

The gap between the best open and best closed models has compressed significantly. As of April 2026, top open models match closed models on knowledge benchmarks and are within single-digit percentage points on most reasoning tasks. The gap is still real on the hardest tasks — complex multi-step reasoning, frontier coding challenges — but it is no longer categorical.

Task typeGap (open vs closed)Notes
Reading comprehension, summarization, extractionEffectively zeroStandard RAG pipelines: open models fully competitive
Code generation (standard)1–3 pointsSWE-bench Verified: ~3pt gap between top open and top closed
Multi-step reasoning3–8 pointsClosed models maintain an edge; gap is shrinking each quarter
Instruction following2–5 pointsBest open models (Llama 4 Maverick, Qwen 3.5) close to parity
Domain-specific tasks (fine-tuned)Open often winsFine-tuned open model on your data routinely beats generic closed model
Benchmarks measure the average case

General benchmark scores do not predict performance on your specific task. Run your own evaluation on a representative 200–500 prompt sample from your actual workload before drawing conclusions. A fine-tuned 70B open model on your domain data will often match or beat a frontier closed model on your specific task while costing 10× less.

FrontierProductionMid-RangeBudget$0$1$2$3$4$5Input cost per 1M tokensClaude Opus 4.7GPT-5Gemini 3.1 ProGPT-5.4Sonnet 4.6DeepSeek V3.2GPT-5.4 MiniFlash-LiteGPT-5.4 Nano$5.00$1.25$2.00$2.50$3.00$0.28$0.75$0.25$0.20

Input cost ($/1M tokens) vs capability tier — most production tasks live in Mid-Range

Three Deployment Models

In 2026 there are three distinct deployment tiers, not two. The middle tier — hosted open model APIs — has matured into a first-class option that most teams underutilize.

more openmore openClosed APIGPT-5.4Claude Sonnet 4.6Gemini 3.1 ProCost$$$OpsNoneControlLowPrivacyProvider SLAHosted Open APITogether AIFireworks AIGroqCost$$OpsNoneControlMediumPrivacyProvider SLASelf-HostedLlama 4Qwen 3.5Gemma 4Cost$OpsFull teamControlFullPrivacyYour infra← more ops · lower cost per token at scale · more control →

Hosted Open API is the overlooked middle tier — open model pricing, zero ops burden

TierExamplesCost (relative)Ops burdenBest for
Closed APIGPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro$$$$NoneHighest-quality tasks, prototyping, when simplicity matters
Hosted Open APITogether AI, Fireworks AI, Groq$$NoneProduction workloads where cost matters but ops expertise is limited
Self-HostedLlama 4, Qwen 3.5, Gemma 4$Full infra team requiredPrivacy/compliance requirements, very high volume (>300M tok/day), maximum control
Default to hosted open model APIs for the first migration

If you are moving away from a closed API for cost reasons, try a hosted open model API first. You get 3–10× cost reduction with zero operational complexity. Self-hosting only makes sense once you have validated quality on the hosted tier AND your volume or privacy requirements justify the ops investment.

Self-Hosting Economics: Real Math

The economics of self-hosting depend on three variables: token volume, model throughput on your GPU, and the often-forgotten cost of engineering time. Using H100 80GB cloud spot at $2/hr (Apr 2026 market rate), a production cluster with 2× H100 minimum for reliability, and 0.1 FTE of ongoing ops overhead:

Daily volumeClosed API/moHosted Open API/moSelf-Hosted/mo (incl. ops)Decision
10M tokens/day~$900~$150~$4,994 (2×H100 + ops)Hosted open wins clearly
50M tokens/day~$4,500~$750~$4,994 (2×H100 + ops)Hosted open wins; self-host beats closed API only
300M tokens/day~$27,000~$4,500~$8,750 (4×H100 + ops)Self-host beats closed; hosted open still competitive
1B tokens/day~$90,000~$15,000~$20,200 (10×H100 + ops)Self-host wins over both
The ops overhead that breaks the math

The table above includes 0.1 FTE of ongoing ops cost (~$1,250/month at a $150k loaded salary). Initial setup costs more: plan for 0.5 FTE of engineering time to configure vLLM or SGLang, set up model serving, monitoring, and load balancing. At $150k/year, that is $6,250/month for the first 3 months. Most self-hosting ROI calculations that look compelling are forgetting this line.

$0$15k$30k$45k$60kcrossover~40M/day↑ $90k10M50M150M500M1BDaily token volumeClosed APIHosted Open APISelf-HostedSelf-hosted incl. 0.1 FTE ops overheadAssumptions: closed $3/1M · hosted $0.50/1M · H100 $2/hr

Closed API vs Self-Hosted crossover ≈ 40M tokens/day · Hosted Open API beats Self-Hosted until ~1B tokens/day

Self-hosting cost estimator (H100 pricing, Apr 2026)

The Licensing Minefield

Not all open models are equally free to use commercially. Licenses range from fully permissive (Apache 2.0, MIT) to restrictive custom licenses that can legally block your entire company. The table below summarizes the current landscape as of April 2026.

← More commercial freedomLess freedom →Apache 2.0 / MITFull commercial freedomGemma 4Qwen 3.5Mixtral 8x22BMistral Small 4DeepSeek V3.2Restricted CustomRead carefullyLlama 4 ScoutLlama 4 Maverick700M+ MAU → need Meta approvalEU companies: multimodal banProprietary APINo access to weightsGPT-5.4Claude Sonnet 4.6Gemini 3.1 Pro

"Open source" ≠ Apache 2.0 — always read the actual license before building production systems

ModelLicenseCommercial useKey restrictions
Gemma 4Apache 2.0Yes, unrestrictedNone — most permissive frontier-adjacent model as of Apr 2026
Qwen 3.5 (open variants)Apache 2.0Yes, unrestrictedQwen 3.6-Plus and above are proprietary; check the specific variant
Mistral Small 4 / Mixtral 8x22BApache 2.0Yes, unrestrictedNone
DeepSeek V3.2MITYes, unrestrictedNone
Llama 4 Scout / MaverickLlama 4 CommunityConditional700M+ MAU requires Meta approval; EU-domiciled companies/individuals: banned from all Llama 4 models (all are multimodal)
Mistral Large 3Mistral CommercialWith commercial licenseContact Mistral for terms; not Apache 2.0
GPT-5.4, Claude 4.x, Gemini 3.1ProprietaryAPI onlyNo access to weights; subject to provider ToS, pricing changes, rate limits, and service discontinuation
The Llama 4 EU ban — a critical gotcha

Llama 4's Community License explicitly prohibits use by individuals domiciled in or companies with a principal place of business in the European Union. This restriction applies to all Llama 4 models because all Llama 4 variants (Scout, Maverick) are multimodal. If your company is EU-domiciled, your engineers have EU addresses, or your principal place of business is in the EU — Llama 4 is legally off-limits without a special agreement from Meta. Use Gemma 4 or DeepSeek V3.2 instead.

Real project

A startup built their medical records summarizer on Llama 4 Maverick after validating quality. Three months before their EU launch, legal review flagged the EU geographic restriction in the Llama 4 Community License — the company's parent entity was registered in Germany. They migrated to Gemma 4 (Apache 2.0) in two weeks, but the unplanned migration cost one sprint and delayed the EU launch by three weeks.

  • Apache 2.0 is the gold standard: unrestricted use, modification, distribution, fine-tuning, and derivative works
  • Always read the actual license text — not a summary — before committing to a model in production
  • Derivative works (fine-tuned models) inherit the base model's license; verify before distributing a fine-tune
  • Some licenses restrict using model outputs as training data — verify before using open model outputs as synthetic data
  • License terms can change between model versions: Llama 3.x and Llama 4 have different licenses

Your Inference Stack

Choosing a model is only half the work. For self-hosted deployments, the inference engine and quantization strategy determine your actual cost and throughput. In 2026 there are two dominant open-source inference engines — vLLM and SGLang — and the choice can affect throughput by 20–30%.

EngineBest forH100 throughput edgeEcosystem
vLLMBroadest hardware support (TPUs, Trainium, Gaudi), OpenAI-compatible API, largest contributor baseBaselineDefault for most cloud API endpoints; 3× more contributors than SGLang
SGLangMulti-turn chat, structured JSON output (3× faster constrained decoding), RAG with shared prefixes+29% vs vLLM on H100Spun out as RadixArk (Jan 2026, ~$400M valuation); powers Grok 3, Cursor, LinkedIn, Azure endpoints
When to pick SGLang

SGLang's RadixAttention caches KV computations for shared prefixes. If your workload has repeated prefixes — multi-turn conversations, RAG over a shared document corpus, few-shot prompting with a fixed prefix — SGLang reuses the cached computation rather than recomputing it. The throughput advantage is most pronounced in these cases. For single-turn, random-input workloads, the gap between SGLang and vLLM narrows.

QuantizationMemory reductionQuality lossUse when
FP16 (baseline)NoneNoneHighest fidelity; use when accuracy is critical and GPU memory is not a constraint
INT8 / FP8 (AWQ, GPTQ)~2×< 2% on most tasksProduction default — sweet spot of quality vs cost; native FP8 on H100 = near FP16 quality with 2× throughput
INT4 (GPTQ, GGUF)~4×3–7% on most tasksHigh-volume, lower-accuracy tasks (classification, routing); or CPU/edge inference with GGUF
2-bit (AQLM, QuIP#)~8×5–15%Experimental; use only with extensive task-specific evaluation
Launch Llama 4 Scout on SGLang (H100, INT8)

Migration Playbook: API to Self-Hosted

Moving from a closed API to self-hosted (or hosted open model API) requires an eval-first, traffic-ramp approach. Cutting over all traffic in one shot is how teams discover quality regressions in production. The following order is non-negotiable.

1

Build an eval harness first

Before touching infrastructure, build a representative eval set: 200–500 prompts sampled from your actual production traffic, with expected outputs reviewed by humans. Measure: task success rate, error rate, and any domain-specific metric (e.g., extraction accuracy). This is your gate — every subsequent step is measured against this baseline.

2

Run shadow traffic against the candidate model

Route 5–10% of production traffic to the candidate model (hosted open API or self-hosted) without serving the response to users. Collect outputs, score them against your eval harness. Run for at least 2 weeks to capture variance in real queries. Target: candidate model scores within 3–5% of baseline on your eval set before proceeding.

3

Validate latency and error budgets

Open model APIs and self-hosted models have different latency characteristics than closed APIs. Measure P50, P90, and P99 latency under your load profile. Validate that your timeout budgets, retry logic, and circuit breakers work with the new endpoint. Fix these before live traffic hits them.

4

Ramp live traffic gradually

Use a feature flag or weighted routing to ramp: 5% → 20% → 50% → 100% over two weeks minimum. Monitor your eval metrics and error rate at each step. Automate a rollback trigger: if task success rate drops more than 5% from baseline, automatically roll back to the previous model.

5

Establish ongoing quality monitoring

Once fully migrated, schedule weekly eval runs on a random sample of production outputs. Open model providers can update models with behavioral changes, and self-hosted models can drift if fine-tuning is applied without re-evaluation. Set a quality alert threshold and page on degradation.

Do not skip the shadow traffic phase

The shadow traffic phase catches distribution shift between your eval set and real production traffic. Teams that skip it and go straight to 10% live traffic regularly discover edge cases — unusual input formats, languages, or adversarial prompts — that their eval set did not cover. Two weeks of shadow traffic is cheap compared to a visible quality regression.

Decision Framework

Four dimensions determine the right deployment tier for any given workload. Score your situation on each axis — if any single dimension strongly favors one option, it usually overrides the others.

Production GeneralGPT-5.4 · Gemini 3.1 ProBroad tasks, flexible budgetFrontier / AgenticOpus 4.7 · o3 · Sonnet 4.6Complex reasoning, coding agentsVolume / SimpleFlash-Lite · GPT-5.4 NanoHigh throughput, low per-call costEfficient QualityDeepSeek V3.2 · GPT-5.4 MiniHard tasks on a tight budgetSimple tasksComplex / ReasoningBudgetPremiumPrivacy/compliance required → self-host: Llama 4 or DeepSeek V3.2

Pick your quadrant, then tune down to cheaper model once quality is verified

DimensionFavors Closed APIFavors Hosted Open APIFavors Self-Hosted
Privacy / ComplianceNon-sensitive data; provider DPA is sufficientNon-sensitive; want open model quality at lower costPII, healthcare, financial, defense, legal; data cannot leave your infra
Daily Token Volume< 5M tokens/day5M–300M tokens/day> 300M tokens/day (with ops team)
Ops CapacityNo ML infra teamNo ML infra teamDedicated ML infra team (minimum 0.5 FTE)
Customization NeedPrompt engineering is sufficientPrompt engineering is sufficient; model selection flexibilityFine-tuning, custom decoding, domain adaptation, or model modification needed
The standard progression

Start with a closed API to validate product-market fit and establish your quality baseline. Once you have stable usage above 5M tokens/day, evaluate hosted open model APIs — the migration is usually a one-afternoon API endpoint swap. Only consider self-hosting when privacy requirements or volume above 300M tokens/day make the ops overhead worthwhile.

Best Practices

Best Practices

Do

  • Evaluate hosted open model APIs (Together, Fireworks, Groq) before committing to self-hosting — 3–10× cheaper with zero ops overhead
  • Start with closed APIs for the first version of any feature — validate the prompt and quality baseline before optimizing cost
  • Build an eval harness before any model migration — 200 production-sampled prompts with human-reviewed expected outputs
  • Read the full model license text before building production dependencies on any open model
  • Check for Llama 4 EU geographic restrictions if your company has EU domicile, EU-based principals, or EU users in scope for data processing
  • Budget self-hosting engineering time explicitly in TCO calculations — 0.5 FTE setup and 0.1 FTE ongoing at your fully loaded salary
  • Benchmark your specific workload against candidate models, not just public benchmarks — domain gaps are where the choice gets made
  • Build model-agnostic abstractions so you can swap inference providers without rewiring application logic
  • Ramp live traffic gradually over at least two weeks with automated quality gates and rollback triggers

Don’t

  • Don't self-host before your volume exceeds 300M tokens/day with ops overhead factored in — the break-even is much later than the GPU-cost-only math suggests
  • Don't assume 'open source' means Apache 2.0 — Llama 4, Gemma, and Mistral each have distinct restrictions
  • Don't use Llama 4 for products with EU-domiciled companies or principals without explicit legal review and a Meta agreement
  • Don't run A100s for new deployments — H100s deliver 2–3× better cost-per-token in 2026 at comparable or lower spot prices
  • Don't skip quantization analysis — INT8 (AWQ/GPTQ) halves GPU memory with <2% quality loss on most tasks
  • Don't cut over 100% of traffic to a new model in one shot — always ramp with eval gates and automatic rollback
  • Don't choose vLLM or SGLang without benchmarking your specific workload — the 29% throughput delta only shows up on workloads that benefit from RadixAttention
  • Don't ignore model deprecation risk with closed APIs — provider pricing and model availability change without notice on timelines you cannot control
  • Don't use open model outputs as fine-tuning data without checking the license — many licenses restrict using outputs to train competing models

Key Takeaways

  • The quality gap between top open and closed models is now single-digit percentage points on most tasks — the choice is primarily about ops, cost, and licensing, not capability.
  • Three deployment tiers exist: closed API, hosted open model API (zero ops, 3–10× cheaper), and self-hosted — most teams skip the middle tier and pay more than they need to.
  • Self-hosting beats closed APIs at ~40M tokens/day in GPU-cost-only terms, but beats hosted open model APIs only at ~1B tokens/day once ops overhead is included.
  • Llama 4 Community License bans EU-domiciled companies from using any Llama 4 model — the entire Llama 4 family is multimodal and subject to this restriction.
  • For self-hosting, the inference engine choice matters: SGLang delivers ~29% more throughput than vLLM on H100s for multi-turn and RAG workloads with shared prefixes.
  • Always migrate with shadow traffic, eval gates, and a staged ramp — cutting over 100% of traffic in one shot is how teams discover production quality regressions.

Video on this topic

Open vs closed AI models: the real trade-offs in 2026

instagram