Intermediate · 14 min

Open vs Closed Models

The false binary of 'pick one side' has been replaced by three deployment tiers in 2026: closed API, hosted open model API, and self-hosted. The quality gap has nearly closed on most tasks. The decision is now mostly about ops budget, volume, and licensing risk — not capability.

Quick Reference

→The quality gap between top open and top closed models is now single-digit percentage points on most benchmarks
→Three deployment tiers exist: closed API, hosted open model API (3–10× cheaper, zero ops), and self-hosted
→Self-hosting beats closed APIs at ~40M tokens/day on GPU cost alone; it beats hosted open APIs only at ~1B tokens/day once ops overhead is included
→Llama 4 Community License bans EU-domiciled companies from using any Llama 4 model (all are multimodal)
→Permissively licensed (Apache 2.0 or MIT) models with frontier-adjacent quality in 2026: Gemma 4, Qwen 3.5, DeepSeek V3.2 (MIT), Mistral Small 4
→SGLang delivers up to ~29% higher throughput than vLLM on H100s, mainly on workloads with shared prefixes; inference engine choice matters as much as model choice
→8-bit quantization (INT8 or FP8) halves GPU memory requirements with <2% quality loss on most standard tasks

In this article

1.Should You Even Choose?
2.The Quality Gap in 2026
3.Three Deployment Models
4.Self-Hosting Economics: Real Math
5.The Licensing Minefield
6.Your Inference Stack
7.Migration Playbook: API to Self-Hosted
8.Decision Framework
★Best Practices
✓Key Takeaways

Should You Even Choose?

The 'open vs closed' framing implies a binary decision. In practice, most production AI systems use 2–4 models. A frontier closed model handles the hardest reasoning. A mid-tier model (often closed or hosted open) handles the majority of traffic. A small specialized model handles high-volume classification or embeddings. The right question is not which side to pick — it is which model tier matches which workload.

▸Closed models: highest quality ceiling, zero infrastructure, highest cost per token; data leaves your infrastructure, so privacy rests on the provider's terms (DPA)
▸Hosted open model APIs (Together, Fireworks, Groq): open model quality at 3–10× lower cost than closed APIs, zero self-hosting ops — the most overlooked tier
▸Self-hosted open models: full control, data stays on your infrastructure, lowest cost at scale, but requires an ML infra team

The middle tier most teams skip

Hosted open model APIs give you open model pricing without a single GPU to manage. If you are paying more than $500/month in closed model API fees and do not have strict on-premises data requirements, you have almost certainly not evaluated this tier seriously enough.

The Quality Gap in 2026

The gap between the best open and best closed models has compressed significantly. As of April 2026, top open models match closed models on knowledge benchmarks and are within single-digit percentage points on most reasoning tasks. The gap is still real on the hardest tasks — complex multi-step reasoning, frontier coding challenges — but it is no longer categorical.

Task type	Gap (open vs closed)	Notes
Reading comprehension, summarization, extraction	Effectively zero	Standard RAG pipelines: open models fully competitive
Code generation (standard)	1–3 points	SWE-bench Verified: ~3pt gap between top open and top closed
Multi-step reasoning	3–8 points	Closed models maintain an edge; gap is shrinking each quarter
Instruction following	2–5 points	Best open models (Llama 4 Maverick, Qwen 3.5) close to parity
Domain-specific tasks (fine-tuned)	Open often wins	Fine-tuned open model on your data routinely beats generic closed model

Benchmarks measure the average case

General benchmark scores do not predict performance on your specific task. Run your own evaluation on a representative 200–500 prompt sample from your actual workload before drawing conclusions. A fine-tuned 70B open model on your domain data will often match or beat a frontier closed model on your specific task while costing 10× less.

Input cost ($/1M tokens) vs capability tier — most production tasks live in Mid-Range

Three Deployment Models

In 2026 there are three distinct deployment tiers, not two. The middle tier — hosted open model APIs — has matured into a first-class option that most teams underutilize.

Hosted Open API is the overlooked middle tier — open model pricing, zero ops burden

Tier	Examples	Cost (relative)	Ops burden	Best for
Closed API	GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro	$$$$	None	Highest-quality tasks, prototyping, when simplicity matters
Hosted Open API	Together AI, Fireworks AI, Groq	$$	None	Production workloads where cost matters but ops expertise is limited
Self-Hosted	Llama 4, Qwen 3.5, Gemma 4	$	Full infra team required	Privacy/compliance requirements, very high volume (>300M tok/day), maximum control

Default to hosted open model APIs for the first migration

If you are moving away from a closed API for cost reasons, try a hosted open model API first. You get 3–10× cost reduction with zero operational complexity. Self-hosting only makes sense once you have validated quality on the hosted tier AND your volume or privacy requirements justify the ops investment.

Self-Hosting Economics: Real Math

The economics of self-hosting depend on three variables: token volume, model throughput on your GPU, and the often-forgotten cost of engineering time. The table below assumes H100 80GB cloud spot instances at $2/hr (Apr 2026 market rate), a production cluster of at least 2× H100 for reliability, and 0.1 FTE of ongoing ops overhead:

Daily volume	Closed API/mo	Hosted Open API/mo	Self-Hosted/mo (incl. ops)	Decision
10M tokens/day	~$900	~$150	~$4,994 (2×H100 + ops)	Hosted open wins clearly
50M tokens/day	~$4,500	~$750	~$4,994 (2×H100 + ops)	Hosted open wins; self-host is still slightly above closed API once ops is included
300M tokens/day	~$27,000	~$4,500	~$8,750 (4×H100 + ops)	Self-host beats closed; hosted open is still cheaper
1B tokens/day	~$90,000	~$15,000	~$20,200 (10×H100 + ops)	Self-host far below closed; near the crossover with hosted open, which depends on GPU price and utilization

The ops overhead that breaks the math

The table above includes 0.1 FTE of ongoing ops cost (~$1,250/month at a $150k loaded salary). Initial setup costs more: plan for 0.5 FTE of engineering time to configure vLLM or SGLang, set up model serving, monitoring, and load balancing. At $150k/year, that is $6,250/month for the first 3 months. Most self-hosting ROI calculations that look compelling are forgetting this line.

Closed API vs Self-Hosted crossover ≈ 40M tokens/day · Hosted Open API beats Self-Hosted until ~1B tokens/day

Self-hosting cost estimator (H100 pricing, Apr 2026)
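As a rough sketch of the arithmetic behind the table, the estimator below combines GPU rental with ongoing ops time and compares the result with pay-per-token pricing. The $2/hr, 0.1 FTE, and $150k figures come from the text; the blended API prices ($3.00 and $0.50 per 1M tokens) are back-calculated from the table's monthly totals, and the 30-day month and all function names are assumptions made for illustration. This simple formula gives about $4,130/month for 2×H100, below the table's ~$4,994, so the table presumably includes costs that are not itemized here.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 24 * 30  # 30-day month, an assumption made for round numbers


@dataclass(frozen=True)
class SelfHostAssumptions:
    gpu_hourly_usd: float = 2.00        # H100 spot price quoted in the text
    ops_fte: float = 0.1                # ongoing ops share quoted in the text
    loaded_salary_usd: float = 150_000  # loaded annual salary quoted in the text


def self_hosted_monthly(num_gpus: int,
                        a: SelfHostAssumptions = SelfHostAssumptions()) -> float:
    """GPU rental plus ongoing ops time, in USD per month."""
    gpu_cost = num_gpus * a.gpu_hourly_usd * HOURS_PER_MONTH
    ops_cost = a.ops_fte * a.loaded_salary_usd / 12
    return gpu_cost + ops_cost


def api_monthly(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Pay-per-token API spend, in USD per month."""
    return tokens_per_day / 1e6 * usd_per_million_tokens * 30


def break_even_tokens_per_day(cluster_monthly_usd: float,
                              usd_per_million_tokens: float) -> float:
    """Daily volume at which API spend equals a fixed-size cluster's cost."""
    return cluster_monthly_usd / (usd_per_million_tokens * 30) * 1e6


if __name__ == "__main__":
    CLOSED = 3.00  # blended $/1M tokens implied by the table (assumption)
    cluster = self_hosted_monthly(num_gpus=2)
    print(f"2xH100 + ops: ${cluster:,.0f}/month")
    print(f"closed API at 50M tok/day: ${api_monthly(50e6, CLOSED):,.0f}/month")
    print(f"break-even vs closed: "
          f"{break_even_tokens_per_day(cluster, CLOSED) / 1e6:.0f}M tok/day")
```

The break-even helper treats the cluster as a fixed cost, so it only holds while that cluster can actually serve the volume; measure throughput for your model before trusting the number.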

The Licensing Minefield

Not all open models are equally free to use commercially. Licenses range from fully permissive (Apache 2.0, MIT) to restrictive custom licenses that can legally block your entire company. The table below summarizes the current landscape as of April 2026.

"Open source" ≠ Apache 2.0 — always read the actual license before building production systems

Model	License	Commercial use	Key restrictions
Gemma 4	Apache 2.0	Yes, unrestricted	None — most permissive frontier-adjacent model as of Apr 2026
Qwen 3.5 (open variants)	Apache 2.0	Yes, unrestricted	Qwen 3.6-Plus and above are proprietary; check the specific variant
Mistral Small 4 / Mixtral 8x22B	Apache 2.0	Yes, unrestricted	None
DeepSeek V3.2	MIT	Yes, unrestricted	None
Llama 4 Scout / Maverick	Llama 4 Community	Conditional	700M+ MAU requires Meta approval; EU-domiciled companies/individuals: banned from all Llama 4 models (all are multimodal)
Mistral Large 3	Mistral Commercial	With commercial license	Contact Mistral for terms; not Apache 2.0
GPT-5.4, Claude 4.x, Gemini 3.1	Proprietary	API only	No access to weights; subject to provider ToS, pricing changes, rate limits, and service discontinuation

The Llama 4 EU ban — a critical gotcha

Llama 4's Community License withholds its rights grant from individuals domiciled in, and companies with a principal place of business in, the European Union. The restriction is written for multimodal models, and it reaches every Llama 4 model because all variants (Scout, Maverick) are multimodal. If your company is EU-domiciled or has its principal place of business in the EU, Llama 4 is legally off-limits without a special agreement from Meta. The license carves out end users of products built on the models, so have counsel confirm which side of that line you fall on. Otherwise, use Gemma 4 or DeepSeek V3.2 instead.

Real project

A startup built their medical records summarizer on Llama 4 Maverick after validating quality. Three months before their EU launch, legal review flagged the EU geographic restriction in the Llama 4 Community License — the company's parent entity was registered in Germany. They migrated to Gemma 4 (Apache 2.0) in two weeks, but the unplanned migration cost one sprint and delayed the EU launch by three weeks.

▸Apache 2.0 is the gold standard: it permits commercial use, modification, distribution, fine-tuning, and derivative works, subject only to attribution and notice requirements
▸Always read the actual license text — not a summary — before committing to a model in production
▸Fine-tuned models are derivative works and generally stay bound by the base model's license terms; verify before distributing a fine-tune
▸Some licenses restrict using model outputs as training data — verify before using open model outputs as synthetic data
▸License terms can change between model versions: Llama 3.x and Llama 4 have different licenses
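One way to keep these checks from depending on memory is to record license facts next to your model configuration and fail a build when a blocked combination appears. The sketch below encodes a few rows of the table above; the model keys, the `LicenseInfo` fields, and the `license_blockers` helper are hypothetical names, and the result is a prompt for legal review rather than a substitute for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseInfo:
    license_name: str
    permissive: bool      # Apache 2.0 / MIT style, per the table above
    eu_restricted: bool   # geographic restriction noted in the table
    note: str = ""


# Illustrative entries transcribed from the table; keys are made-up identifiers.
MODEL_LICENSES = {
    "gemma-4": LicenseInfo("Apache 2.0", True, False),
    "deepseek-v3.2": LicenseInfo("MIT", True, False),
    "llama-4-maverick": LicenseInfo(
        "Llama 4 Community", False, True, "700M+ MAU requires Meta approval"),
    "mistral-large-3": LicenseInfo(
        "Mistral Commercial", False, False, "requires a commercial license"),
}


def license_blockers(model: str, company_in_eu: bool) -> list[str]:
    """Return human-readable reasons to stop and involve legal review."""
    info = MODEL_LICENSES.get(model)
    if info is None:
        return [f"{model}: license not recorded, review before use"]
    blockers = []
    if info.eu_restricted and company_in_eu:
        blockers.append(f"{model}: {info.license_name} restricts EU-based licensees")
    if not info.permissive:
        blockers.append(f"{model}: custom license ({info.license_name}); {info.note}")
    return blockers


if __name__ == "__main__":
    for name in ("gemma-4", "llama-4-maverick"):
        print(name, license_blockers(name, company_in_eu=True))
```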

Your Inference Stack

Choosing a model is only half the work. For self-hosted deployments, the inference engine and quantization strategy determine your actual cost and throughput. In 2026 there are two dominant open-source inference engines — vLLM and SGLang — and the choice can affect throughput by 20–30%.

Engine	Best for	H100 throughput edge	Ecosystem
vLLM	Broadest hardware support (TPUs, Trainium, Gaudi), OpenAI-compatible API, largest contributor base	Baseline	Default for most cloud API endpoints; 3× more contributors than SGLang
SGLang	Multi-turn chat, structured JSON output (3× faster constrained decoding), RAG with shared prefixes	+29% vs vLLM on H100	Spun out as RadixArk (Jan 2026, ~$400M valuation); powers Grok 3, Cursor, LinkedIn, Azure endpoints

When to pick SGLang

SGLang's RadixAttention caches KV computations for shared prefixes. If your workload has repeated prefixes — multi-turn conversations, RAG over a shared document corpus, few-shot prompting with a fixed prefix — SGLang reuses the cached computation rather than recomputing it. The throughput advantage is most pronounced in these cases. For single-turn, random-input workloads, the gap between SGLang and vLLM narrows.
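To make the prefix-reuse idea concrete, here is a toy token trie that counts how many prompt tokens still need prefill once earlier requests have been seen. It is only an illustration of the principle, not SGLang's implementation: the real system caches KV tensors rather than counters, and the word-level "tokens" and the `PrefixCache` class are stand-ins invented for this sketch.

```python
class PrefixCache:
    """Toy trie over token sequences that tracks which prefixes were seen."""

    def __init__(self):
        self.root = {}

    def prefill(self, tokens):
        """Insert a prompt and return how many tokens had to be computed."""
        node = self.root
        reused = 0
        for tok in tokens:
            if tok in node:
                reused += 1      # this position was already computed
            else:
                node[tok] = {}   # first time this continuation is seen
            node = node[tok]
        return len(tokens) - reused


if __name__ == "__main__":
    system_prompt = "You are a support bot .".split()
    requests = [
        system_prompt + ["refund", "status"],
        system_prompt + ["reset", "password"],
        system_prompt + ["refund", "policy"],
    ]
    cache = PrefixCache()
    computed = [cache.prefill(r) for r in requests]
    print("tokens computed per request:", computed)
    print("without reuse:", sum(len(r) for r in requests))
```

The saving grows with the length of the shared prefix relative to the unique suffix, which is why single-turn workloads with random inputs see little benefit.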

Quantization	Memory reduction	Quality loss	Use when
FP16 (baseline)	None	None	Highest fidelity; use when accuracy is critical and GPU memory is not a constraint
INT8 / FP8 (GPTQ 8-bit, native FP8)	~2×	< 2% on most tasks	Production default: sweet spot of quality vs cost; native FP8 on H100 = near FP16 quality with 2× throughput
INT4 (AWQ, GPTQ, GGUF)	~4×	3–7% on most tasks	High-volume, lower-accuracy tasks (classification, routing); or CPU/edge inference with GGUF
2-bit (AQLM, QuIP#)	~8×	5–15%	Experimental; use only with extensive task-specific evaluation
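The memory-reduction column follows directly from bits per weight, which a few lines of arithmetic can show. The sketch below covers weights only; KV cache, activations, and quantization metadata are excluded, and the 80% usable-memory headroom in `min_gpus` is an assumption for illustration, not a measured figure.

```python
import math

# Bits stored per weight at each precision in the table above.
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "fp8": 8, "int4": 4, "2bit": 2}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[precision] / 8


def min_gpus(params_billion: float, precision: str,
             gpu_gb: float = 80.0, usable_fraction: float = 0.8) -> int:
    """GPUs needed for the weights alone, leaving headroom for everything else."""
    usable = gpu_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billion, precision) / usable)


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        gb = weight_memory_gb(70, precision)
        print(f"70B @ {precision}: {gb:.0f} GB weights, "
              f"{min_gpus(70, precision)}x 80GB GPU(s)")
```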

Launch Llama 4 Scout on SGLang (H100, INT8)

Migration Playbook: API to Self-Hosted

Moving from a closed API to self-hosted (or hosted open model API) requires an eval-first, traffic-ramp approach. Cutting over all traffic in one shot is how teams discover quality regressions in production. The following order is non-negotiable.

Build an eval harness first

Before touching infrastructure, build a representative eval set: 200–500 prompts sampled from your actual production traffic, with expected outputs reviewed by humans. Measure: task success rate, error rate, and any domain-specific metric (e.g., extraction accuracy). This is your gate — every subsequent step is measured against this baseline.
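A minimal harness can be as small as a list of cases, a model callable, and a scoring function. The sketch below assumes exact-match scoring after whitespace and case normalization, which suits extraction-style tasks but not free-form generation; `EvalCase`, `run_eval`, and the stand-in model are hypothetical names, and in practice the callable would wrap your real API client.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # human-reviewed reference output


def exact_match(output: str, expected: str) -> bool:
    """Compare after collapsing whitespace and ignoring case."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(output) == normalize(expected)


def run_eval(cases: Sequence[EvalCase],
             model: Callable[[str], str],
             is_correct: Callable[[str, str], bool] = exact_match) -> float:
    """Return the task success rate of `model` over `cases`."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(is_correct(model(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


if __name__ == "__main__":
    cases = [
        EvalCase("Invoice total: $40. Extract the total.", "$40"),
        EvalCase("Invoice total: $15. Extract the total.", "$15"),
    ]
    # Stand-in model that always answers "$40".
    print("success rate:", run_eval(cases, lambda prompt: "$40"))
```

Running the same `run_eval` call against the current model and each candidate gives the baseline and the comparison numbers used in the later steps.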

Run shadow traffic against the candidate model

Mirror 5–10% of production traffic to the candidate model (hosted open API or self-hosted): users keep receiving the current model's response, and the candidate's output is only logged. Score the collected outputs against your eval harness. Run for at least 2 weeks to capture variance in real queries. Target: candidate model scores within 3–5% of baseline on your eval set before proceeding.
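One simple way to pick the shadow sample is to hash a stable request ID into buckets, so the same request always gets the same decision and the sample rate is easy to change. The sketch below is an assumption about how this could be wired, not a prescribed design: `in_shadow_sample` and `handle` are hypothetical names, and the candidate is called inline here only to keep the example short.

```python
import hashlib


def in_shadow_sample(request_id: str, percent: float) -> bool:
    """Deterministically select roughly `percent`% of request IDs."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def handle(request_id, prompt, primary, candidate, shadow_log, percent=5.0):
    """Serve the primary model; log the candidate's output for sampled requests."""
    response = primary(prompt)
    if in_shadow_sample(request_id, percent):
        # The user never sees this output; it is only stored for scoring.
        shadow_log.append((request_id, prompt, candidate(prompt)))
    return response


if __name__ == "__main__":
    log = []
    for i in range(1000):
        handle(f"req-{i}", "hello", lambda p: "primary", lambda p: "candidate", log)
    print(f"shadowed {len(log)} of 1000 requests")
```

In a real service the candidate call belongs off the request path (a queue or background task), so a slow or failing candidate cannot affect user-facing latency.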

Validate latency and error budgets

Open model APIs and self-hosted models have different latency characteristics than closed APIs. Measure P50, P90, and P99 latency under your load profile. Validate that your timeout budgets, retry logic, and circuit breakers work with the new endpoint. Fix these before live traffic hits them.
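The percentile check itself is small enough to sketch. The version below uses the nearest-rank definition, which is one of several common conventions, and the budget values in the usage example are placeholders rather than recommendations; `percentile`, `latency_report`, and `violations` are hypothetical helper names.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms):
    """Summarize latencies at the three percentiles named in the text."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 99)}


def violations(report, budget_ms):
    """Return the percentile names whose measured latency exceeds its budget."""
    return [name for name, limit in budget_ms.items() if report[name] > limit]


if __name__ == "__main__":
    samples = [120, 135, 150, 180, 210, 260, 340, 520, 900, 2400]  # made-up data
    report = latency_report(samples)
    print(report)
    print("over budget:", violations(report, {"p50": 300, "p90": 800, "p99": 2000}))
```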

Ramp live traffic gradually

Use a feature flag or weighted routing to ramp: 5% → 20% → 50% → 100% over two weeks minimum. Monitor your eval metrics and error rate at each step. Automate a rollback trigger: if task success rate drops more than 5% from baseline, automatically roll back to the previous model.
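The ramp and the rollback trigger can be expressed as one small decision function that runs after each evaluation window. The sketch below reads "drops more than 5%" as a relative drop from baseline, which is an interpretation; switch to absolute percentage points if that is what your team means. `RAMP_STEPS` mirrors the schedule above and `next_traffic_share` is a hypothetical name.

```python
RAMP_STEPS = [5, 20, 50, 100]  # percent of live traffic, from the schedule above


def next_traffic_share(current: int, baseline_success: float,
                       observed_success: float,
                       max_relative_drop: float = 0.05) -> int:
    """Return the next traffic percentage, or 0 to roll back."""
    if baseline_success > 0:
        drop = (baseline_success - observed_success) / baseline_success
        if drop > max_relative_drop:
            return 0  # quality gate failed: send all traffic to the previous model
    later_steps = [step for step in RAMP_STEPS if step > current]
    return later_steps[0] if later_steps else current


if __name__ == "__main__":
    share = 0
    for observed in (0.90, 0.89, 0.91, 0.90):  # made-up success rates per window
        share = next_traffic_share(share, baseline_success=0.90,
                                   observed_success=observed)
        print(f"observed={observed:.2f} -> route {share}% to the new model")
```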

Establish ongoing quality monitoring

Once fully migrated, schedule weekly eval runs on a random sample of production outputs. Open model providers can update models with behavioral changes, and self-hosted models can drift if fine-tuning is applied without re-evaluation. Set a quality alert threshold and page on degradation.

Do not skip the shadow traffic phase

The shadow traffic phase catches distribution shift between your eval set and real production traffic. Teams that skip it and go straight to 10% live traffic regularly discover edge cases — unusual input formats, languages, or adversarial prompts — that their eval set did not cover. Two weeks of shadow traffic is cheap compared to a visible quality regression.

Decision Framework

Four dimensions determine the right deployment tier for any given workload. Score your situation on each axis — if any single dimension strongly favors one option, it usually overrides the others.

Pick your quadrant, then step down to a cheaper model once quality is verified

Dimension	Favors Closed API	Favors Hosted Open API	Favors Self-Hosted
Privacy / Compliance	Non-sensitive data; provider DPA is sufficient	Non-sensitive; want open model quality at lower cost	PII, healthcare, financial, defense, legal; data cannot leave your infra
Daily Token Volume	< 5M tokens/day	5M–300M tokens/day	> 300M tokens/day (with ops team)
Ops Capacity	No ML infra team	No ML infra team	Dedicated ML infra team (minimum 0.5 FTE)
Customization Need	Prompt engineering is sufficient	Prompt engineering is sufficient; model selection flexibility	Fine-tuning, custom decoding, domain adaptation, or model modification needed
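The table can be read as a short rule cascade in which a hard constraint wins first. The function below is one possible encoding under simplifying assumptions: privacy is treated as an absolute override, self-hosting is only suggested when an infra team exists, and the 5M and 300M thresholds are taken from the table. The function and tier names are hypothetical.

```python
def recommend_tier(tokens_per_day: float,
                   data_must_stay_on_prem: bool,
                   has_ml_infra_team: bool,
                   needs_model_customization: bool) -> str:
    """Map the four dimensions in the table to a starting deployment tier."""
    # Privacy and compliance override every other dimension.
    if data_must_stay_on_prem:
        return "self-hosted"
    # Volume or customization justify self-hosting only with a team to run it.
    if has_ml_infra_team and (tokens_per_day > 300e6 or needs_model_customization):
        return "self-hosted"
    if tokens_per_day >= 5e6:
        return "hosted-open-api"
    return "closed-api"


if __name__ == "__main__":
    scenarios = [
        (1e6, False, False, False),
        (50e6, False, False, False),
        (500e6, False, True, False),
        (2e6, True, False, False),
    ]
    for s in scenarios:
        print(s, "->", recommend_tier(*s))
```

Treat the output as a starting point for the evaluation, not a verdict: a workload that lands on "self-hosted" for privacy reasons without an infra team still has to solve the staffing problem.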

The standard progression

Start with a closed API to validate product-market fit and establish your quality baseline. Once you have stable usage above 5M tokens/day, evaluate hosted open model APIs. The code change is usually a one-afternoon API endpoint swap, although validating quality takes longer. Only consider self-hosting when privacy requirements or volume above 300M tokens/day make the ops overhead worthwhile.

Best Practices

✓Evaluate hosted open model APIs (Together, Fireworks, Groq) before committing to self-hosting — 3–10× cheaper with zero ops overhead
✓Start with closed APIs for the first version of any feature — validate the prompt and quality baseline before optimizing cost
✓Build an eval harness before any model migration: 200–500 production-sampled prompts with human-reviewed expected outputs
✓Read the full model license text before building production dependencies on any open model
✓Check for Llama 4 EU geographic restrictions if your company is EU-domiciled or has its principal place of business in the EU; have counsel confirm how the license treats EU end users
✓Budget self-hosting engineering time explicitly in TCO calculations — 0.5 FTE setup and 0.1 FTE ongoing at your fully loaded salary
✓Benchmark your specific workload against candidate models, not just public benchmarks — domain gaps are where the choice gets made
✓Build model-agnostic abstractions so you can swap inference providers without rewiring application logic
✓Ramp live traffic gradually over at least two weeks with automated quality gates and rollback triggers
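The model-agnostic abstraction recommended above can be as thin as a single interface that application code depends on, with each provider hidden behind it. The sketch below uses a structural `Protocol` and stub backends in place of real clients; `TextModel`, `StubBackend`, and `ModelRouter` are hypothetical names, and a real backend would wrap whichever SDK or HTTP client you use.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class StubBackend:
    """Stand-in for a provider client; tags output so the active backend is visible."""

    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class ModelRouter:
    """Holds named backends and forwards calls to the active one."""

    def __init__(self, backends: dict[str, TextModel], active: str):
        self.backends = backends
        self.switch(active)

    def switch(self, name: str) -> None:
        if name not in self.backends:
            raise KeyError(f"unknown backend: {name}")
        self.active = name

    def complete(self, prompt: str) -> str:
        return self.backends[self.active].complete(prompt)


if __name__ == "__main__":
    router = ModelRouter(
        {"closed": StubBackend("closed"), "hosted-open": StubBackend("hosted-open")},
        active="closed",
    )
    print(router.complete("hi"))
    router.switch("hosted-open")  # a config change, not an application rewrite
    print(router.complete("hi"))
```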

Don’t

✗Don't self-host for cost reasons alone before your volume exceeds 300M tokens/day with ops overhead factored in; the break-even comes much later than GPU-cost-only math suggests
✗Don't assume 'open source' means Apache 2.0: Llama 4 and Mistral Large 3 ship under custom licenses with distinct restrictions, and terms can differ between versions of the same model family
✗Don't use Llama 4 if your company is EU-domiciled or has its principal place of business in the EU without explicit legal review and a Meta agreement
✗Don't run A100s for new deployments — H100s deliver 2–3× better cost-per-token in 2026 at comparable or lower spot prices
✗Don't skip quantization analysis: 8-bit quantization (INT8 or FP8) halves GPU memory with <2% quality loss on most tasks
✗Don't cut over 100% of traffic to a new model in one shot — always ramp with eval gates and automatic rollback
✗Don't choose vLLM or SGLang without benchmarking your specific workload — the 29% throughput delta only shows up on workloads that benefit from RadixAttention
✗Don't ignore model deprecation risk with closed APIs — provider pricing and model availability change without notice on timelines you cannot control
✗Don't use open model outputs as fine-tuning data without checking the license — many licenses restrict using outputs to train competing models

Key Takeaways

✓The quality gap between top open and closed models is now single-digit percentage points on most tasks — the choice is primarily about ops, cost, and licensing, not capability.
✓Three deployment tiers exist: closed API, hosted open model API (zero ops, 3–10× cheaper), and self-hosted — most teams skip the middle tier and pay more than they need to.
✓Self-hosting beats closed APIs at ~40M tokens/day in GPU-cost-only terms, but beats hosted open model APIs only at ~1B tokens/day once ops overhead is included.
✓Llama 4 Community License bans EU-domiciled companies from using any Llama 4 model — the entire Llama 4 family is multimodal and subject to this restriction.
✓For self-hosting, the inference engine choice matters: SGLang delivers ~29% more throughput than vLLM on H100s for multi-turn and RAG workloads with shared prefixes.
✓Always migrate with shadow traffic, eval gates, and a staged ramp — cutting over 100% of traffic in one shot is how teams discover production quality regressions.
