Open vs Closed Models
The false binary of 'pick one side' has been replaced by three deployment tiers in 2026: closed API, hosted open model API, and self-hosted. The quality gap has nearly closed on most tasks. The decision is now mostly about ops budget, volume, and licensing risk — not capability.
Quick Reference
- The quality gap between top open and top closed models is now single-digit percentage points on most benchmarks
- Three deployment tiers exist: closed API, hosted open model API (3–10× cheaper, zero ops), and self-hosted
- Self-hosting undercuts closed APIs from ~40M tokens/day on GPU cost alone (roughly 55M once ops overhead is included) and only approaches hosted open API pricing near ~1B tokens/day
- Llama 4 Community License bans EU-domiciled companies from using any Llama 4 model (all are multimodal)
- Permissively licensed (Apache 2.0 or MIT) models with frontier-adjacent quality in 2026: Gemma 4, Qwen 3.5, DeepSeek V3.2, Mistral Small 4
- SGLang delivers ~29% higher throughput than vLLM on H100s for workloads with shared prefixes; inference engine choice matters as much as model choice
- INT8 quantization (AWQ/GPTQ) halves GPU memory requirements with <2% quality loss on most standard tasks
Should You Even Choose?
The 'open vs closed' framing implies a binary decision. In practice, most production AI systems use 2–4 models. A frontier closed model handles the hardest reasoning. A mid-tier model (often closed or hosted open) handles the majority of traffic. A small specialized model handles high-volume classification or embeddings. The right question is not which side to pick — it is which model tier matches which workload.
- Closed models: highest quality ceiling, zero infrastructure, highest cost per token; data leaves your infrastructure, so privacy rests on the provider's terms
- Hosted open model APIs (Together, Fireworks, Groq): open model quality at 3–10× lower cost than closed APIs, zero self-hosting ops; the most overlooked tier
- Self-hosted open models: full control, data stays on your infrastructure, lowest cost at scale, but requires an ML infra team
Hosted open model APIs give you open model pricing without a single GPU to manage. If you are paying more than $500/month in closed model API fees and do not have strict on-premises data requirements, you have almost certainly not evaluated this tier seriously enough.
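The multi-model pattern described above can be sketched as a small routing table. This is a minimal illustration, not a prescribed architecture: the workload labels, the `Tier` names, and the choice to place classification and embeddings on the self-hosted tier are all assumptions made for the example.

```python
from enum import Enum


class Tier(Enum):
    CLOSED_API = "closed_api"
    HOSTED_OPEN_API = "hosted_open_api"
    SELF_HOSTED = "self_hosted"


# Hypothetical workload labels. The mapping mirrors the pattern in the text
# and is an assumption, not a recommendation for any specific product.
ROUTES = {
    "hard_reasoning": Tier.CLOSED_API,
    "general_chat": Tier.HOSTED_OPEN_API,
    "classification": Tier.SELF_HOSTED,
    "embeddings": Tier.SELF_HOSTED,
}


def pick_tier(workload: str, requires_on_prem: bool = False) -> Tier:
    """Return the deployment tier for a workload label."""
    if requires_on_prem:
        # Strict on-premises data requirements override the cost-based default.
        return Tier.SELF_HOSTED
    # Unknown workloads fall back to the mid tier that carries most traffic.
    return ROUTES.get(workload, Tier.HOSTED_OPEN_API)
```

Keeping the mapping in data rather than in branching code makes it cheap to move a workload to a different tier once an evaluation justifies it.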
The Quality Gap in 2026
The gap between the best open and best closed models has compressed significantly. As of April 2026, top open models match closed models on knowledge benchmarks and are within single-digit percentage points on most reasoning tasks. The gap is still real on the hardest tasks — complex multi-step reasoning, frontier coding challenges — but it is no longer categorical.
| Task type | Gap (open vs closed) | Notes |
|---|---|---|
| Reading comprehension, summarization, extraction | Effectively zero | Standard RAG pipelines: open models fully competitive |
| Code generation (standard) | 1–3 points | SWE-bench Verified: ~3pt gap between top open and top closed |
| Multi-step reasoning | 3–8 points | Closed models maintain an edge; gap is shrinking each quarter |
| Instruction following | 2–5 points | Best open models (Llama 4 Maverick, Qwen 3.5) close to parity |
| Domain-specific tasks (fine-tuned) | Open often wins | Fine-tuned open model on your data routinely beats generic closed model |
General benchmark scores do not predict performance on your specific task. Run your own evaluation on a representative 200–500 prompt sample from your actual workload before drawing conclusions. A fine-tuned 70B open model on your domain data will often match or beat a frontier closed model on your specific task while costing 10× less.
Input cost ($/1M tokens) vs capability tier — most production tasks live in Mid-Range
Three Deployment Models
In 2026 there are three distinct deployment tiers, not two. The middle tier — hosted open model APIs — has matured into a first-class option that most teams underutilize.
Hosted Open API is the overlooked middle tier — open model pricing, zero ops burden
| Tier | Examples | Cost (relative) | Ops burden | Best for |
|---|---|---|---|---|
| Closed API | GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro | $$$$ | None | Highest-quality tasks, prototyping, when simplicity matters |
| Hosted Open API | Together AI, Fireworks AI, Groq | $$ | None | Production workloads where cost matters but ops expertise is limited |
| Self-Hosted | Llama 4, Qwen 3.5, Gemma 4 | $ | Full infra team required | Privacy/compliance requirements, very high volume (>300M tok/day), maximum control |
If you are moving away from a closed API for cost reasons, try a hosted open model API first. You get 3–10× cost reduction with zero operational complexity. Self-hosting only makes sense once you have validated quality on the hosted tier AND your volume or privacy requirements justify the ops investment.
Self-Hosting Economics: Real Math
The economics of self-hosting depend on three variables: token volume, model throughput on your GPU, and the often-forgotten cost of engineering time. The table below assumes H100 80GB cloud spot pricing at $2/hr (Apr 2026 market rate), a production cluster of at least 2× H100 for reliability, and 0.1 FTE of ongoing ops overhead.
| Daily volume | Closed API/mo | Hosted Open API/mo | Self-Hosted/mo (incl. ops) | Decision |
|---|---|---|---|---|
| 10M tokens/day | ~$900 | ~$150 | ~$4,994 (2×H100 + ops) | Hosted open wins clearly |
| 50M tokens/day | ~$4,500 | ~$750 | ~$4,994 (2×H100 + ops) | Hosted open wins; self-host is ahead of closed API on GPU cost alone, not once ops is included |
| 300M tokens/day | ~$27,000 | ~$4,500 | ~$8,750 (4×H100 + ops) | Self-host beats closed; hosted open is still cheaper |
| 1B tokens/day | ~$90,000 | ~$15,000 | ~$20,200 (10×H100 + ops) | Self-host far below closed and closing in on hosted open |
The table above includes 0.1 FTE of ongoing ops cost (~$1,250/month at a $150k loaded salary). Initial setup costs more: plan for 0.5 FTE of engineering time to configure vLLM or SGLang, set up model serving, monitoring, and load balancing. At $150k/year, that is $6,250/month for the first 3 months. Most self-hosting ROI calculations that look compelling are forgetting this line.
Closed API vs Self-Hosted crossover ≈ 40M tokens/day (GPU cost only) · Hosted Open API stays cheaper than Self-Hosted up to ~1B tokens/day
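The arithmetic behind the table can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the blended per-token prices ($3.00 and $0.50 per million tokens) are back-calculated from the table's 10M tokens/day row, a month is taken as 30 days of 24 hours, and all function names are illustrative.

```python
HOURS_PER_MONTH = 720  # assumption: 24 h x 30 days
DAYS_PER_MONTH = 30


def api_monthly_cost(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Monthly bill for a pay-per-token API."""
    return tokens_per_day / 1e6 * usd_per_million_tokens * DAYS_PER_MONTH


def self_hosted_monthly_cost(
    gpus: int,
    usd_per_gpu_hour: float,
    ops_fte: float = 0.1,
    loaded_salary_usd: float = 150_000,
) -> float:
    """GPU rental plus the ongoing engineering share."""
    gpu_cost = gpus * usd_per_gpu_hour * HOURS_PER_MONTH
    ops_cost = ops_fte * loaded_salary_usd / 12
    return gpu_cost + ops_cost


def break_even_tokens_per_day(self_hosted_usd: float, usd_per_million_tokens: float) -> float:
    """Daily volume at which an API bill equals a fixed self-hosted bill."""
    return self_hosted_usd / (usd_per_million_tokens * DAYS_PER_MONTH) * 1e6


closed = api_monthly_cost(10e6, 3.00)           # closed API at 10M tokens/day
hosted = api_monthly_cost(10e6, 0.50)           # hosted open API at 10M tokens/day
cluster = self_hosted_monthly_cost(2, 2.00)     # 2x H100 at the stated $2/hr
```

With these inputs, two H100s at $2/hr come to about $4,130 per month including ops. The table's $4,994 corresponds to roughly $2.60 per GPU-hour, which is why the hourly rate is a parameter here. At $4,994 per month the break-even against a $3.00 per million token API is about 55M tokens/day; on GPU cost alone ($3,744) it is about 42M tokens/day, which matches the ~40M figure quoted above.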
The Licensing Minefield
Not all open models are equally free to use commercially. Licenses range from fully permissive (Apache 2.0, MIT) to restrictive custom licenses that can legally block your entire company. The table below summarizes the current landscape as of April 2026.
"Open source" ≠ Apache 2.0 — always read the actual license before building production systems
| Model | License | Commercial use | Key restrictions |
|---|---|---|---|
| Gemma 4 | Apache 2.0 | Yes, unrestricted | None — most permissive frontier-adjacent model as of Apr 2026 |
| Qwen 3.5 (open variants) | Apache 2.0 | Yes, unrestricted | Qwen 3.6-Plus and above are proprietary; check the specific variant |
| Mistral Small 4 / Mixtral 8x22B | Apache 2.0 | Yes, unrestricted | None |
| DeepSeek V3.2 | MIT | Yes, unrestricted | None |
| Llama 4 Scout / Maverick | Llama 4 Community | Conditional | 700M+ MAU requires Meta approval; EU-domiciled companies/individuals: banned from all Llama 4 models (all are multimodal) |
| Mistral Large 3 | Mistral Commercial | With commercial license | Contact Mistral for terms; not Apache 2.0 |
| GPT-5.4, Claude 4.x, Gemini 3.1 | Proprietary | API only | No access to weights; subject to provider ToS, pricing changes, rate limits, and service discontinuation |
Llama 4's Community License explicitly prohibits use by individuals domiciled in or companies with a principal place of business in the European Union. This restriction applies to all Llama 4 models because all Llama 4 variants (Scout, Maverick) are multimodal. If your company is EU-domiciled, your engineers have EU addresses, or your principal place of business is in the EU — Llama 4 is legally off-limits without a special agreement from Meta. Use Gemma 4 or DeepSeek V3.2 instead.
A startup built their medical records summarizer on Llama 4 Maverick after validating quality. Three months before their EU launch, legal review flagged the EU geographic restriction in the Llama 4 Community License — the company's parent entity was registered in Germany. They migrated to Gemma 4 (Apache 2.0) in two weeks, but the unplanned migration cost one sprint and delayed the EU launch by three weeks.
- Apache 2.0 is the gold standard: unrestricted use, modification, distribution, fine-tuning, and derivative works
- Always read the actual license text, not a summary, before committing to a model in production
- Derivative works (fine-tuned models) inherit the base model's license; verify before distributing a fine-tune
- Some licenses restrict using model outputs as training data; verify before using open model outputs as synthetic data
- License terms can change between model versions: Llama 3.x and Llama 4 have different licenses
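One way to make the checklist above hard to skip is a small license registry that a deployment script consults before a model goes to production. The sketch below is an assumption-heavy illustration: the model identifiers are hypothetical, the entries are copied from the table in this section, and the registry is a prompt for legal review rather than a substitute for reading the license text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseInfo:
    license_name: str
    needs_legal_review: bool
    note: str = ""


# Entries copied from the table above. Model IDs are hypothetical keys.
# Confirm every entry against the actual license text before relying on it.
LICENSES = {
    "gemma-4": LicenseInfo("Apache 2.0", False),
    "deepseek-v3.2": LicenseInfo("MIT", False),
    "mistral-small-4": LicenseInfo("Apache 2.0", False),
    "llama-4-maverick": LicenseInfo(
        "Llama 4 Community", True, "700M+ MAU clause; EU domicile restriction"
    ),
    "mistral-large-3": LicenseInfo(
        "Mistral Commercial", True, "commercial license required"
    ),
}


def check_model(model_id: str) -> LicenseInfo:
    """Fail closed: a model missing from the registry always needs review."""
    return LICENSES.get(model_id, LicenseInfo("unknown", True, "not in registry"))
```

Failing closed on unknown models matters more than the entries themselves, since license terms can change between versions of the same family.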
Your Inference Stack
Choosing a model is only half the work. For self-hosted deployments, the inference engine and quantization strategy determine your actual cost and throughput. In 2026 there are two dominant open-source inference engines — vLLM and SGLang — and the choice can affect throughput by 20–30%.
| Engine | Best for | H100 throughput edge | Ecosystem |
|---|---|---|---|
| vLLM | Broadest hardware support (TPUs, Trainium, Gaudi), OpenAI-compatible API, largest contributor base | Baseline | Default for most cloud API endpoints; 3× more contributors than SGLang |
| SGLang | Multi-turn chat, structured JSON output (3× faster constrained decoding), RAG with shared prefixes | +29% vs vLLM on H100 | Spun out as RadixArk (Jan 2026, ~$400M valuation); powers Grok 3, Cursor, LinkedIn, Azure endpoints |
SGLang's RadixAttention caches KV computations for shared prefixes. If your workload has repeated prefixes — multi-turn conversations, RAG over a shared document corpus, few-shot prompting with a fixed prefix — SGLang reuses the cached computation rather than recomputing it. The throughput advantage is most pronounced in these cases. For single-turn, random-input workloads, the gap between SGLang and vLLM narrows.
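To estimate whether a workload would benefit from prefix caching, it helps to measure how many prompt tokens repeat a prefix seen earlier. The toy below illustrates the idea with a plain trie over tokens. It is not SGLang's implementation: it ignores eviction, memory limits, and the actual KV cache layout, and the function names are made up for this example.

```python
def cached_prefix_len(trie: dict, tokens: list[str]) -> int:
    """Length of the longest prefix of `tokens` already stored in the trie."""
    node, depth = trie, 0
    for tok in tokens:
        if tok not in node:
            break
        node = node[tok]
        depth += 1
    return depth


def insert(trie: dict, tokens: list[str]) -> None:
    """Record a token sequence so later requests can reuse its prefix."""
    node = trie
    for tok in tokens:
        node = node.setdefault(tok, {})


def reuse_ratio(requests: list[list[str]]) -> float:
    """Fraction of prompt tokens that repeat a previously seen prefix."""
    trie: dict = {}
    reused = total = 0
    for tokens in requests:
        reused += cached_prefix_len(trie, tokens)
        total += len(tokens)
        insert(trie, tokens)
    return reused / total if total else 0.0


# Three requests sharing an 8-token system prompt, each with one unique token.
system_prompt = ["sys"] * 8
shared = [system_prompt + ["a"], system_prompt + ["b"], system_prompt + ["c"]]
```

A ratio near zero on a sample of real traffic suggests the single-turn, random-input case where the gap between engines narrows; a high ratio suggests the workload is worth benchmarking on both.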
| Quantization | Memory reduction | Quality loss | Use when |
|---|---|---|---|
| FP16 (baseline) | None | None | Highest fidelity; use when accuracy is critical and GPU memory is not a constraint |
| INT8 / FP8 (AWQ, GPTQ) | ~2× | < 2% on most tasks | Production default — sweet spot of quality vs cost; native FP8 on H100 = near FP16 quality with 2× throughput |
| INT4 (GPTQ, GGUF) | ~4× | 3–7% on most tasks | High-volume, lower-accuracy tasks (classification, routing); or CPU/edge inference with GGUF |
| 2-bit (AQLM, QuIP#) | ~8× | 5–15% | Experimental; use only with extensive task-specific evaluation |
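The memory reductions in the table follow directly from bits per weight. The sketch below estimates weight memory only; it leaves out KV cache, activations, and the per-group scales that quantized formats store, so real footprints are somewhat higher. The 80% usable-memory headroom is an assumption chosen for illustration.

```python
import math

BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "fp8": 8, "int4": 4, "2bit": 2}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[precision]
    return params_billions * 1e9 * bits / 8 / 1e9


def min_gpus(
    params_billions: float,
    precision: str,
    gpu_gb: float = 80.0,
    usable_fraction: float = 0.8,  # assumption: reserve 20% for KV cache etc.
) -> int:
    """Number of GPUs needed just to hold the weights."""
    needed = weight_memory_gb(params_billions, precision)
    return math.ceil(needed / (gpu_gb * usable_fraction))
```

For a 70B-parameter model this gives about 140 GB at FP16, 70 GB at INT8 or FP8, and 35 GB at INT4, which under the headroom assumption means three, two, and one 80 GB GPU respectively.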
Migration Playbook: API to Self-Hosted
Moving from a closed API to self-hosted (or hosted open model API) requires an eval-first, traffic-ramp approach. Cutting over all traffic in one shot is how teams discover quality regressions in production. The following order is non-negotiable.
Build an eval harness first
Before touching infrastructure, build a representative eval set: 200–500 prompts sampled from your actual production traffic, with expected outputs reviewed by humans. Measure: task success rate, error rate, and any domain-specific metric (e.g., extraction accuracy). This is your gate — every subsequent step is measured against this baseline.
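A harness of this kind can start very small. The sketch below assumes a model is any callable that maps a prompt to a string and uses exact match as a placeholder scorer; `EvalCase`, `run_eval`, and `exact_match` are illustrative names, and most real tasks will need a task-specific metric instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # human-reviewed reference output


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible scorer; swap in a task-specific metric as needed."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    cases: list[EvalCase],
    model: Callable[[str], str],
    scorer: Callable[[str, str], bool] = exact_match,
) -> dict:
    """Return task success rate and error rate for one model over the eval set."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = errors = 0
    for case in cases:
        try:
            output = model(case.prompt)
        except Exception:
            errors += 1  # count failed calls separately from wrong answers
            continue
        passed += int(scorer(output, case.expected))
    n = len(cases)
    return {"success_rate": passed / n, "error_rate": errors / n, "n": n}
```

Running the same `cases` through the current model and each candidate gives directly comparable numbers, which is what the later gates depend on.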
Run shadow traffic against the candidate model
Mirror 5–10% of production traffic to the candidate model (hosted open API or self-hosted) without serving its responses to users. Collect the outputs and score them with your eval harness. Run for at least 2 weeks to capture variance in real queries. Target: the candidate scores within 3–5% of baseline on your eval set before you proceed.
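Shadow traffic needs a stable way to pick which requests are copied to the candidate. One common approach, sketched here as an assumption rather than a required design, is to hash the request ID into a bucket so the same request always gets the same decision. `handle`, `primary`, and `candidate` are hypothetical names, and the candidate call is made inline only for brevity.

```python
import hashlib


def in_shadow_sample(request_id: str, fraction: float = 0.05) -> bool:
    """Deterministically select a fixed fraction of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


def handle(request_id, prompt, primary, candidate, shadow_log, fraction=0.05):
    """Serve the primary response; record the candidate's output on the side."""
    response = primary(prompt)
    if in_shadow_sample(request_id, fraction):
        try:
            shadow_log.append((request_id, prompt, response, candidate(prompt)))
        except Exception as exc:  # a shadow failure must never affect the user
            shadow_log.append((request_id, prompt, response, exc))
    return response
```

In production the candidate call would normally run off the request path, for example through a queue or background task, so that shadow latency never adds to user-facing latency.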
Validate latency and error budgets
Open model APIs and self-hosted models have different latency characteristics than closed APIs. Measure P50, P90, and P99 latency under your load profile. Validate that your timeout budgets, retry logic, and circuit breakers work with the new endpoint. Fix these before live traffic hits them.
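Percentiles are straightforward to compute from raw latency samples with the standard library. The sketch below uses `statistics.quantiles` with the inclusive method; the `within_budget` check against a client timeout is an illustrative assumption about how a team might gate on P99.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict:
    """P50, P90, and P99 from raw latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


def within_budget(samples_ms: list[float], timeout_ms: float) -> bool:
    """True when P99 latency fits inside the client timeout."""
    return latency_percentiles(samples_ms)["p99"] <= timeout_ms
```

Collect samples under a load profile that resembles production, since tail latency on a shared or self-hosted endpoint can look very different at low concurrency.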
Ramp live traffic gradually
Use a feature flag or weighted routing to ramp: 5% → 20% → 50% → 100% over two weeks minimum. Monitor your eval metrics and error rate at each step. Automate a rollback trigger: if task success rate drops more than 5% from baseline, automatically roll back to the previous model.
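The ramp and the rollback trigger can be expressed as a small pure function that a scheduler or deploy job calls at each checkpoint. This sketch reads "drops more than 5% from baseline" as a relative drop, which is an assumption; if the intent is 5 percentage points, the comparison changes accordingly. Function names are illustrative.

```python
RAMP_STEPS = [0.05, 0.20, 0.50, 1.00]  # share of live traffic on the candidate


def should_roll_back(
    baseline_success: float,
    current_success: float,
    max_relative_drop: float = 0.05,  # assumption: 5% is relative to baseline
) -> bool:
    """True when success rate fell more than the allowed share of baseline."""
    return current_success < baseline_success * (1 - max_relative_drop)


def next_share(
    current_share: float, baseline_success: float, current_success: float
) -> float:
    """Advance one ramp step, or return 0.0 to send all traffic back."""
    if should_roll_back(baseline_success, current_success):
        return 0.0
    later = [step for step in RAMP_STEPS if step > current_share]
    return later[0] if later else current_share
```

Because the function has no side effects, the same logic can be unit-tested and then wired into whatever feature-flag or weighted-routing system is already in place.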
Establish ongoing quality monitoring
Once fully migrated, schedule weekly eval runs on a random sample of production outputs. Open model providers can update models with behavioral changes, and self-hosted models can drift if fine-tuning is applied without re-evaluation. Set a quality alert threshold and page on degradation.
The shadow traffic phase catches distribution shift between your eval set and real production traffic. Teams that skip it and go straight to live traffic regularly discover edge cases (unusual input formats, languages, or adversarial prompts) that their eval set did not cover. Two weeks of shadow traffic is cheap compared to a visible quality regression.
Decision Framework
Four dimensions determine the right deployment tier for any given workload. Score your situation on each axis — if any single dimension strongly favors one option, it usually overrides the others.
Pick your quadrant, then step down to a cheaper model once quality is verified
| Dimension | Favors Closed API | Favors Hosted Open API | Favors Self-Hosted |
|---|---|---|---|
| Privacy / Compliance | Non-sensitive data; provider DPA is sufficient | Non-sensitive; want open model quality at lower cost | PII, healthcare, financial, defense, legal; data cannot leave your infra |
| Daily Token Volume | < 5M tokens/day | 5M–300M tokens/day | > 300M tokens/day (with ops team) |
| Ops Capacity | No ML infra team | No ML infra team | Dedicated ML infra team (minimum 0.5 FTE) |
| Customization Need | Prompt engineering is sufficient | Prompt engineering is sufficient; model selection flexibility | Fine-tuning, custom decoding, domain adaptation, or model modification needed |
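The table can be read as a small rule set in which privacy and customization needs override volume. The function below is one possible encoding, offered as a sketch: the thresholds come from the table, while the parameter names and the exact precedence of the rules are assumptions.

```python
def recommend_tier(
    sensitive_data: bool,
    tokens_per_day: float,
    has_ml_infra_team: bool,
    needs_weight_access: bool,
) -> str:
    """Map the four dimensions in the table to a suggested starting tier."""
    # Privacy and customization needs override the volume-based default.
    if sensitive_data or needs_weight_access:
        return "self-hosted"
    # High volume only favors self-hosting when a team exists to run it.
    if tokens_per_day > 300e6 and has_ml_infra_team:
        return "self-hosted"
    if tokens_per_day >= 5e6:
        return "hosted open API"
    return "closed API"
```

Note that the first rule can return "self-hosted" even when no ML infra team exists; in that case the output is a signal that the team has to be built or contracted, not that the requirement can be waived.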
Start with a closed API to validate product-market fit and establish your quality baseline. Once you have stable usage above 5M tokens/day, evaluate hosted open model APIs — the migration is usually a one-afternoon API endpoint swap. Only consider self-hosting when privacy requirements or volume above 300M tokens/day make the ops overhead worthwhile.
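An endpoint swap stays a small change only if application code never depends on a specific provider. The sketch below shows one way to arrange that with a `Protocol` and a config table. The URLs, model names, and environment variable names are placeholders, and `EchoBackend` is a stand-in that makes the example runnable without network access.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


@dataclass
class EndpointConfig:
    base_url: str      # placeholder values below, not real endpoints
    model: str
    api_key_env: str   # name of the environment variable holding the key


# Hypothetical configuration: swapping providers becomes a config change.
ENDPOINTS = {
    "closed": EndpointConfig(
        "https://closed.example.com/v1", "closed-model", "CLOSED_API_KEY"
    ),
    "hosted_open": EndpointConfig(
        "https://hosted.example.com/v1", "open-model", "HOSTED_API_KEY"
    ),
}


class EchoBackend:
    """Stand-in backend for tests; a real one would call config.base_url."""

    def __init__(self, config: EndpointConfig):
        self.config = config

    def complete(self, prompt: str) -> str:
        return f"[{self.config.model}] {prompt}"


def summarize(backend: ChatBackend, text: str) -> str:
    """Application code depends only on the ChatBackend interface."""
    return backend.complete(f"Summarize: {text}")
```

Calling `summarize(EchoBackend(ENDPOINTS["hosted_open"]), text)` instead of the `"closed"` entry changes the provider without touching `summarize`, which is the property that keeps later migrations cheap.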
Best Practices
Do
- Evaluate hosted open model APIs (Together, Fireworks, Groq) before committing to self-hosting; they are 3–10× cheaper with zero ops overhead
- Start with closed APIs for the first version of any feature; validate the prompt and quality baseline before optimizing cost
- Build an eval harness before any model migration: 200–500 production-sampled prompts with human-reviewed expected outputs
- Read the full model license text before building production dependencies on any open model
- Check the Llama 4 EU restriction if your company, its parent entity, or the individual licensee is domiciled in the EU or has its principal place of business there
- Budget self-hosting engineering time explicitly in TCO calculations: 0.5 FTE for setup and 0.1 FTE ongoing at your fully loaded salary
- Benchmark your specific workload against candidate models, not just public benchmarks; domain gaps are where the choice gets made
- Build model-agnostic abstractions so you can swap inference providers without rewiring application logic
- Ramp live traffic gradually over at least two weeks with automated quality gates and rollback triggers
Don’t
- Don't self-host for cost reasons before your volume exceeds 300M tokens/day with ops overhead factored in; the break-even is much later than the GPU-cost-only math suggests
- Don't assume 'open source' means Apache 2.0: Llama 4 and Mistral Large 3 carry distinct restrictions, and terms can change between versions of the same model family
- Don't use Llama 4 if your company or its parent entity is EU-domiciled without explicit legal review and a Meta agreement
- Don't run A100s for new deployments; H100s deliver 2–3× better cost-per-token in 2026 at comparable or lower spot prices
- Don't skip quantization analysis; INT8 (AWQ/GPTQ) halves GPU memory with <2% quality loss on most tasks
- Don't cut over 100% of traffic to a new model in one shot; always ramp with eval gates and automatic rollback
- Don't choose vLLM or SGLang without benchmarking your specific workload; the 29% throughput delta only shows up on workloads that benefit from RadixAttention
- Don't ignore model deprecation risk with closed APIs; provider pricing and model availability change on timelines you cannot control
- Don't use open model outputs as fine-tuning data without checking the license; many licenses restrict using outputs to train competing models
Key Takeaways
- The quality gap between top open and closed models is now single-digit percentage points on most tasks; the choice is primarily about ops, cost, and licensing, not capability.
- Three deployment tiers exist: closed API, hosted open model API (zero ops, 3–10× cheaper), and self-hosted. Most teams skip the middle tier and pay more than they need to.
- Self-hosting beats closed APIs at ~40M tokens/day in GPU-cost-only terms, but only approaches hosted open model API pricing at ~1B tokens/day once ops overhead is included.
- Llama 4 Community License bans EU-domiciled companies from using any Llama 4 model; the entire Llama 4 family is multimodal and subject to this restriction.
- For self-hosting, the inference engine choice matters: SGLang delivers ~29% more throughput than vLLM on H100s for multi-turn and RAG workloads with shared prefixes.
- Always migrate with shadow traffic, eval gates, and a staged ramp; cutting over 100% of traffic in one shot is how teams discover production quality regressions.