Self-Hosting LLMs

When and how to self-host open-weight LLMs in production: comparing vLLM, SGLang, and TGI, understanding MoE vs dense model tradeoffs, calculating the break-even point against both frontier and hosted open-model APIs, and deploying a high-throughput serving stack with proper monitoring.

Quick Reference

→Self-host when: regulated data/privacy, >$10K/month on frontier APIs, <200ms TTFT requirement, or fine-tuned models
→Break-even: ~5M tokens/day vs frontier APIs (Claude, GPT-5), but ~50M tokens/day vs hosted open-model APIs (Fireworks, Together) — always compare the right baseline
→vLLM: production default — widest model support, OpenAI-compatible API, hardware flexibility (GPU, TPU, Trainium)
→SGLang: throughput leader on H100s, with roughly 29% higher throughput than vLLM; RadixAttention gives up to a 6.4x speedup on prefix-heavy workloads
→TGI v3: best for long-prompt workloads and teams deep in the HuggingFace ecosystem
→MoE VRAM trap: Llama 4 Scout has 17B active parameters but needs 109B loaded in VRAM — all expert weights must fit
→GPU memory formula: (params_B × bytes_per_param) + KV_cache + ~7 GB CUDA overhead = total VRAM needed
→Monitor TTFT, throughput, queue depth, and KV cache usage from day one — self-hosted LLMs fail silently
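
The GPU memory formula above can be sketched in a few lines. This is a rough estimator, not a sizing tool: the function names are made up for illustration, the layer and head counts in the example are assumed values for a Llama-style 70B model with grouped-query attention, and the 7 GB overhead is simply the figure from the formula.

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, cached_tokens, bytes_per_value=2):
    """Approximate KV cache size in GB.

    The leading 2 accounts for storing both keys and values,
    for every layer and every cached token across all sequences.
    """
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * cached_tokens
    return total_bytes / 1e9


def total_vram_gb(params_b, bytes_per_param, kv_gb, overhead_gb=7.0):
    """Weights + KV cache + fixed overhead, all in GB.

    params_b is the TOTAL parameter count in billions, so
    params_b * bytes_per_param is already in GB.
    """
    return params_b * bytes_per_param + kv_gb + overhead_gb


# Assumed architecture for illustration: 80 layers, 8 KV heads, head_dim 128,
# with 32,768 tokens cached in total across all in-flight requests.
kv = kv_cache_gb(num_layers=80, num_kv_heads=8, head_dim=128, cached_tokens=32_768)

print(f"KV cache:        {kv:.1f} GB")
print(f"70B at 16-bit:   {total_vram_gb(70, 2.0, kv):.1f} GB")
print(f"70B at 4-bit:    {total_vram_gb(70, 0.5, kv):.1f} GB")
```

The KV cache term grows with concurrency and context length, so the same model can fit comfortably at low load and run out of memory under production traffic.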

Should You Self-Host?

Self-hosting is an infrastructure commitment

Self-hosting LLMs means you own GPU provisioning, model updates, scaling, and reliability, which is a significant operational burden. Self-host only when the benefits (privacy, cost savings, latency, customization) clearly outweigh that burden. For most teams below 5M tokens/day, API providers win on total cost of ownership.

Figure: decision gates for self-hosting. Most teams stop at gate 2 (cost), where API spend below $10K/month means operational overhead exceeds GPU savings.

Motivation	Self-host when	Use APIs when
Data privacy	Regulated industries (healthcare, finance), PII in prompts, on-premise requirement	Non-sensitive data, provider offers BAA/DPA, compliance met by contract
Cost at scale	Spending >$10K/month on frontier APIs (Claude, GPT-5) or >$50K/month on hosted open-model APIs	Below those thresholds — operational overhead exceeds GPU savings
Latency	Need <200ms TTFT consistently, P99 latency SLA that APIs can't guarantee	500ms–1s TTFT acceptable, or latency variance is tolerable for your use case
Customization	Fine-tuned models, LoRA adapters, custom architectures not available via API	Off-the-shelf models via prompt engineering cover your use case
Availability	Need 99.99% uptime with zero external dependency	99.9% with multi-provider failover is acceptable

The most common mistake is self-hosting too early. The break-even point depends on which API you're replacing. Against frontier APIs like Claude Opus or GPT-5 (~$3–15/1M tokens), self-hosting a 70B model on an A100 80GB spot instance (~$2.50/hr) breaks even around 5M tokens/day, assuming a 70/30 input/output split. Note that a 70B model fits on a single 80 GB card only when quantized: at 16-bit precision the weights alone are about 140 GB. Against hosted open-model APIs like Fireworks or Together AI serving the same Llama 3.3 70B at ~$0.50–0.90/1M tokens, that break-even jumps to 50M+ tokens/day. Teams often compare against frontier API prices while planning to self-host an open model. The savings are real, but so is the 10x difference in the crossover point.
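
The break-even arithmetic is simple enough to run yourself before committing to hardware. The sketch below is a simplification under stated assumptions: it counts GPU rental only (no engineering time, idle capacity, or redundancy), it assumes the GPU can actually sustain the required throughput, and the function names and example prices are illustrative picks from the ranges quoted above rather than quotes from any provider.

```python
def blended_price_per_m(input_price, output_price, input_share=0.7):
    """Blended API price per 1M tokens for a given input/output mix."""
    return input_share * input_price + (1 - input_share) * output_price


def break_even_tokens_per_day(gpu_cost_per_hour, api_price_per_m, num_gpus=1):
    """Daily token volume at which GPU rental equals API spend.

    Assumes the GPUs run (and are billed) 24 hours a day.
    """
    daily_gpu_cost = gpu_cost_per_hour * 24 * num_gpus
    return daily_gpu_cost / api_price_per_m * 1_000_000


GPU_COST_PER_HOUR = 2.50  # A100 80GB spot price used in the text

# Assumption: read "$3-15 per 1M" as $3 input / $15 output.
frontier = blended_price_per_m(input_price=3.0, output_price=15.0)
# Assumption: a flat $0.70 per 1M, the midpoint of the hosted open-model range.
hosted_open = 0.70

for label, price in [("frontier API", frontier), ("hosted open-model API", hosted_open)]:
    tokens = break_even_tokens_per_day(GPU_COST_PER_HOUR, price)
    print(f"{label}: ${price:.2f}/1M -> break-even at {tokens / 1e6:.1f}M tokens/day")
```

Treat the printed numbers as order-of-magnitude estimates. They move with the prices, GPU count, and utilization you plug in, but the gap between the two baselines stays at roughly an order of magnitude.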

Which Model to Self-Host

Choosing a serving framework before choosing a model is backwards. The model dictates your VRAM budget, which dictates your GPU choice, which shapes your infrastructure architecture. In 2026, the open-weight landscape splits into two categories: dense models (all parameters are active per token) and Mixture-of-Experts (MoE) models (only a fraction of parameters activate per token). The distinction matters enormously for self-hosting because VRAM requirements are set by total parameters, not active ones.
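
To make the dense vs MoE distinction concrete, here is a small sketch comparing weight memory (driven by total parameters) with per-token compute (driven by active parameters). The `Model` class is hypothetical, the dense 70B entry is a generic stand-in, and the "2 FLOPs per active parameter per token" figure is a common rule of thumb, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_params_b: float   # all parameters, in billions
    active_params_b: float  # parameters used per token, in billions

    def weight_gb(self, bytes_per_param=2.0):
        """VRAM for weights alone: every parameter must be resident."""
        return self.total_params_b * bytes_per_param

    def gflops_per_token(self):
        """Rough forward-pass compute: ~2 FLOPs per ACTIVE parameter."""
        return 2 * self.active_params_b


models = [
    Model("Dense 70B", total_params_b=70, active_params_b=70),
    Model("Llama 4 Scout (MoE)", total_params_b=109, active_params_b=17),
]

for m in models:
    print(f"{m.name}: {m.weight_gb():.0f} GB of weights at 16-bit, "
          f"~{m.gflops_per_token():.0f} GFLOPs per token")
```

Under these assumptions the MoE model needs more memory than the dense 70B while doing a fraction of the compute per token, which is why a VRAM budget sized from the active parameter count comes up short.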

Framework Comparison: vLLM vs SGLang vs TGI

Ollama excluded — sequential request handling makes it unsuitable for production multi-user serving
