Self-Hosting LLMs
When and how to self-host open-weight LLMs in production: comparing vLLM, SGLang, and TGI, understanding MoE vs dense model tradeoffs, calculating the break-even point against both frontier and hosted open-model APIs, and deploying a high-throughput serving stack with proper monitoring.
Quick Reference
- Self-host when: regulated data/privacy, >$10K/month on frontier APIs, <200ms TTFT requirement, or fine-tuned models
- Break-even: ~5M tokens/day vs frontier APIs (Claude, GPT-5), but ~50M tokens/day vs hosted open-model APIs (Fireworks, Together). Always compare the right baseline
- vLLM: production default, with the widest model support, an OpenAI-compatible API, and hardware flexibility (GPU, TPU, Trainium)
- SGLang: throughput leader on H100s, 29% faster than vLLM; RadixAttention gives 6.4x speedup on prefix-heavy workloads
- TGI v3: best for long-prompt workloads and teams deep in the HuggingFace ecosystem
- MoE VRAM trap: Llama 4 Scout has 17B active parameters but needs 109B loaded in VRAM, because all expert weights must fit
- GPU memory formula: (params_B × bytes_per_param) + KV_cache + ~7 GB CUDA overhead = total VRAM needed
- Monitor TTFT, throughput, queue depth, and KV cache usage from day one, because self-hosted LLMs fail silently
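The GPU memory formula and the MoE VRAM trap from the list above can be worked through in a few lines. This is a rough sizing sketch, not a capacity planner: the function name `estimate_vram_gb` and the 20 GB KV cache budget are assumptions made for illustration, and the real KV cache size depends on context length, batch size, and the model's attention layout.

```python
def estimate_vram_gb(total_params_b: float, bytes_per_param: float,
                     kv_cache_gb: float, cuda_overhead_gb: float = 7.0) -> float:
    """Total VRAM in GB: weights + KV cache + fixed CUDA overhead.

    total_params_b is the TOTAL parameter count in billions. For MoE models
    that means every expert, not just the parameters active per token.
    """
    weights_gb = total_params_b * bytes_per_param  # 1B params at 1 byte ~= 1 GB
    return weights_gb + kv_cache_gb + cuda_overhead_gb


# Hypothetical KV cache budget, chosen only to make the arithmetic concrete.
KV_CACHE_GB = 20.0

# Dense 70B model at FP16 (2 bytes per parameter).
dense_70b_fp16 = estimate_vram_gb(70, 2.0, KV_CACHE_GB)

# Llama 4 Scout: 17B active parameters, but all 109B must be resident.
scout_fp16 = estimate_vram_gb(109, 2.0, KV_CACHE_GB)

# The trap: sizing by active parameters badly underestimates the requirement.
scout_sized_by_active = estimate_vram_gb(17, 2.0, KV_CACHE_GB)

print(f"70B dense, FP16:          {dense_70b_fp16:.0f} GB")
print(f"Scout, FP16 (109B total): {scout_fp16:.0f} GB")
print(f"Scout sized by 17B:       {scout_sized_by_active:.0f} GB (underestimate)")
```

Lowering `bytes_per_param` (for example 1.0 for 8-bit or 0.5 for 4-bit weights) shrinks only the weights term; the KV cache and overhead terms stay the same.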
Should You Self-Host?
Self-hosting LLMs means you own GPU provisioning, model updates, scaling, and reliability. This is a significant operational burden. Only self-host when the benefits — privacy, cost, latency, customization — clearly outweigh that cost. For most teams below 5M tokens/day, API providers win on total cost of ownership.
Figure (decision flow for self-hosting): most teams stop at gate 2, where API spend below $10K/month means operational overhead exceeds GPU savings.
| Motivation | Self-host when | Use APIs when |
|---|---|---|
| Data privacy | Regulated industries (healthcare, finance), PII in prompts, on-premise requirement | Non-sensitive data, provider offers BAA/DPA, compliance met by contract |
| Cost at scale | Spending >$10K/month on frontier APIs (Claude, GPT-5) or >$50K/month on hosted open-model APIs | Below those thresholds — operational overhead exceeds GPU savings |
| Latency | Need <200ms TTFT consistently, P99 latency SLA that APIs can't guarantee | 500ms–1s TTFT acceptable, or latency variance is tolerable for your use case |
| Customization | Fine-tuned models, LoRA adapters, custom architectures not available via API | Off-the-shelf models via prompt engineering cover your use case |
| Availability | Need 99.99% uptime with zero external dependency | 99.9% with multi-provider failover is acceptable |
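The table's thresholds can be read as a simple checklist. The sketch below encodes them as one possible helper; the `Workload` type, its field names, and `self_host_reasons` are hypothetical, and the cutoffs are taken directly from the table rather than from any benchmark.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    regulated_data: bool               # PII, healthcare/finance, on-premise rules
    frontier_api_spend_usd: float      # monthly spend on frontier APIs
    hosted_open_api_spend_usd: float   # monthly spend on hosted open-model APIs
    required_ttft_ms: float            # TTFT the product needs consistently
    needs_custom_model: bool           # fine-tunes, LoRA adapters, custom archs
    required_uptime_pct: float         # e.g. 99.9 or 99.99


def self_host_reasons(w: Workload) -> list[str]:
    """Return the table rows that argue for self-hosting (empty = use APIs)."""
    reasons = []
    if w.regulated_data:
        reasons.append("data privacy")
    if w.frontier_api_spend_usd > 10_000 or w.hosted_open_api_spend_usd > 50_000:
        reasons.append("cost at scale")
    if w.required_ttft_ms < 200:
        reasons.append("latency")
    if w.needs_custom_model:
        reasons.append("customization")
    if w.required_uptime_pct >= 99.99:
        reasons.append("availability")
    return reasons


# Example: a small team with modest API spend and relaxed latency needs.
small_team = Workload(False, 4_000, 0, 800, False, 99.9)
print(self_host_reasons(small_team) or "stay on APIs")
```

An empty result is the common case; a single matching row is a reason to investigate further, not a decision on its own.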
The most common mistake is self-hosting too early. The break-even point depends on which API you're replacing. Against frontier APIs like Claude Opus or GPT-5 (~$3–15/1M tokens), self-hosting a 70B model on an A100 80GB spot instance (~$2.50/hr) breaks even around 5M tokens/day, assuming a 70/30 input/output split. Note that a 70B model fits on a single 80 GB card only when quantized, since its FP16 weights alone are about 140 GB. Against hosted open-model APIs like Fireworks or Together AI serving the same Llama 3.3 70B at ~$0.50–0.90/1M tokens, that break-even jumps to 50M+ tokens/day. Teams often compare against frontier API prices while planning to self-host an open model, when the fair baseline is a hosted API serving that same open model. The savings are real, but so is the 10x difference in the crossover point.
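The break-even reasoning above reduces to one division: daily GPU cost over the API price per token. The sketch below is a simplified model that counts GPU rental only; the function names are hypothetical, and the specific price points ($3 input / $15 output for a frontier API, $0.70 for a hosted open model) are assumptions picked from within the ranges quoted in the text.

```python
def blended_price_per_m(input_price: float, output_price: float,
                        input_share: float = 0.7) -> float:
    """Blended API price in $ per 1M tokens for a given input/output split."""
    return input_share * input_price + (1.0 - input_share) * output_price


def break_even_tokens_per_day(gpu_cost_per_hour: float,
                              api_price_per_m: float) -> float:
    """Daily token volume at which API spend equals 24 hours of GPU rental.

    Counts GPU rental only. Engineering time, redundancy, and idle capacity
    would push the real break-even higher.
    """
    daily_gpu_cost = gpu_cost_per_hour * 24
    return daily_gpu_cost / api_price_per_m * 1_000_000


GPU_HOURLY = 2.50  # A100 80GB spot price quoted in the text

# Assumed price points (illustrative, chosen from the quoted ranges).
frontier_price = blended_price_per_m(3.0, 15.0)  # 70/30 input/output split
hosted_open_price = 0.70                         # within $0.50-0.90 per 1M

frontier_break_even = break_even_tokens_per_day(GPU_HOURLY, frontier_price)
hosted_break_even = break_even_tokens_per_day(GPU_HOURLY, hosted_open_price)

print(f"vs frontier API:    {frontier_break_even / 1e6:.1f}M tokens/day")
print(f"vs hosted open API: {hosted_break_even / 1e6:.1f}M tokens/day")
print(f"crossover ratio:    {hosted_break_even / frontier_break_even:.1f}x")
```

The output is very sensitive to the assumed API price, so any quoted break-even figure is best treated as an order-of-magnitude estimate. Rerunning the calculation with your own contract prices and measured input/output split is more useful than any fixed threshold.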