Production & Scale/Inference Optimization
★ OverviewAdvanced16 min

Self-Hosting LLMs

When and how to self-host open-weight LLMs in production: comparing vLLM, SGLang, and TGI, understanding MoE vs dense model tradeoffs, calculating the break-even point against both frontier and hosted open-model APIs, and deploying a high-throughput serving stack with proper monitoring.

Quick Reference

  • Self-host when: regulated data/privacy, >$10K/month on frontier APIs, <200ms TTFT requirement, or fine-tuned models
  • Break-even: ~5M tokens/day vs frontier APIs (Claude, GPT-5), but ~50M tokens/day vs hosted open-model APIs (Fireworks, Together) — always compare the right baseline
  • vLLM: production default — widest model support, OpenAI-compatible API, hardware flexibility (GPU, TPU, Trainium)
  • SGLang: throughput leader on H100s — 29% faster than vLLM; RadixAttention gives 6.4x speedup on prefix-heavy workloads
  • TGI v3: best for long-prompt workloads and teams deep in the HuggingFace ecosystem
  • MoE VRAM trap: Llama 4 Scout has 17B active parameters but needs 109B loaded in VRAM — all expert weights must fit
  • GPU memory formula: (params_B × bytes_per_param) + KV_cache + ~7 GB CUDA overhead = total VRAM needed
  • Monitor TTFT, throughput, queue depth, and KV cache usage from day one — self-hosted LLMs fail silently

Should You Self-Host?

Self-hosting is an infrastructure commitment

Self-hosting LLMs means you own GPU provisioning, model updates, scaling, and reliability. This is a significant operational burden. Only self-host when the benefits — privacy, cost, latency, customization — clearly outweigh that cost. For most teams below 5M tokens/day, API providers win on total cost of ownership.

NOYESNONOYESNOYESYESYESNORegulated data orprivacy mandate?Spending >$10K/monthon inference APIs?Need <200ms TTFTconsistently?Fine-tuned orcustom model?Team has GPUinfra experience?Self-host(privacy)Self-host(latency)Self-host(custom model)Self-host(cost-driven)Use APIprovidersHosted open APIs(start here first)

Most teams stop at gate 2 — API spend below $10K/month means operational overhead exceeds GPU savings

MotivationSelf-host whenUse APIs when
Data privacyRegulated industries (healthcare, finance), PII in prompts, on-premise requirementNon-sensitive data, provider offers BAA/DPA, compliance met by contract
Cost at scaleSpending >$10K/month on frontier APIs (Claude, GPT-5) or >$50K/month on hosted open-model APIsBelow those thresholds — operational overhead exceeds GPU savings
LatencyNeed <200ms TTFT consistently, P99 latency SLA that APIs can't guarantee500ms–1s TTFT acceptable, or latency variance is tolerable for your use case
CustomizationFine-tuned models, LoRA adapters, custom architectures not available via APIOff-the-shelf models via prompt engineering cover your use case
AvailabilityNeed 99.99% uptime with zero external dependency99.9% with multi-provider failover is acceptable

The most common mistake is self-hosting too early. The break-even point depends on which API you're replacing. Against frontier APIs like Claude Opus or GPT-5 (~$3–15/1M tokens), self-hosting a 70B model on an A100 80GB spot instance (~$2.50/hr) breaks even around 5M tokens/day, assuming a 70/30 input/output split. Against hosted open-model APIs like Fireworks or Together AI serving the same Llama 3.3 70B at ~$0.50–0.90/1M tokens, that break-even jumps to 50M+ tokens/day. Teams often compare against frontier API prices while planning to self-host an open model — the savings are real, but so is the 10x difference in the crossover point.