Self-Hosting LLMs
When and how to self-host LLMs for production: comparing vLLM, TGI, and Ollama, understanding hardware requirements, calculating the break-even point against API costs, and deploying a high-throughput serving stack.
Quick Reference
- Self-host when: privacy requirements, >$10K/month API spend, latency-sensitive workloads (<200ms TTFT), or a need for custom/fine-tuned models
- vLLM: highest throughput via PagedAttention + continuous batching — the production default for GPU serving
- TGI: Hugging Face ecosystem integration, good for teams already on HF — slightly lower throughput than vLLM
- Ollama: local development and prototyping only — not designed for production multi-user serving
- GPU memory rule: model params (in billions) × 2 ≈ minimum VRAM in GB at FP16 (2 bytes/param); add ~20% for the KV cache
- Break-even: self-hosting typically beats API costs at 50M+ tokens/day for a 70B model
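The GPU memory rule above can be sketched as a quick estimator. This is a back-of-the-envelope heuristic, not a precise sizing tool; the 20% KV-cache overhead is the rough figure from the rule, and real headroom depends on batch size and context length:

```python
def min_vram_gb(params_billion: float, bytes_per_param: int = 2,
                kv_overhead: float = 0.20) -> float:
    """Estimate minimum GPU memory for serving a model.

    FP16/BF16 weights take 2 bytes per parameter, so a model with
    P billion parameters needs roughly 2 * P GB for weights alone,
    plus ~20% headroom for the KV cache.
    """
    weights_gb = params_billion * bytes_per_param
    return weights_gb * (1 + kv_overhead)

print(round(min_vram_gb(7), 1))   # 7B model: 16.8 GB — fits on a 24 GB A10G
print(round(min_vram_gb(70), 1))  # 70B model: 168.0 GB — needs multiple GPUs
```

By this estimate a 70B model in FP16 does not fit on a single 80 GB A100, which is why 70B-class serving typically uses tensor parallelism across 2+ GPUs or quantization.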
When Self-Hosting Makes Sense
Self-hosting LLMs means you own GPU provisioning, model updates, scaling, and reliability. This is a significant operational burden. Only self-host when the benefits (privacy, cost, latency, customization) clearly outweigh the operational cost.
| Motivation | When It Applies | When APIs Are Better |
|---|---|---|
| Data privacy | Regulated industries (healthcare, finance), PII in prompts | Non-sensitive data, provider has BAA/DPA |
| Cost at scale | Spending >$10K/month on API calls | Spending <$5K/month — operational overhead exceeds savings |
| Latency | Need <200ms time-to-first-token consistently | Can tolerate 500ms-1s TTFT from APIs |
| Customization | Fine-tuned models, custom architectures | Off-the-shelf models with prompt engineering |
| Availability | Need 99.99% uptime with no dependency on external providers | 99.9% uptime acceptable, multi-provider failover works |
The most common mistake is self-hosting too early. Teams spend weeks setting up GPU infrastructure when their API bill is $500/month. The break-even point depends on traffic volume: for a 7B model, self-hosting on a single A10G ($0.75/hr on AWS spot, about $18/day) breaks even around 10M tokens/day. For a 70B model on an A100, the break-even is around 50M tokens/day.
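The break-even arithmetic reduces to dividing daily GPU cost by the API's per-token price. A minimal sketch, where the $1.80-per-1M-token blended API price is an illustrative assumption (it is what makes the text's 7B/A10G numbers line up), not a quote from any provider:

```python
def break_even_tokens_per_day(gpu_cost_per_hour: float,
                              api_price_per_million: float) -> float:
    """Daily token volume at which a 24/7 GPU costs the same as API calls."""
    daily_gpu_cost = gpu_cost_per_hour * 24          # GPU runs around the clock
    return daily_gpu_cost / api_price_per_million * 1_000_000

# A10G spot at $0.75/hr vs. an assumed blended API price of $1.80 per 1M tokens:
volume = break_even_tokens_per_day(0.75, 1.80)
print(f"{volume:,.0f} tokens/day")  # ~10,000,000 tokens/day
```

Note this counts raw GPU rental only; engineering time, monitoring, and redundant capacity push the real break-even higher, which is why the rule of thumb uses a comfortable margin over the raw number.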