
Self-Hosting LLMs

When and how to self-host LLMs for production: comparing vLLM, TGI, and Ollama, understanding hardware requirements, calculating the break-even point against API costs, and deploying a high-throughput serving stack.

Quick Reference

  • Self-host when: privacy requirements, >$10K/month API spend, latency-sensitive (<200ms TTFT), or need custom/fine-tuned models
  • vLLM: highest throughput via PagedAttention + continuous batching — the production default for GPU serving
  • TGI: Hugging Face ecosystem integration, good for teams already on HF — slightly lower throughput than vLLM
  • Ollama: local dev and prototyping only — not designed for production multi-user serving
  • GPU memory rule: params (in billions) × 2 ≈ minimum VRAM in GB at FP16 (2 bytes per parameter); add ~20% for KV cache
  • Break-even: self-hosting typically beats API costs at 50M+ tokens/day for a 70B model
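The VRAM rule of thumb above can be sketched as a quick estimator. The 2-bytes-per-parameter figure assumes FP16 weights; the 20% KV-cache overhead is a rough planning number, not a measured value:

```python
def min_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                kv_overhead: float = 0.20) -> float:
    """Rough minimum VRAM in GB: weights at FP16 (2 bytes/param)
    plus ~20% headroom for the KV cache."""
    weights_gb = params_billion * bytes_per_param
    return weights_gb * (1 + kv_overhead)

if __name__ == "__main__":
    for size in (7, 13, 70):
        print(f"{size}B model: ~{min_vram_gb(size):.0f} GB VRAM")
```

A 70B model at FP16 lands around 168 GB by this estimate, which is why such models are typically served across multiple GPUs or quantized to INT8/INT4.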

When Self-Hosting Makes Sense

Self-hosting is an infrastructure commitment

Self-hosting LLMs means you own GPU provisioning, model updates, scaling, and reliability. This is a significant operational burden. Only self-host when the benefits (privacy, cost, latency, customization) clearly outweigh the operational cost.

| Motivation | When It Applies | When APIs Are Better |
|---|---|---|
| Data privacy | Regulated industries (healthcare, finance), PII in prompts | Non-sensitive data, provider has BAA/DPA |
| Cost at scale | Spending >$10K/month on API calls | Spending <$5K/month — operational overhead exceeds savings |
| Latency | Need <200ms time-to-first-token consistently | Can tolerate 500ms-1s TTFT from APIs |
| Customization | Fine-tuned models, custom architectures | Off-the-shelf models with prompt engineering |
| Availability | Need 99.99% uptime with no dependency on external providers | 99.9% uptime acceptable, multi-provider failover works |

The most common mistake is self-hosting too early. Teams spend weeks setting up GPU infrastructure when their API bill is $500/month. The break-even point depends on traffic volume: for a 7B model, self-hosting on a single A10G ($0.75/hr on AWS spot) breaks even around 10M tokens/day. For a 70B model on an A100, the break-even is around 50M tokens/day.
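The break-even intuition above reduces to a ratio of daily GPU cost to per-token API price. A minimal sketch, assuming a dedicated GPU that can absorb the full load and ignoring engineering time; the $0.75/hr A10G spot rate comes from the text, while the API price per million tokens is a hypothetical input you would replace with your provider's actual rate:

```python
def breakeven_tokens_per_day(gpu_cost_per_hour: float,
                             api_cost_per_million_tokens: float) -> float:
    """Daily token volume at which a dedicated GPU matches API spend.
    Ignores engineering/ops time and assumes the GPU can serve the load."""
    daily_gpu_cost = gpu_cost_per_hour * 24
    return daily_gpu_cost / api_cost_per_million_tokens * 1_000_000

if __name__ == "__main__":
    # A10G spot at $0.75/hr vs a hypothetical $1.80 per million API tokens
    print(f"{breakeven_tokens_per_day(0.75, 1.80):,.0f} tokens/day")
```

Note the break-even falls as API prices rise and rises as they fall, so it is worth re-running this arithmetic whenever providers change pricing.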