# GPU Cost Modeling
Understand the GPU landscape for LLM inference: compare A100, H100, L40S, and A10G on specs and pricing, calculate actual $/token for self-hosted models, model the break-even point against API providers, and optimize with spot instances.
## Quick Reference
- H100 has 3.35 TB/s memory bandwidth vs A100's 2.0 TB/s — the #1 factor for LLM inference speed
- $/token formula: (GPU $/hr) / (tokens/hr at your batch size) = cost per token
- Break-even vs API: typically 50-100M tokens/day for 70B models, 10-20M tokens/day for 7B models
- Spot instances save 60-80% for batch workloads; reserved instances save 30-50% for steady-state serving
- Memory bandwidth, not FLOPs, is the bottleneck for LLM decode — buy bandwidth, not compute
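The $/token and break-even formulas can be made concrete in a few lines. This is a minimal sketch: the aggregate batched throughput (2,000 tok/s) and the API price ($1.20 per 1M tokens) are illustrative assumptions, not measurements.

```python
def cost_per_token(gpu_dollars_per_hr: float, tokens_per_sec: float) -> float:
    """$/token = (GPU $/hr) / (tokens generated per hour)."""
    return gpu_dollars_per_hr / (tokens_per_sec * 3600)

def break_even_tokens_per_day(gpu_dollars_per_hr: float,
                              api_dollars_per_token: float) -> float:
    """Daily volume at which a 24/7 GPU bill equals the API charge.
    (Only meaningful if the GPU can actually serve that volume.)"""
    return (gpu_dollars_per_hr * 24) / api_dollars_per_token

# A100 80GB at $4.60/hr, assuming ~2,000 tok/s aggregate batched throughput:
per_token = cost_per_token(4.60, 2000)                 # ~$0.64 per 1M tokens
# vs a hypothetical API priced at $1.20 per 1M tokens:
volume = break_even_tokens_per_day(4.60, 1.20 / 1e6)   # ~92M tokens/day
```

At these assumed prices the break-even lands inside the 50-100M tokens/day range quoted above; below that volume the fixed hourly GPU bill exceeds what the API would have charged.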
## GPU Comparison for LLM Inference
| GPU | VRAM | Memory BW (TB/s) | FP16 TFLOPs | AWS On-Demand ($/hr) | AWS Spot ($/hr) | Best For |
|---|---|---|---|---|---|---|
| NVIDIA A10G | 24 GB | 0.6 | 125 | $1.00 | $0.35 | 7B models, low-cost serving |
| NVIDIA L4 | 24 GB | 0.3 | 121 | $0.81 | $0.29 | Power-efficient inference, L4 pods |
| NVIDIA L40S | 48 GB | 0.86 | 366 | $1.84 | $0.65 | 13B-34B models, good $/perf |
| NVIDIA A100 40GB | 40 GB | 1.55 | 312 | $3.67 | $1.10 | 34B-70B models (INT4/INT8) |
| NVIDIA A100 80GB | 80 GB | 2.0 | 312 | $4.60 | $1.38 | 70B FP16, large batch sizes |
| NVIDIA H100 80GB | 80 GB | 3.35 | 990 | $8.22 | $2.47 | Highest throughput, latency-critical |
During the decode phase (generating tokens one at a time), the GPU must read the entire set of model weights for every token generated. A 70B INT4 model (~35 GB) on an A100 80GB (2.0 TB/s bandwidth) can theoretically generate 2000/35 ≈ 57 tokens/s per request; on an H100 (3.35 TB/s), the same model reaches ~96 tokens/s. The ~68% bandwidth increase translates almost directly into ~68% faster generation.
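That arithmetic generalizes to any GPU/model pair. A minimal sketch of the bandwidth-bound ceiling (it ignores KV-cache reads and kernel overhead, so real throughput is somewhat lower):

```python
def max_decode_tokens_per_sec(mem_bw_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-request decode speed: each token requires
    one full read of the weights, so speed <= bandwidth / model size."""
    return mem_bw_gb_s / weights_gb

WEIGHTS_GB = 35  # 70B parameters at INT4 (~0.5 bytes/param)
a100 = max_decode_tokens_per_sec(2000, WEIGHTS_GB)   # ~57 tok/s
h100 = max_decode_tokens_per_sec(3350, WEIGHTS_GB)   # ~96 tok/s
```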
FLOPs matter for the prefill phase (processing the input prompt), but memory bandwidth dominates decode cost. Since most of the wall-clock time in LLM serving is spent in decode, memory bandwidth is the metric to optimize for. This is why the H100 can justify nearly 2x the A100's on-demand price for latency-sensitive workloads: it is ~68% faster on decode.
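Combining the table's on-demand prices with those bandwidth-bound decode speeds makes the trade concrete. These are single-request figures under the same 70B INT4 assumptions as above; batched serving is far cheaper per token in absolute terms, but the ratio between the two GPUs is similar.

```python
def dollars_per_million_tokens(dollars_per_hr: float,
                               tokens_per_sec: float) -> float:
    """Convert an hourly GPU rate and a decode speed into $/1M tokens."""
    return dollars_per_hr / (tokens_per_sec * 3600) * 1e6

a100 = dollars_per_million_tokens(4.60, 57)   # ~$22.4 per 1M tokens
h100 = dollars_per_million_tokens(8.22, 96)   # ~$23.8 per 1M tokens
# H100: ~6% more per token, but ~68% lower latency per request.
```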