
GPU Cost Modeling

Understand the GPU landscape for LLM inference: compare A100, H100, L40S, and A10G on specs and pricing, calculate actual $/token for self-hosted models, model the break-even point against API providers, and optimize with spot instances.

Quick Reference

  • H100 has 3.35 TB/s memory bandwidth vs A100's 2.0 TB/s — the #1 factor for LLM inference speed
  • $/token formula: (GPU $/hr) / (tokens/hr at your batch size) = cost per token
  • Break-even vs API: typically 50-100M tokens/day for 70B models, 10-20M tokens/day for 7B models
  • Spot instances save 60-80% for batch workloads; reserved instances save 30-50% for steady-state serving
  • Memory bandwidth, not FLOPs, is the bottleneck for LLM decode — buy bandwidth, not compute
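The $/token and break-even formulas above can be sketched as a small calculator. The prices and throughput figures in the example are illustrative assumptions (2x H100 on-demand, 2,500 tok/s aggregate for a 70B model, an API at $5 per million tokens), not benchmarks — plug in your own measurements:

```python
def cost_per_token(gpu_price_per_hour: float, tokens_per_second: float,
                   num_gpus: int = 1) -> float:
    """$/token for self-hosting: (GPU $/hr) / (tokens/hr at your batch size)."""
    return (gpu_price_per_hour * num_gpus) / (tokens_per_second * 3600)

def break_even_tokens_per_day(gpu_price_per_hour: float,
                              api_price_per_token: float,
                              num_gpus: int = 1) -> float:
    """Daily volume at which self-hosting matches the API bill.
    Self-hosting is a fixed daily cost; the API bills per token."""
    daily_gpu_cost = gpu_price_per_hour * num_gpus * 24
    return daily_gpu_cost / api_price_per_token

# Assumed numbers: 2x H100 at $8.22/hr each, 2,500 tok/s aggregate,
# API priced at $5 per million tokens.
self_hosted = cost_per_token(8.22, 2500, num_gpus=2)      # ~$1.83 per M tokens
break_even = break_even_tokens_per_day(8.22, 5e-6, num_gpus=2)
print(f"self-hosted: ${self_hosted * 1e6:.2f}/M tokens")
print(f"break-even: {break_even / 1e6:.1f}M tokens/day")  # ~78.9M tokens/day
```

Note the break-even lands inside the 50-100M tokens/day range quoted above for 70B models; lower API prices or fewer serving hours push it higher.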

GPU Comparison for LLM Inference

| GPU | VRAM | Memory BW (TB/s) | FP16 TFLOPs | AWS On-Demand ($/hr) | AWS Spot ($/hr) | Best For |
|---|---|---|---|---|---|---|
| NVIDIA A10G | 24 GB | 0.6 | 125 | $1.00 | $0.35 | 7B models, low-cost serving |
| NVIDIA L4 | 24 GB | 0.3 | 121 | $0.81 | $0.29 | Power-efficient inference, L4 pods |
| NVIDIA L40S | 48 GB | 0.86 | 366 | $1.84 | $0.65 | 13B-34B models, good $/perf |
| NVIDIA A100 40GB | 40 GB | 1.55 | 312 | $3.67 | $1.10 | 34B-70B models (INT4/INT8) |
| NVIDIA A100 80GB | 80 GB | 2.0 | 312 | $4.60 | $1.38 | 70B FP16, large batch sizes |
| NVIDIA H100 80GB | 80 GB | 3.35 | 990 | $8.22 | $2.47 | Highest throughput, latency-critical |
Memory bandwidth determines decode speed

During the decode phase (generating output tokens one at a time), the GPU must read the entire set of model weights for every token generated, so peak decode speed is roughly memory bandwidth divided by model size. A 70B INT4 model (~35 GB of weights) on an A100 80GB (2.0 TB/s) can theoretically generate at 2,000 GB/s ÷ 35 GB ≈ 57 tokens/s per request; on an H100 (3.35 TB/s) the same model reaches ≈96 tokens/s. The ~68% bandwidth increase translates almost one-for-one into ~68% faster generation.
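This back-of-the-envelope generalizes to any GPU/model pair. A minimal roofline sketch, assuming each generated token reads all weights once; the `utilization` factor is an assumption you can lower, since real serving stacks typically reach well under 100% of peak bandwidth:

```python
def decode_tokens_per_second(bandwidth_tb_s: float, weights_gb: float,
                             utilization: float = 1.0) -> float:
    """Roofline decode rate: each generated token streams all model weights
    through the GPU once, so tokens/s = effective GB/s / weight GB."""
    return (bandwidth_tb_s * 1000 * utilization) / weights_gb

print(decode_tokens_per_second(2.0, 35))   # A100 80GB, 70B INT4: ~57 tok/s
print(decode_tokens_per_second(3.35, 35))  # H100 80GB, same model: ~96 tok/s
```

This is a per-request ceiling; batching serves many requests per weight read, multiplying aggregate throughput without changing per-request decode speed.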

FLOPs matter for the prefill phase (processing the input prompt), but memory bandwidth dominates decode cost. Since most of the wall-clock time in LLM serving is spent in decode, memory bandwidth is the metric to optimize for. This is why the H100 can justify roughly 1.8x the on-demand price of an A100 80GB for latency-sensitive workloads: it decodes about 67% faster.
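Plugging the table's on-demand prices and the single-request decode rates above into the $/token formula makes the tradeoff concrete (illustrative batch-size-1 arithmetic, not measured throughput):

```python
def dollars_per_million_tokens(price_per_hour: float,
                               tokens_per_second: float) -> float:
    """Convert hourly GPU price + decode rate into $ per million tokens."""
    return price_per_hour / (tokens_per_second * 3600) * 1e6

a100 = dollars_per_million_tokens(4.60, 57)  # ~$22.4 per M tokens
h100 = dollars_per_million_tokens(8.22, 96)  # ~$23.8 per M tokens
```

At batch size 1 the two come out nearly identical per token, so the H100 premium buys latency rather than savings; larger batch sizes, where its extra bandwidth and FLOPs stretch further, shift the per-token comparison.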