GPU Cost Modeling

Model the real cost of GPU inference in 2026: rank GPUs by bandwidth-per-dollar (not specs), calculate honest $/token with KV cache math at actual context lengths, compute break-even against current API pricing (GPT-5.4 at $2.50/$15, Claude Sonnet 4.6 at $3/$15), identify the five ways cost models lie, and deploy the base-burst-failover pattern for production cost optimization.

Quick Reference

→Bandwidth-per-dollar ranks GPUs better than specs: A100 80GB at $3.40/hr delivers 0.59 GB/s per dollar — rivaling the A10G
→$/token formula: (GPU $/hr) / (tokens/hr at your actual batch size and context length) = cost per token
→KV cache at 128K context is ~40× larger than at 4K — your cost model MUST parameterize context length
→Break-even vs frontier APIs: ~7-10M tokens/day for 70B models (GPT-5.4 at $2.50/$15.00), ~15M for 8B models
→Total cost of ownership is 1.3-1.5× raw GPU cost: add networking, storage, monitoring, and engineering time
→Spot instances save 60-80%; combine with reserved base + API failover for production resilience
→H200 (4.8 TB/s, 141 GB) and B200 (8.0 TB/s, 192 GB) are now available — H200 is the decode-speed leader
→Refresh your cost model every 90 days — GPU pricing dropped 15-20% in a single year

When GPU Cost Modeling Matters (and When to Skip It)

GPU cost modeling is for teams spending more than $5K/month on inference APIs, or actively evaluating self-hosting. Below that threshold, the engineering time to build and maintain a cost model exceeds the potential savings. Above it, a wrong model costs more than no model at all — because it gives false confidence.

Your Situation	Cost Model Needed?	Reason
<5M tokens/day on any model	No	API spend is too low for self-hosting to make financial sense
>20M tokens/day on a consistent model	Yes	Break-even is likely within 6-9 months
Changing models quarterly (evals in flux)	No	Model churn makes infrastructure cost unpredictable
Regulated data that can't leave your VPC	Yes	Cost is a secondary driver — compliance is primary
Batch-only workloads (offline evals, embeddings)	Yes	Strongest ROI case — spot instances with no latency SLA

Cost models have a shelf life

GPU prices dropped 15-20% between June 2025 and April 2026 (AWS cut P5/P4d instances up to 45%). Any cost model older than 6 months is a liability. Build it to pull live pricing from a config, not hard-coded constants.

The GPU Landscape (April 2026)

The right metric for LLM inference is not raw FLOPs or VRAM — it is memory bandwidth per dollar-hour. Decode-phase performance scales directly with how fast the GPU can read model weights from VRAM. The table below ranks GPUs by this metric using April 2026 AWS on-demand pricing.

Memory Bandwidth: The Decode Bottleneck

During decode, the GPU reads the entire model's weights from VRAM for every single output token. A 70B INT4 model is 35 GB. An A100 80GB has 2,000 GB/s bandwidth. Theoretical max: 2000 / 35 = ~57 tokens/s per stream. An H100 at 3,350 GB/s gives ~96 tokens/s. An H200 at 4,800 GB/s gives ~137 tokens/s. The formula is: `max_decode_tps = bandwidth_GB_s / model_size_GB`. FLOPs matter for the prefill phase (processing the input prompt, which scales with prompt length), but wall-clock time in most serving scenarios is dominated by decode.

Sign in to read this article

This is a premium article. Sign in with your Google account to continue.