Quantization

Choose the right quantization method for your hardware: FP8 as the 2026 production default on Hopper/Blackwell, AWQ-INT4 via Marlin kernels for Ampere, and GGUF for edge. Quantize with llm-compressor and GPTQModel, validate with task-specific evals, and monitor for quality drift in production.

Quick Reference

  • FP8 is the 2026 production default: <0.5% quality loss, no calibration required, 2× memory reduction on Hopper/Ada/Blackwell
  • Quantization reduces precision: BF16 (2 bytes per weight) → FP8 (1 byte) → INT4 (0.5 bytes), so the weights of a 70B model drop from 140 GB at BF16 to 35 GB at INT4
  • AWQ-INT4 with Marlin kernels achieves ~741 tok/s on A100 for Llama 3 70B — 10.9× faster than plain AWQ
  • llm-compressor (vLLM's official tool) replaces deprecated AutoGPTQ and AutoAWQ — use it for FP8, AWQ, and GPTQ
  • KV cache quantization (FP8 per-head scales) halves KV cache memory — critical for long-context serving at 32K+ tokens
  • NVFP4 is Blackwell-native FP4: ~3.3× memory reduction vs BF16 with quality between FP8 and INT4-AWQ
  • Quality loss from quantization is task-dependent — arithmetic, code, and long-context tasks degrade more than summarization
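
The memory figures in the list above follow from simple arithmetic on parameter counts and bytes per value. The following is a minimal sketch, not a sizing tool: it counts weights only (no quantization scales, activations, or runtime overhead) and uses decimal gigabytes. The function names are illustrative, and the KV-cache shape (80 layers, 8 KV heads, head dimension 128) is assumed here as the Llama 3 70B configuration; substitute the values from your own model's config.

```python
# Bytes per stored value for each precision named in the list above.
BYTES_PER_VALUE = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weights-only footprint in decimal GB (ignores scales and zero points)."""
    return num_params * BYTES_PER_VALUE[precision] / 1e9


def kv_cache_gb(seq_len: int, precision: str, layers: int = 80,
                kv_heads: int = 8, head_dim: int = 128) -> float:
    """KV cache for ONE sequence: one K and one V vector per layer per token.

    The default shape is an assumption (Llama 3 70B style, grouped-query
    attention with 8 KV heads); it is not taken from the article.
    """
    values = 2 * layers * kv_heads * head_dim * seq_len
    return values * BYTES_PER_VALUE[precision] / 1e9


if __name__ == "__main__":
    for precision in ("bf16", "fp8", "int4"):
        gb = weight_memory_gb(70e9, precision)
        print(f"70B weights at {precision}: {gb:.0f} GB")
    for precision in ("bf16", "fp8"):
        gb = kv_cache_gb(32_768, precision)
        print(f"32K-token KV cache at {precision}: {gb:.2f} GB per sequence")
```

Under these assumptions a single 32K-token sequence needs roughly 10.7 GB of KV cache at BF16, which is why halving it with FP8 matters once several long sequences are served concurrently.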

Should You Quantize?

Quantization is a self-hosting optimization. If you use an API provider (OpenAI, Anthropic, etc.), you don't control the serving hardware, and the provider makes the quantization decisions for you. This article is relevant only if you are already self-hosting a model or seriously evaluating it. If you haven't made that decision yet, the inference-optimization/self-hosting article covers the break-even analysis.

Quantization is not free compression

Every quantization method has a quality cost, an eval-and-monitoring cost, and an operational cost. On Hopper/Blackwell hardware, FP8 is nearly free (no calibration, <0.5% loss). On older Ampere hardware, INT4 requires calibration data, an eval pass, and ongoing monitoring for drift. Neither path is zero-effort.
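
To make the quality cost concrete, the sketch below round-trips synthetic weights through plain symmetric absmax quantization at 8 and 4 bits and reports the reconstruction error. This is deliberately the simplest round-to-nearest scheme, not AWQ, GPTQ, or FP8, and the Gaussian weights, group size of 128, and function names are all assumptions chosen for illustration. Weight reconstruction error is only a proxy; it does not replace a task-specific eval.

```python
import math
import random


def quantize_roundtrip(values, bits, group_size=128):
    """Symmetric absmax quantization per group, followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # One scale per group; fall back to 1.0 if the group is all zeros.
        scale = max(abs(v) for v in group) / qmax or 1.0
        for v in group:
            q = max(-qmax, min(qmax, round(v / scale)))  # integer code
            restored.append(q * scale)                   # back to float
    return restored


def relative_rms_error(original, restored):
    """RMS of the reconstruction error divided by RMS of the original."""
    num = sum((a - b) ** 2 for a, b in zip(original, restored))
    den = sum(a ** 2 for a in original)
    return math.sqrt(num / den)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a weight tensor; not taken from a real model.
    weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
    for bits in (8, 4):
        err = relative_rms_error(weights, quantize_roundtrip(weights, bits))
        print(f"INT{bits}: relative RMS error = {err:.4f}")
```

The 4-bit error comes out far larger than the 8-bit error because each group has only 15 usable levels instead of 255. Calibration-based methods such as AWQ and GPTQ exist to reduce the impact of that gap, which is where the calibration data and eval pass come from.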

| When to quantize | When NOT to quantize |
| --- | --- |
| Model doesn't fit on your GPU at BF16 | You are using an API provider |
| You need higher throughput from existing GPUs | Your model fits at BF16 and throughput is sufficient |
| You want to serve a larger model on the same hardware | You haven't benchmarked BF16 throughput yet |
| Serving cost is a hard constraint | You're on Hopper/Blackwell and haven't tried FP8 (try FP8 first) |

| GPU Generation | Default Precision | Why |
| --- | --- | --- |
| Blackwell (B100, B200) | FP8 or NVFP4 | Native FP8 tensor cores + FP4 support; NVFP4 for maximum compression |
| Hopper / Ada (H100, H200, L40S, RTX 40xx) | FP8 | Native FP8 tensor cores; 2× memory reduction with near-zero quality loss |
| Ampere (A100, A10, RTX 30xx) | AWQ-INT4 + Marlin | No native FP8 tensor cores; Marlin-AWQ gives best throughput at INT4 |
| CPU / Apple Silicon / edge | GGUF INT4–INT8 | llama.cpp with Metal/CUDA acceleration; GGUF is the standard format |
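
The hardware table can be encoded as a small lookup keyed on CUDA compute capability. This is a sketch under assumptions: the function name is hypothetical, and the capability boundaries used (8.0 and 8.6 for Ampere, 8.9 for Ada, 9.0 for Hopper, 10.0 and above for Blackwell) are NVIDIA's published values rather than something stated in this article. Treat the result as a starting point to benchmark, not a final answer.

```python
from typing import Optional, Tuple


def default_precision(capability: Optional[Tuple[int, int]]) -> str:
    """Return the table's default precision for a CUDA compute capability.

    Pass None when no CUDA GPU is available (CPU, Apple Silicon, edge).
    """
    if capability is None:
        return "GGUF INT4-INT8"          # llama.cpp path
    if capability >= (10, 0):
        return "FP8 or NVFP4"            # Blackwell
    if capability >= (8, 9):
        return "FP8"                     # Ada (8.9) and Hopper (9.0)
    if capability >= (8, 0):
        return "AWQ-INT4 + Marlin"       # Ampere (8.0, 8.6)
    # Pre-Ampere GPUs are outside the table, so refuse to guess.
    raise ValueError(f"compute capability {capability} is not covered")


if __name__ == "__main__":
    examples = [("A100", (8, 0)), ("L40S", (8, 9)), ("H100", (9, 0)),
                ("B200", (10, 0)), ("no CUDA GPU", None)]
    for name, capability in examples:
        print(f"{name}: {default_precision(capability)}")
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current device, so its result can be passed in directly.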