
Quantization

Reduce LLM memory requirements by 2-4x with quantization: understand the tradeoffs between GPTQ, GGUF, and AWQ, measure quality impact at different precision levels, and choose the right approach for your hardware and latency requirements.

Quick Reference

  • Quantization reduces model precision: FP16 (2 bytes) → INT8 (1 byte) → INT4 (0.5 bytes) per parameter
  • GPTQ: post-training quantization, GPU-optimized, best quality-to-compression ratio for GPU serving
  • GGUF: CPU-friendly format for llama.cpp/Ollama — ideal for local development, not production GPU serving
  • AWQ: activation-aware quantization preserves important weights at high precision — best quality at INT4
  • Quality impact: INT8 is near-lossless (<1% degradation), INT4 loses 3-8% on benchmarks depending on method

What Quantization Does

Quantization reduces the numerical precision of model weights. A standard FP16 model stores each parameter as a 16-bit floating point number (2 bytes). Quantizing to INT8 uses 1 byte, and INT4 uses 0.5 bytes. This directly reduces memory requirements (a 70B model goes from 140 GB to 70 GB at INT8 or 35 GB at INT4), enables serving on cheaper hardware, and often improves inference speed because less data needs to move through memory bandwidth.
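To make the mapping concrete, here is a toy sketch of symmetric per-tensor INT8 quantization in plain Python (function names are illustrative, not any library's API): each weight is rounded to one of 255 integer levels, and a single float scale maps the integers back to approximate weights.

```python
# Toy symmetric per-tensor INT8 quantization. Real quantizers work per-channel
# or per-group and calibrate on data, but the core idea is the same.

def quantize_int8(weights):
    """Map float weights to int8 values in [-127, 127] plus one scale."""
    scale = max(abs(w) for w in weights) / 127  # one scale for the whole tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.82, -1.30, 0.05, 2.41, -0.67]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
print(q)
print(max_err)  # rounding error is bounded by scale/2
```

Each stored value shrinks from 2 bytes to 1, at the cost of a per-weight rounding error no larger than half the scale, which is why INT8 is near-lossless in practice.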

| Precision | Bytes/Param | 70B Model Size | Quality Impact | Speed Impact |
|---|---|---|---|---|
| FP32 | 4 | 280 GB | Baseline (training precision) | Slowest (not used for inference) |
| FP16/BF16 | 2 | 140 GB | ~0% loss from FP32 | Baseline for inference |
| INT8 | 1 | 70 GB | <1% loss on most benchmarks | 1.2-1.5x faster than FP16 |
| INT4 (GPTQ) | 0.5 | 35 GB | 2-5% loss on reasoning tasks | 1.5-2x faster than FP16 |
| INT4 (AWQ) | 0.5 | 35 GB | 1-3% loss (better than GPTQ) | 1.5-2x faster than FP16 |
| INT3 | 0.375 | 26 GB | 5-15% loss, often too degraded | Fastest, but quality suffers |
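The size column follows directly from bytes per parameter. A quick sanity check (weights only, ignoring KV cache and activations, and using 1 GB = 10^9 bytes):

```python
# Weight-only memory footprint for a 70B-parameter model at each precision.
# KV cache and activations add more on top of these figures.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5, "INT3": 0.375}

def weight_memory_gb(n_params: float, precision: str) -> float:
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(70e9, precision):.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB, INT3: 26 GB
```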
Quantization improves speed because of memory bandwidth

Counterintuitively, lower precision is often faster even though the number of arithmetic operations is unchanged. The reason is memory bandwidth: during decoding, every generated token requires streaming all model weights from VRAM, so LLM inference is bandwidth-bound. A 70B model at INT4 (35 GB) moves a quarter as many bytes per token as at FP16 (140 GB); realized speedups are smaller than 4x (the 1.5-2x in the table above) because dequantizing weights on the fly adds compute overhead. Less data to move means faster generation.
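This gives a simple upper bound: a bandwidth-bound decoder can generate at most bandwidth / model-size tokens per second, since each token streams every weight once. The sketch below assumes ~2,000 GB/s of HBM bandwidth (roughly A100-class hardware, an illustrative figure, not a measurement); real throughput lands below these ceilings.

```python
# Bandwidth-bound ceiling on decode speed: each generated token reads all
# weights from VRAM once, so tokens/sec <= bandwidth / model size.
# The 2000 GB/s default is an assumed HBM figure, not a measured one.

def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float = 2000.0) -> float:
    return bandwidth_gb_s / model_size_gb

print(round(max_tokens_per_sec(140)))  # 70B FP16: ~14 tokens/sec ceiling
print(round(max_tokens_per_sec(35)))   # 70B INT4: ~57 tokens/sec ceiling
```

The 4x gap between the two ceilings is the theoretical best case; dequantization overhead is why measured INT4 speedups are closer to 1.5-2x.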