Quantization
Reduce LLM memory requirements by 2-4x with quantization: understand the tradeoffs between GPTQ, GGUF, and AWQ, measure quality impact at different precision levels, and choose the right approach for your hardware and latency requirements.
Quick Reference
- Quantization reduces model precision: FP16 (2 bytes) → INT8 (1 byte) → INT4 (0.5 bytes) per parameter
- GPTQ: post-training quantization, GPU-optimized, best quality-to-compression ratio for GPU serving
- GGUF: CPU-friendly format for llama.cpp/Ollama — ideal for local development, not production GPU serving
- AWQ: activation-aware quantization preserves important weights at high precision — best quality at INT4
- Quality impact: INT8 is near-lossless (<1% degradation), INT4 loses 3-8% on benchmarks depending on method
What Quantization Does
Quantization reduces the numerical precision of model weights. A standard FP16 model stores each parameter as a 16-bit floating point number (2 bytes). Quantizing to INT8 uses 1 byte, and INT4 uses 0.5 bytes. This directly reduces memory requirements (a 70B model goes from 140 GB to 70 GB at INT8 or 35 GB at INT4), enables serving on cheaper hardware, and often improves inference speed because less data needs to move through memory bandwidth.
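The memory math above is simple enough to sketch directly. A minimal calculator (weights only — KV cache and activation memory add overhead on top, and names here are illustrative):

```python
# Approximate weight memory at each precision level.
# Covers weights only; KV cache and activations need additional memory.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight memory in GB for a model with the given parameter count."""
    return params_billions * BYTES_PER_PARAM[precision]

for p in ("fp16", "int8", "int4"):
    print(f"70B @ {p}: {weight_memory_gb(70, p):.0f} GB")
# 70B @ fp16: 140 GB, @ int8: 70 GB, @ int4: 35 GB
```

Because a billion parameters at 1 byte each is exactly 1 GB (using decimal GB), the arithmetic reduces to parameters-in-billions times bytes-per-parameter.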
| Precision | Bytes/Param | 70B Model Size | Quality Impact | Speed Impact |
|---|---|---|---|---|
| FP32 | 4 | 280 GB | Baseline (training precision) | Slowest (not used for inference) |
| FP16/BF16 | 2 | 140 GB | ~0% loss from FP32 | Baseline for inference |
| INT8 | 1 | 70 GB | <1% loss on most benchmarks | 1.2-1.5x faster than FP16 |
| INT4 (GPTQ) | 0.5 | 35 GB | 2-5% loss on reasoning tasks | 1.5-2x faster than FP16 |
| INT4 (AWQ) | 0.5 | 35 GB | 1-3% loss (better than GPTQ) | 1.5-2x faster than FP16 |
| INT3 | 0.375 | 26 GB | 5-15% loss — often too degraded | Fastest, but quality suffers |
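The quality numbers in the table come down to rounding error. A minimal sketch of symmetric per-tensor INT8 quantization shows the mechanism — note that production methods like GPTQ and AWQ are more sophisticated (per-group scales, activation-aware error correction), and the weight values here are made up for illustration:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: scale floats into the integer range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; rounding error is at most scale / 2 per weight."""
    return [v * scale for v in q]

# Illustrative weight values at a typical LLM magnitude (~1e-2)
weights = [0.031, -0.014, 0.0027, -0.045, 0.009]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight lands within half a quantization step of its original value, which is why INT8 (256 levels) is near-lossless while INT3 (8 levels) degrades noticeably.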
Counterintuitively, lower precision is often faster even though the number of arithmetic operations is unchanged. The reason is memory bandwidth: autoregressive LLM inference is bandwidth-bound — every generated token requires streaming the full set of weights from VRAM — so loading a 35 GB INT4 model is 2x faster than loading a 70 GB FP16 model. Less data to move means faster generation.
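This bandwidth argument yields a useful back-of-envelope throughput ceiling: if every decoded token streams all weights once, tokens/sec ≈ memory bandwidth ÷ model size. A sketch under those assumptions (batch size 1, ignoring KV-cache reads and compute; the 2000 GB/s figure is an assumed round number in the ballpark of a modern datacenter GPU's HBM bandwidth):

```python
# Bandwidth-bound decode ceiling: each token reads all weights from VRAM once,
# so throughput can't exceed bandwidth / model_size. Batch size 1; ignores
# KV-cache traffic and compute, so treat these as upper bounds, not predictions.

def max_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

HBM_BANDWIDTH_GB_S = 2000.0  # assumed round figure for a datacenter GPU

for name, size_gb in [("70B FP16", 140.0), ("70B INT8", 70.0), ("70B INT4", 35.0)]:
    ceiling = max_tokens_per_sec(size_gb, HBM_BANDWIDTH_GB_S)
    print(f"{name}: ~{ceiling:.0f} tok/s ceiling")
```

Halving the model size doubles the ceiling, which is exactly the 2x relationship between the FP16 and INT4 rows in the table above.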