LoRA & QLoRA
Parameter-efficient fine-tuning with LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA). How they work intuitively, why they need only 1-10% of the memory of full fine-tuning, how to choose hyperparameters (rank, alpha, target modules), and a complete configuration example with Hugging Face PEFT.
Quick Reference
- →LoRA: freeze base model weights, train small low-rank adapter matrices alongside them
- →Memory savings: LoRA trains well under 1% of the parameters of full fine-tuning (often ~0.1%), fitting large models on far fewer GPUs
- →QLoRA: quantize base model to 4-bit, add LoRA adapters in 16-bit -- fine-tune 70B models on a single 48GB GPU
- →Key hyperparameters: rank (r=8-64), alpha (2x rank), target modules (q_proj, v_proj minimum)
- →LoRA adapters are small files (10-100 MB) that can be swapped without reloading the base model
- →Quality is 95-99% of full fine-tuning for most tasks at a fraction of the cost
Why Parameter-Efficient Fine-Tuning?
Full fine-tuning updates every parameter in the model. For a 70B model, that means computing gradients for 70 billion parameters, requiring multiple A100 GPUs and hundreds of gigabytes of memory. LoRA dramatically reduces this by freezing the original weights and only training small adapter matrices. The key insight is that the weight changes during fine-tuning are low-rank -- they can be approximated by much smaller matrices without significant quality loss.
| Method | Trainable params (70B model) | GPU memory | GPUs needed | Quality vs full |
|---|---|---|---|---|
| Full fine-tuning | 70B (100%) | ~600 GB | 8x A100 80GB | 100% (baseline) |
| LoRA (r=16) | ~80M (0.1%) | ~160 GB | 2x A100 80GB | ~97-99% |
| QLoRA (r=16, 4-bit) | ~80M (0.1%) | ~48 GB | 1x A100 48GB | ~95-98% |
| Full fine-tuning (8B model) | 8B (100%) | ~80 GB | 1x A100 80GB | 100% (baseline) |
| QLoRA (8B, r=16, 4-bit) | ~20M (0.25%) | ~10 GB | 1x RTX 4090 | ~95-98% |
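The adapter counts in the table can be sanity-checked with back-of-envelope arithmetic. Each adapted d_out x d_in weight matrix gains r * (d_in + d_out) trainable parameters (A is r x d_in, B is d_out x r). The sketch below uses approximate Llama-3-70B shapes (hidden size 8192, 80 layers, 1024-dim grouped-query k/v projections) and targets the four attention projections; exact totals depend on the architecture and module list:

```python
# Back-of-envelope LoRA parameter count for a 70B-class model,
# targeting q/k/v/o projections with rank r=16.
r, layers, hidden, kv_dim = 16, 80, 8192, 1024

# (d_out, d_in) for each targeted projection in one decoder layer.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (kv_dim, hidden),   # grouped-query attention: smaller k/v
    "v_proj": (kv_dim, hidden),
    "o_proj": (hidden, hidden),
}

# A LoRA pair adds r * (d_in + d_out) params per adapted matrix.
per_layer = sum(r * (d_in + d_out) for d_out, d_in in shapes.values())
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable params")  # prints: 65.5M trainable params
```

Adding the FFN projections (gate_proj, up_proj, down_proj) roughly triples this, which is why the table's "~80M" figure is the right order of magnitude.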
QLoRA makes it possible to fine-tune a 70B parameter model on a single 48GB GPU, and an 8B model on a consumer card like a 24GB RTX 4090. The base model is quantized to 4-bit (NF4 format), cutting its memory footprint roughly 4x relative to 16-bit, while LoRA adapters are trained in 16-bit for quality. This democratized fine-tuning -- you no longer need a GPU cluster.
How LoRA Works (Intuitive Explanation)
In a transformer, most of the parameters are in large weight matrices (e.g., 4096x4096 for attention projections). During fine-tuning, the change to these weights (delta_W) tends to be low-rank -- it can be decomposed into two much smaller matrices. LoRA exploits this by representing the fine-tuned weight as W + delta_W = W + B * A, where B is (d x r) and A is (r x d), with r << d.
- ▸B is initialized to zero, so the LoRA adaptation starts as an identity function (no change to base model)
- ▸During training, only A and B are updated. The original weights W are completely frozen
- ▸The scaling factor (alpha/rank) controls how much the adapter influences the output
- ▸After training, LoRA weights can be merged into the base model for zero-overhead inference: W_new = W + (alpha/r) * B @ A
- ▸Multiple LoRA adapters can be trained for different tasks and swapped at runtime without reloading the base model
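The properties above can be verified numerically with a toy example (dimensions are illustrative; real models use d around 4096 and r around 16):

```python
import numpy as np

d, r = 8, 2
alpha = 4  # effective scale is alpha / r

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trained, small random init
B = np.zeros((d, r))                 # trained, zero init

x = rng.normal(size=(d,))

# With B = 0, the adapted layer is identical to the base layer.
y_base = W @ x
y_lora = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(y_base, y_lora)

# After training, B is nonzero; merging the adapter into W gives the
# same outputs with zero extra cost at inference time.
B = rng.normal(size=(d, r))
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

Note the parameter savings even in this toy case: A and B together hold 2 * r * d = 32 values versus d * d = 64 in W; at d=4096, r=16 the ratio is far more dramatic.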
Choosing LoRA Hyperparameters
| Hyperparameter | Typical range | Default recommendation | Effect of increasing |
|---|---|---|---|
| rank (r) | 4-256 | 16-32 | More expressive but more memory and risk of overfitting |
| alpha | 8-128 | 2x rank (r=16 -> alpha=32) | Scales the LoRA contribution. Higher = stronger adaptation |
| target modules | Varies by model | q_proj, v_proj, k_proj, o_proj | More modules = more expressive but more memory |
| dropout | 0.0-0.1 | 0.05 | Regularization against overfitting |
| learning rate | 1e-5 to 5e-4 | 2e-4 for QLoRA | Higher = faster learning but risk of instability |
For simple tasks (classification, format adaptation): r=8 is usually sufficient. For moderate tasks (style transfer, domain adaptation): r=16-32 works well. For complex tasks (learning new capabilities): r=64-128 may be needed. Start low and increase only if quality is insufficient -- lower rank trains faster and generalizes better.
Targeting only q_proj and v_proj (the minimum) saves memory but limits expressiveness. For most tasks, also include k_proj, o_proj, and the FFN projections (gate_proj, up_proj, down_proj). The difference between targeting 2 modules and 7 modules is typically 2-5% quality improvement for 3x more trainable parameters -- usually a worthwhile trade-off.
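These recommendations translate into a PEFT configuration along the following lines (a sketch: the model name is illustrative, and target_modules should match your architecture's layer names):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                         # rank: start at 16-32
    lora_alpha=32,                # convention: 2x rank
    lora_dropout=0.05,
    target_modules=[              # all attention + FFN projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs total params
</imports>
```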
QLoRA: Quantized Base + LoRA Adapters
QLoRA combines two ideas: (1) quantize the base model to 4-bit precision (NF4 format), reducing memory by 4x, and (2) add LoRA adapters in higher precision (16-bit) for training quality. The result is fine-tuning a 70B model on a single 48GB GPU -- previously impossible without a GPU cluster.
4-bit quantization reduces the base model quality by roughly 1-3% on most benchmarks. The LoRA fine-tuning typically recovers this gap and then some, because it is optimizing for your specific task. In practice, QLoRA fine-tuned models are within 1-2% of full fine-tuning quality on most tasks -- a negligible difference for the massive memory savings.
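A typical QLoRA setup wires these two ideas together via bitsandbytes quantization plus a PEFT adapter (a sketch, assuming a CUDA GPU with enough memory for the 4-bit weights; the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# (1) Quantize the frozen base model to 4-bit NF4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Required for k-bit training: gradient checkpointing, layer norm
# casting, and input embedding gradients.
model = prepare_model_for_kbit_training(model)

# (2) Add 16-bit LoRA adapters on top of the 4-bit base.
model = get_peft_model(model, LoraConfig(
    task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))
```

From here the model trains like any other with the Hugging Face Trainer, keeping in mind the higher learning rate (around 2e-4) that QLoRA typically needs.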
Managing LoRA Adapters in Production
One of LoRA's practical advantages is that adapters are small, independent files that can be managed separately from the base model. This enables powerful patterns in production.
- ▸Adapter files are typically 10-100 MB, compared to 4-16 GB for a full model. Easy to version, store, and deploy
- ▸Multiple adapters can share one base model: customer-support-v1, legal-v1, medical-v1 all on the same Llama 3 instance
- ▸Hot-swapping: load a different adapter without restarting the server or reloading the base model (supported by vLLM, TGI)
- ▸A/B testing: serve different adapter versions to different user segments
- ▸Merge for production: for maximum inference speed, merge the adapter into the base model (eliminates the extra matrix multiply)
Treat LoRA adapters like model artifacts: version them with the training data version, hyperparameters, and evaluation metrics. Store them in a model registry (MLflow, Hugging Face Hub, S3 with versioning). This lets you roll back to a previous adapter version if a new one degrades quality.
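The swap-and-merge patterns above look roughly like this with PEFT (a sketch: the adapter repository names are hypothetical placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Attach one adapter, then load more onto the same base model.
model = PeftModel.from_pretrained(
    base, "my-org/customer-support-v1", adapter_name="support"
)
model.load_adapter("my-org/legal-v1", adapter_name="legal")

# Hot-swap the active adapter without reloading the base model.
model.set_adapter("legal")

# For production: fold the active adapter into the base weights,
# eliminating the extra matrix multiply at inference time.
merged = model.merge_and_unload()
merged.save_pretrained("llama3-8b-legal-merged")
```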
Best Practices
Do
- ✓Start with QLoRA for cost-effective experimentation -- fine-tune 70B models on a single GPU
- ✓Use rank 16-32 as a default and only increase if quality is insufficient
- ✓Target all attention projections plus FFN projections for best quality
- ✓Version and store adapters separately from base models in a model registry
- ✓Merge adapters for production deployment when you need maximum inference speed
Don’t
- ✗Don't start with full fine-tuning -- LoRA achieves 95-99% of the quality at a fraction of the cost
- ✗Don't use very high rank (r>128) without evidence that lower rank is insufficient -- it wastes memory and risks overfitting
- ✗Don't skip the prepare_model_for_kbit_training step with QLoRA -- it handles critical gradient checkpointing setup
- ✗Don't assume LoRA adapters trained on one base model version work with another -- they are tied to the specific base model
- ✗Don't ignore learning rate -- QLoRA typically needs higher learning rates (2e-4) than full fine-tuning (2e-5)
Key Takeaways
- ✓LoRA trains small adapter matrices (0.1-1% of parameters) while freezing the base model, achieving 95-99% of full fine-tuning quality.
- ✓QLoRA combines 4-bit base model quantization with 16-bit LoRA adapters, enabling 70B model fine-tuning on a single GPU.
- ✓Key hyperparameters: rank 16-32, alpha 2x rank, target all attention + FFN projections for best results.
- ✓LoRA adapters are small (10-100 MB), versionable, and hot-swappable -- enabling multi-task deployment from one base model.
- ✓For production inference, merge adapters into the base model to eliminate the LoRA overhead.