LLM Foundations/Fine-Tuning
Advanced · 11 min

LoRA & QLoRA

Parameter-efficient fine-tuning with LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA). How they work intuitively, why they need only 1-10% of the memory of full fine-tuning, how to choose hyperparameters (rank, alpha, target modules), and a complete configuration example with Hugging Face PEFT.

Quick Reference

  • LoRA: freeze base model weights, train small low-rank adapter matrices alongside them
  • Memory savings: LoRA uses 1-10% of the parameters of full fine-tuning, fitting large models on fewer GPUs
  • QLoRA: quantize base model to 4-bit, add LoRA adapters in 16-bit -- fine-tune 70B models on a single 48GB GPU
  • Key hyperparameters: rank (r=8-64), alpha (2x rank), target modules (q_proj, v_proj minimum)
  • LoRA adapters are small files (10-100 MB) that can be swapped without reloading the base model
  • Quality is 95-100% of full fine-tuning for most tasks at a fraction of the cost

Why Parameter-Efficient Fine-Tuning?

Full fine-tuning updates every parameter in the model. For a 70B model, that means computing gradients for 70 billion parameters, requiring multiple A100 GPUs and hundreds of gigabytes of memory. LoRA dramatically reduces this by freezing the original weights and only training small adapter matrices. The key insight is that the weight changes during fine-tuning are low-rank -- they can be approximated by much smaller matrices without significant quality loss.

| Method | Trainable params | GPU memory | GPUs needed | Quality vs full |
|---|---|---|---|---|
| Full fine-tuning (70B) | 70B (100%) | ~600 GB | 8x A100 80GB | 100% (baseline) |
| LoRA (70B, r=16) | ~80M (0.1%) | ~160 GB | 2x A100 80GB | ~97-99% |
| QLoRA (70B, r=16, 4-bit) | ~80M (0.1%) | ~48 GB | 1x 48GB GPU (e.g., RTX A6000) | ~95-98% |
| Full fine-tuning (8B) | 8B (100%) | ~80 GB | 1x A100 80GB | 100% (baseline) |
| QLoRA (8B, r=16, 4-bit) | ~20M (0.25%) | ~10 GB | 1x RTX 4090 24GB | ~95-98% |
QLoRA is a game-changer

QLoRA makes it possible to fine-tune a 70B parameter model on a single 48GB GPU, and an 8B model on a consumer card like the RTX 4090 (24GB). The base model is quantized to 4-bit (NF4 format), cutting its memory footprint to a quarter of 16-bit, while LoRA adapters are trained in 16-bit for quality. This democratized fine-tuning -- you no longer need a GPU cluster.

How LoRA Works (Intuitive Explanation)

In a transformer, most of the parameters are in large weight matrices (e.g., 4096x4096 for attention projections). During fine-tuning, the change to these weights (delta W) tends to be low-rank -- it can be decomposed into two much smaller matrices. LoRA exploits this by representing the weight change as W + delta_W = W + B * A, where B is (d x r) and A is (r x d), with r << d.

LoRA explained with code
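The decomposition above can be sketched in a few lines of NumPy. This is a toy illustration: the dimensions, initialization scale, and values are made up, and a real implementation would use a deep-learning framework with autograd.

```python
import numpy as np

# Toy dimensions -- a real attention projection might be d=4096, r=16
d, r, alpha = 64, 4, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init

def lora_forward(x):
    # y = x @ (W + (alpha/r) * B @ A).T, without materializing the sum
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d))
# B starts at zero, so the adapted layer initially matches the base layer
assert np.allclose(lora_forward(x), x @ W.T)

# Merging for inference: fold the adapter into W once training is done
W_merged = W + (alpha / r) * B @ A
assert np.allclose(x @ W_merged.T, lora_forward(x))
```

Note that the forward pass multiplies by A and B separately (two thin matmuls) rather than forming the full d x d update, which is what keeps training memory low.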
  • B is initialized to zero, so the adapter's contribution starts at zero -- initially the model behaves exactly like the base model
  • During training, only A and B are updated. The original weights W are completely frozen
  • The scaling factor (alpha/rank) controls how much the adapter influences the output
  • After training, LoRA weights can be merged into the base model for zero-overhead inference: W_new = W + (alpha/r) * B @ A
  • Multiple LoRA adapters can be trained for different tasks and swapped at runtime without reloading the base model

Choosing LoRA Hyperparameters

| Hyperparameter | Typical range | Default recommendation | Effect of increasing |
|---|---|---|---|
| rank (r) | 4-256 | 16-32 | More expressive but more memory and risk of overfitting |
| alpha | 8-128 | 2x rank (r=16 -> alpha=32) | Scales the LoRA contribution. Higher = stronger adaptation |
| target modules | Varies by model | q_proj, v_proj, k_proj, o_proj | More modules = more expressive but more memory |
| dropout | 0.0-0.1 | 0.05 | Regularization against overfitting |
| learning rate | 1e-5 to 5e-4 | 2e-4 for QLoRA | Higher = faster learning but risk of instability |
Rank selection guide

For simple tasks (classification, format adaptation): r=8 is usually sufficient. For moderate tasks (style transfer, domain adaptation): r=16-32 works well. For complex tasks (learning new capabilities): r=64-128 may be needed. Start low and increase only if quality is insufficient -- lower rank trains faster and generalizes better.
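A back-of-envelope count shows how rank drives the trainable-parameter budget. The hidden size, layer count, and "all projections are square" premise below are simplifying assumptions for a Llama-8B-style model (in reality, k_proj and v_proj are smaller under grouped-query attention):

```python
# Rough LoRA parameter count for a hypothetical 4096-wide, 32-layer model
hidden, n_layers, matrices_per_layer = 4096, 32, 4  # q, k, v, o projections

def lora_params(r):
    # Each adapted (d x d) matrix adds B (d x r) + A (r x d) = 2*d*r weights
    return n_layers * matrices_per_layer * 2 * hidden * r

for r in (8, 16, 64):
    print(f"r={r}: {lora_params(r) / 1e6:.1f}M trainable params")
# r=8:  8.4M   r=16: 16.8M   r=64: 67.1M
```

Even at r=64 this is well under 1% of an 8B-parameter model, which is why doubling the rank is cheap to try when quality falls short.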

Complete LoRA configuration with Hugging Face PEFT
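A configuration along these lines uses the Hugging Face PEFT library. The checkpoint name is a placeholder -- substitute the base model you are fine-tuning -- and running it requires downloading the model weights:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint -- swap in your own base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                       # rank of the adapter matrices
    lora_alpha=32,              # 2x rank, per the guideline above
    target_modules=[            # attention + FFN projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs total parameter count
```

The wrapped model then plugs directly into a standard `Trainer` loop; only the A and B matrices receive gradients.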
Target module selection matters

Targeting only q_proj and v_proj (the minimum) saves memory but limits expressiveness. For most tasks, also include k_proj, o_proj, and the FFN projections (gate_proj, up_proj, down_proj). The difference between targeting 2 modules and 7 modules is typically 2-5% quality improvement for 3x more trainable parameters -- usually a worthwhile trade-off.

QLoRA: Quantized Base + LoRA Adapters

QLoRA combines two ideas: (1) quantize the base model to 4-bit precision (NF4 format), reducing memory by 4x, and (2) add LoRA adapters in higher precision (16-bit) for training quality. The result is fine-tuning a 70B model on a single 48GB GPU -- previously impossible without a GPU cluster.

QLoRA setup with bitsandbytes
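A typical setup combines a `BitsAndBytesConfig` for the 4-bit base with a LoRA config for the adapters. The checkpoint name is a placeholder, and this requires a CUDA GPU with the bitsandbytes library installed:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 format from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in 16-bit despite 4-bit storage
)

# Placeholder checkpoint -- swap in your own base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Required for k-bit training: enables gradient checkpointing and
# casts layer norms / the LM head appropriately
model = prepare_model_for_kbit_training(model)

model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))
```

Only the base weights are stored in 4-bit; the LoRA adapters and all gradients stay in 16-bit, which is where the training quality comes from.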
Quality impact of quantization

4-bit quantization reduces the base model quality by roughly 1-3% on most benchmarks. The LoRA fine-tuning typically recovers this gap and then some, because it is optimizing for your specific task. In practice, QLoRA fine-tuned models are within 1-2% of full fine-tuning quality on most tasks -- a negligible difference for the massive memory savings.

Managing LoRA Adapters in Production

One of LoRA's practical advantages is that adapters are small, independent files that can be managed separately from the base model. This enables powerful patterns in production.

  • Adapter files are typically 10-100 MB, compared to 4-16 GB for a full model. Easy to version, store, and deploy
  • Multiple adapters can share one base model: customer-support-v1, legal-v1, medical-v1 all on the same Llama 3 instance
  • Hot-swapping: load a different adapter without restarting the server or reloading the base model (supported by vLLM, TGI)
  • A/B testing: serve different adapter versions to different user segments
  • Merge for production: for maximum inference speed, merge the adapter into the base model (eliminates the extra matrix multiply)
Loading and swapping LoRA adapters
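With PEFT, the patterns above look roughly like this. The checkpoint and adapter paths are placeholders for your own registry layout:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder names -- substitute your base model and adapter paths
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(
    base, "adapters/customer-support-v1", adapter_name="support"
)

# Load additional adapters onto the same base, then switch between them
model.load_adapter("adapters/legal-v1", adapter_name="legal")
model.set_adapter("legal")    # subsequent generate() calls use the legal adapter
model.set_adapter("support")  # swap back without reloading the base model

# For production: fold the active adapter into the base weights
merged = model.merge_and_unload()
merged.save_pretrained("models/customer-support-v1-merged")
```

Serving frameworks such as vLLM and TGI expose the same idea at the request level, routing each request to a named adapter over one shared base model.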
Version your adapters

Treat LoRA adapters like model artifacts: version them with the training data version, hyperparameters, and evaluation metrics. Store them in a model registry (MLflow, Hugging Face Hub, S3 with versioning). This lets you roll back to a previous adapter version if a new one degrades quality.

Best Practices

Do

  • Start with QLoRA for cost-effective experimentation -- fine-tune 70B models on a single GPU
  • Use rank 16-32 as a default and only increase if quality is insufficient
  • Target all attention projections plus FFN projections for best quality
  • Version and store adapters separately from base models in a model registry
  • Merge adapters for production deployment when you need maximum inference speed

Don’t

  • Don't start with full fine-tuning -- LoRA achieves 95-99% of the quality at a fraction of the cost
  • Don't use very high rank (r>128) without evidence that lower rank is insufficient -- it wastes memory and risks overfitting
  • Don't skip the prepare_model_for_kbit_training step with QLoRA -- it handles critical gradient checkpointing setup
  • Don't assume LoRA adapters trained on one base model version work with another -- they are tied to the specific base model
  • Don't ignore learning rate -- QLoRA typically needs higher learning rates (2e-4) than full fine-tuning (2e-5)

Key Takeaways

  • LoRA trains small adapter matrices (0.1-1% of parameters) while freezing the base model, achieving 95-99% of full fine-tuning quality.
  • QLoRA combines 4-bit base model quantization with 16-bit LoRA adapters, enabling 70B model fine-tuning on a single GPU.
  • Key hyperparameters: rank 16-32, alpha 2x rank, target all attention + FFN projections for best results.
  • LoRA adapters are small (10-100 MB), versionable, and hot-swappable -- enabling multi-task deployment from one base model.
  • For production inference, merge adapters into the base model to eliminate the LoRA overhead.

Video on this topic

Fine-tune a 70B model on one GPU with QLoRA
