LoRA & QLoRA
LoRA fine-tunes large models by training tiny adapter matrices instead of all weights. This article covers when to use it (and when not to), the math behind it, QLoRA's memory derivation, DoRA and Unsloth as 2026 best practices, and the failure modes — catastrophic forgetting, safety degradation, overfitting — that most tutorials skip.
Quick Reference
- →LoRA freezes base weights and trains two small matrices (B: d×r, A: r×d) where r << d
- →Trainable params: for d=4096, r=16 → A+B = 131K vs W = 16.7M (0.78% of one layer)
- →QLoRA memory derivation: 70B params × 0.5 bytes/param (4-bit NF4) ≈ 35 GB base + ~13 GB for LoRA adapters, optimizer state, activations, and headroom
- →DoRA (use_dora=True) decomposes updates into magnitude + direction; it outperforms vanilla LoRA at the same rank in the paper's benchmarks and adds no inference cost once merged
- →target_modules='all-linear' applies LoRA to every linear layer except the output head, which is simpler than listing layers by name
- →Unsloth backend: roughly 2× faster training and 60–70% less VRAM than standard PEFT, with no accuracy loss because the kernels are exact rather than approximate
- →Catastrophic forgetting still occurs with LoRA — higher rank and more update steps = more forgetting
- →Adapter files are 10–300 MB; merge into base model for production inference to eliminate the extra matmul
When NOT to Fine-Tune with LoRA
Most teams that reach for LoRA should try something cheaper first. Fine-tuning is expensive to get right: you need training data, GPU hours, evaluation infrastructure, and ongoing maintenance as the base model updates. These alternatives are faster and often sufficient.
| If your problem is… | Try first | Use LoRA when… |
|---|---|---|
| Output format (JSON, markdown, XML) | Structured output / JSON mode | The model consistently ignores schema constraints even with examples |
| Tone or style | System prompt + few-shot examples | You have 500+ labeled examples and the style is very specific |
| Domain knowledge gaps | RAG with up-to-date retrieval | Knowledge is stable, retrieval latency is unacceptable, or context is always the same |
| Task the model already does | Prompt engineering, few-shot | You've hit the ceiling of what prompting can deliver |
| Safety / refusals | Guardrail layer or system prompt | Almost never — LoRA can remove safety guardrails (see failure modes) |
The most common mistake: teams assume that if they fine-tune, the model will be better. Fine-tuning on low-quality data produces a model that confidently generates low-quality output. Build your evaluation harness before collecting training data, not after.
LoRA is the right tool when: you have 500–50,000 high-quality labeled examples, a clear task with measurable quality, and you've confirmed that prompting alone doesn't reach your quality threshold. If any of those three conditions is missing, start elsewhere.
How LoRA Works
In a transformer, most parameters live in large weight matrices — for example, a 4096×4096 attention projection. During fine-tuning, the change to these weights (ΔW) tends to be low-rank: it can be approximated by two much smaller matrices without significant quality loss. LoRA exploits this by representing the update as W + ΔW = W + B·A, where B is (d×r) and A is (r×d), with r significantly smaller than d.
Frozen W (large) + trainable B·A (tiny): zero inference overhead when merged
- ▸B is initialized to zero → ΔW = B·A starts at zero, so the adapter is a no-op and the base model output is unchanged
- ▸Only A and B are updated during training; original weights W are completely frozen
- ▸scaling = alpha/rank controls adapter strength. alpha=2×rank is the common default
- ▸After training, merge for zero-overhead inference: W_merged = W + (alpha/r) × B @ A
- ▸Multiple adapters can share one base model and be swapped at runtime
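The bullets above can be checked numerically. What follows is a minimal pure-Python sketch, not a training implementation: the toy sizes (d=8, r=2, alpha=4), the Gaussian initialization of A, and the helper names are all assumptions chosen for illustration. It shows that a zero-initialized B leaves the output unchanged, and that the merged weight gives the same output as the unmerged two-path forward.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    return [[s * a for a in row] for row in X]

def lora_param_count(d, r):
    """Trainable parameters for one square d x d layer: B (d x r) plus A (r x d)."""
    return d * r + r * d

random.seed(0)
d, r, alpha = 8, 2, 4            # toy sizes (assumed); the article uses d=4096, r=16
scaling = alpha / r

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # r x d, small random init
B = [[0.0] * r for _ in range(d)]                                   # d x r, zero init
x = [[random.gauss(0, 1)] for _ in range(d)]                        # input as a column vector

def forward(W, A, B, x):
    # Unmerged path: base output plus the scaled low-rank correction.
    return add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

base_out = matmul(W, x)
start_out = forward(W, A, B, x)   # B is zero, so this equals base_out

# Stand-in for training: move B away from zero.
B_trained = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
W_merged = add(W, scale(matmul(B_trained, A), scaling))   # W + (alpha/r) * B @ A
unmerged_out = forward(W, A, B_trained, x)
merged_out = matmul(W_merged, x)
```

After merging, inference uses a single d×d matmul, which is why the merged path has no adapter overhead.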
QLoRA: 4-Bit Base + 16-Bit Adapters
QLoRA (Dettmers et al., 2023) combines two ideas: quantize the base model to 4-bit NF4 format (reducing its memory 4×), and add LoRA adapters in BF16 for training quality. The result: fine-tuning a 70B model on a single 80GB GPU.
QLoRA cuts memory 12× vs full fine-tuning by quantizing the frozen base weights and keeping gradients and optimizer state only for the adapters
The memory math: a 70B model in BF16 requires 70B × 2 bytes = 140 GB. At 4-bit (0.5 bytes/param), that drops to 35 GB. Add LoRA adapter gradients (~0.5 GB), Adam optimizer states for the adapters (~1 GB), and activations (~8 GB with gradient checkpointing), and the itemized total is about 44.5 GB. Allow a few more GB for quantization constants, the CUDA context, and allocator fragmentation, and a working budget of ~48 GB still fits on one A100 80GB.
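The arithmetic above is easy to reproduce. This is a small sketch that uses only the figures quoted in the paragraph; the function and variable names are illustrative. The itemized terms sum to 44.5 GB, so the ~48 GB figure leaves a few GB of headroom.

```python
GB = 1e9

def base_weight_gb(n_params, bits_per_param):
    """Memory for the base weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / GB

n_params = 70e9
bf16_gb = base_weight_gb(n_params, 16)   # 2 bytes/param
nf4_gb = base_weight_gb(n_params, 4)     # 0.5 bytes/param

# Overheads as quoted in the text (approximate, workload-dependent).
overheads_gb = {
    "adapter_gradients": 0.5,
    "adam_states": 1.0,
    "activations_with_checkpointing": 8.0,
}
itemized_total_gb = nf4_gb + sum(overheads_gb.values())
headroom_on_a100_gb = 80 - itemized_total_gb
```

Activation memory is the term that moves most in practice, since it grows with batch size and sequence length.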
NF4 (NormalFloat4) uses quantile-based buckets matched to the normal distribution of typical weight values, whereas INT4 uses uniform buckets. NF4 loses less signal for normally-distributed weights. Double quantization (bnb_4bit_use_double_quant=True) further quantizes the quantization constants themselves, saving roughly 0.4 bits/param (about 0.37 on average in the QLoRA paper).
Before attaching adapters to a quantized model, call PEFT's prepare_model_for_kbit_training. It enables gradient checkpointing and casts the remaining non-quantized parameters (embeddings, layer norms) to a dtype that is safe for training. Skipping it produces incorrect gradients or OOM errors that are hard to diagnose: the training loop often appears to start but produces garbage loss values.
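The quantized load and preparation step described above might look like the following. This is a hedged configuration sketch using the Hugging Face transformers and PEFT APIs; the model ID is an assumption (substitute your own), and it needs a CUDA GPU with bitsandbytes installed to actually run.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

MODEL_ID = "meta-llama/Llama-3.1-8B"  # assumed example; any causal LM ID works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 buckets instead of uniform
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Must run before adapters are attached to the quantized model.
model = prepare_model_for_kbit_training(model)
```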
Hyperparameters That Actually Matter
| Hyperparameter | Typical range | Start here | Effect of increasing |
|---|---|---|---|
| rank (r) | 4–128 | 16 | More expressive; more memory; more forgetting risk |
| alpha | 8–128 | 2× rank (r=16 → alpha=32) | Scales adapter strength. Higher = stronger push away from base |
| target_modules | Specific layers or 'all-linear' | 'all-linear' | More modules = more expressive, more params |
| dropout | 0.0–0.1 | 0.05 | Regularization. Higher helps small datasets |
| learning rate | 1e-5 to 5e-4 | 2e-4 for QLoRA | Higher = faster but risks instability |
r=8–16 for format adaptation, style transfer, or instruction following. r=32–64 for domain adaptation where the distribution shift is significant. r=128 only if lower ranks provably underfit on your eval set — higher rank increases forgetting. Start low and increase only if quality is insufficient.
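Rank also drives adapter size linearly, which helps when budgeting storage and memory. The sketch below estimates trainable parameters as r × (d_in + d_out) summed over every adapted linear layer. The per-block shapes are an assumption modeled on a Llama-3.1-8B-like architecture (hidden 4096, KV projection width 1024, MLP width 14336, 32 blocks), and the 2 bytes/param figure assumes the adapter is saved in BF16.

```python
def lora_params(r, shapes):
    """Sum r * (d_in + d_out) over every adapted linear layer."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Assumed per-block linear shapes (d_in, d_out) for a Llama-3.1-8B-like model.
hidden, kv_dim, mlp, n_layers = 4096, 1024, 14336, 32
block = [
    (hidden, hidden),   # q_proj
    (hidden, kv_dim),   # k_proj
    (hidden, kv_dim),   # v_proj
    (hidden, hidden),   # o_proj
    (hidden, mlp),      # gate_proj
    (hidden, mlp),      # up_proj
    (mlp, hidden),      # down_proj
]

def adapter_megabytes(r, bytes_per_param=2):
    """Approximate adapter file size in MB for rank r (BF16 by default)."""
    return lora_params(r, block * n_layers) * bytes_per_param / 1e6

sizes_mb = {r: round(adapter_megabytes(r), 1) for r in (8, 16, 32)}
# sizes_mb == {8: 41.9, 16: 83.9, 32: 167.8}
```

Doubling the rank doubles both the trainable parameter count and the adapter file size, so a move from r=16 to r=32 should be justified by a measurable gain on the eval set.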
In 2026, the recommended starting config includes two changes from earlier guidance: use target_modules='all-linear' instead of manually listing layer names (it's architecture-agnostic and covers every linear layer except the output head), and consider use_dora=True for better quality at the same parameter budget.
DoRA (Weight-Decomposed Low-Rank Adaptation, ICML 2024 oral) decomposes each weight update into a magnitude component and a direction component, handling them separately. LoRA tends to either increase or decrease both proportionally; DoRA can make subtle directional adjustments while keeping magnitude stable. In the paper's experiments it consistently outperforms vanilla LoRA across commonsense reasoning, visual instruction tuning, and other benchmarks at the same rank, and it merges into the base model at inference so there's no runtime cost.
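Put together, the starting values from the table and the two recommendations above map onto a PEFT `LoraConfig` roughly as follows. This is a configuration sketch, assuming a reasonably recent PEFT release (both `"all-linear"` and `use_dora` are newer options) and a causal language model.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # rank: start low, raise only with eval evidence
    lora_alpha=32,                # 2 x rank, so scaling = alpha / r = 2
    lora_dropout=0.05,            # light regularization for small datasets
    target_modules="all-linear",  # every linear layer except the output head
    use_dora=True,                # magnitude/direction decomposition
    task_type="CAUSAL_LM",
)
```

The config is then attached with `peft.get_peft_model(model, lora_config)`. DoRA adds some training-time overhead relative to plain LoRA, so it is worth timing one run with and one without it.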
Production Training Pipeline
The standard 2026 training stack for LoRA/QLoRA is Unsloth + HuggingFace TRL. Unsloth rewrites PyTorch attention and MLP kernels in Triton and reports 2–2.7× faster training and 60–70% less VRAM than standard PEFT, with no accuracy loss because the kernels are exact rather than approximate.
Lambda Labs A100 80GB spot: ~$1.50/hr. Fine-tuning Llama 3.1 8B on 10,000 examples for 3 epochs with Unsloth takes roughly 2–3 hours = $3–5. The same run without Unsloth takes 4–6 hours = $6–9. For 70B with QLoRA (single A100): ~8–12 hours = $12–18. These estimates assume batch_size=4, grad_accum=4, sequence_length=2048. Your costs scale with sequence length.
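The estimates above reduce to two small calculations: how many optimizer steps the run takes, and hours times hourly rate. A minimal sketch using the figures quoted in the paragraph, assuming a single GPU (so the effective batch is batch_size × grad_accum):

```python
import math

def optimizer_steps(n_examples, epochs, batch_size, grad_accum):
    """Optimizer updates for a single-GPU run."""
    effective_batch = batch_size * grad_accum
    return epochs * math.ceil(n_examples / effective_batch)

def run_cost(hours, dollars_per_hour):
    return hours * dollars_per_hour

steps = optimizer_steps(10_000, epochs=3, batch_size=4, grad_accum=4)
cost_8b_unsloth = (run_cost(2, 1.50), run_cost(3, 1.50))    # (3.0, 4.5)
cost_70b_qlora = (run_cost(8, 1.50), run_cost(12, 1.50))    # (12.0, 18.0)
```

The step count (1,875 here) is also the quantity that matters for forgetting, since forgetting grows with the number of updates.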
Add an eval gate after each epoch: run your evaluation set and halt training if your target metric (task accuracy, ROUGE, or a custom LLM-as-judge score) doesn't improve for 2 consecutive epochs. Checkpointing the best model by eval loss alone is insufficient — validation loss and task accuracy can diverge.
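One way to express that gate is a small patience counter keyed on the task metric rather than on loss. The class name, the `min_delta` parameter, and the accuracy history below are hypothetical; in a real loop the metric would come from running your evaluation set after each epoch.

```python
class EvalGate:
    """Halt training when the task metric has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = None
        self.stale = 0

    def update(self, epoch, metric):
        """Record one epoch's metric; return True if training should halt."""
        if metric > self.best + self.min_delta:
            self.best, self.best_epoch, self.stale = metric, epoch, 0
        else:
            self.stale += 1
        return self.stale >= self.patience

gate = EvalGate(patience=2)
history = [0.71, 0.78, 0.77, 0.78, 0.80]  # hypothetical task accuracy per epoch
stopped_at = None
for epoch, accuracy in enumerate(history, start=1):
    if gate.update(epoch, accuracy):
        stopped_at = epoch
        break
```

In this trace training halts after epoch 4 and the checkpoint to keep is the one from `gate.best_epoch` (epoch 2). A tie does not count as an improvement.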
How LoRA Fine-Tuning Fails
A team fine-tuned Llama 2 70B Chat on an adversarial dataset over a weekend. By Monday, all safety guardrails were gone: the model answered every harmful request that it had previously refused. Freezing the base weights does not protect the model's behavior. The adapter shifts the effective weights W + B·A, so LoRA limits only how many parameters change, not what the model can learn. If your training data contains patterns that override safety tuning, LoRA will learn them just as effectively as full fine-tuning.
Safety fine-tuning is fragile. Always red-team your fine-tuned model before deployment, even if the base model passed all safety evals.
| Failure mode | Symptom | Cause | Mitigation |
|---|---|---|---|
| Catastrophic forgetting | Fine-tune on medical QA → performance on general reasoning drops 10–15% | Higher rank + more update steps = more forgetting; power-law scaling with parameters fine-tuned | Use lower rank, fewer epochs, eval on held-out general tasks |
| Task overfitting | Loss reaches 0.1 on train, 2.3 on eval | Too few training examples, too many epochs, or data leaked into train | Add dropout (0.05–0.1), reduce epochs, verify train/eval split |
| Safety degradation | Model stops refusing harmful requests after fine-tuning | Fine-tuning overrides alignment tuning — even a small adversarial dataset suffices | Red-team after every fine-tune; use safety classifiers on outputs |
| Base model mismatch | Adapter trained on v1.0 produces garbage on v1.1 | LoRA adapters are tied to exact base model weights — any update breaks compatibility | Pin base model version; retrain adapter if base model is updated |
| Rank too high for dataset | Good loss but worse than base model on your task | High rank with few examples memorizes noise rather than learning signal | Start at r=4 or r=8 and increase only with evidence |
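For the "verify train/eval split" mitigation in the table, a cheap first check is to hash normalized examples and look for overlap. This is a sketch with hypothetical example data. It only catches exact duplicates up to case and whitespace; paraphrased leaks need fuzzy or embedding-based matching.

```python
import hashlib

def fingerprint(text):
    """Hash a whitespace- and case-normalized copy of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def leaked_examples(train_texts, eval_texts):
    """Return eval examples whose normalized form also appears in train."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if fingerprint(t) in train_hashes]

train = ["What is the dose of drug X?", "Define hypertension."]
evals = ["define   Hypertension.", "List symptoms of anemia."]
leaks = leaked_examples(train, evals)
```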
Research on forgetting scaling laws (Kalajdzievski, 2024) shows forgetting increases as a shifted power law with the number of parameters fine-tuned and the number of update steps. Biderman et al. (2024) find that LoRA forgets less than full fine-tuning but does not eliminate forgetting. For LoRA specifically, higher rank = more parameters = more forgetting. If you care about keeping general capability, track performance on a held-out general benchmark alongside your task-specific metrics. Forgetting cannot be reliably avoided by early stopping alone.
Evaluating Your Fine-Tune
A fine-tuned model without evaluation infrastructure is a liability. You need three signals: task-specific quality, regression on general capability, and a human spot-check on a random sample.
- ▸Task metric on held-out test set: accuracy for classification, ROUGE/BLEU for generation, exact-match for structured output. This is your primary signal
- ▸Regression set: 50–100 examples from the base model's strong domains (general reasoning, math, code). Run before and after fine-tuning to catch forgetting
- ▸LLM-as-judge: for open-ended generation, use Claude or GPT-4 to rate outputs on a 1–5 scale on your specific criteria. More reliable than ROUGE for conversational tasks
- ▸Human spot-check: 20–30 random examples from the eval set, reviewed manually. Catches systematic errors that metrics miss
- ▸Safety eval: run your red-team prompts from baseline testing — any fine-tune that increases refusal bypass rate is not shippable
Define your pass threshold before training (e.g., mean judge score ≥ 4.0 AND pass_rate ≥ 80% AND regression score ≥ 95% of base). Evaluate on the test set once. If you evaluate repeatedly and adjust, you're overfitting to the test set. The threshold should come from your product requirements, not from what the model achieves.
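The example threshold above can be encoded as a single function that is written before training and run once on the test set. This is a minimal sketch; the function name, argument names, and sample scores are hypothetical, and the default thresholds are the ones quoted in the paragraph.

```python
from statistics import mean

def passes_gate(judge_scores, passed_flags, regression_score, base_regression_score,
                min_judge=4.0, min_pass_rate=0.80, min_regression_ratio=0.95):
    """Return (overall_pass, per-check results) for a fine-tuned model."""
    checks = {
        "judge": mean(judge_scores) >= min_judge,
        "pass_rate": sum(passed_flags) / len(passed_flags) >= min_pass_rate,
        "regression": regression_score >= min_regression_ratio * base_regression_score,
    }
    return all(checks.values()), checks

ok, detail = passes_gate(
    judge_scores=[5, 4, 4, 4, 3],                    # 1-5 LLM-as-judge ratings
    passed_flags=[True, True, True, True, False],    # task-level pass/fail
    regression_score=0.70,                           # fine-tuned model on regression set
    base_regression_score=0.72,                      # base model on the same set
)
```

Returning the per-check results makes a failed gate actionable: a model that fails only the regression check points at forgetting rather than at task quality.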
Serving Adapters in Production
The production decision is: merge the adapter into the base model, or serve them separately with hot-swapping. Each has a clear use case.
| Approach | When to use | Latency | Memory | Flexibility |
|---|---|---|---|---|
| Merge + redeploy | Single adapter, high traffic, latency-sensitive | Baseline (no LoRA overhead) | Base model only | None — new fine-tune = new deployment |
| vLLM multi-LoRA | Multiple adapters, shared base, moderate traffic | +1–3ms per request (adapter loading) | Base + all active adapters | Hot-swap without restart |
| Separate deployment | Very different tasks, different base models | Baseline per deployment | Full model per adapter | Full isolation |
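For the "Merge + redeploy" row, the merge step with PEFT might look like the sketch below. The model ID and both directory paths are hypothetical placeholders. It assumes the base model is loaded in BF16 rather than 4-bit for the merge, which is a common practice, not a requirement stated in this article.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

BASE_MODEL_ID = "meta-llama/Llama-3.1-8B"  # assumed; must be the exact base the adapter was trained on
ADAPTER_DIR = "./my-lora-adapter"          # hypothetical path to the saved adapter
MERGED_DIR = "./merged-model"              # hypothetical output path

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)

# Fold the scaled low-rank update into W and drop the adapter modules.
merged = model.merge_and_unload()
merged.save_pretrained(MERGED_DIR)
```

The merged directory can then be served like any ordinary checkpoint. Rerun the eval suite on the merged model, since dtype differences between training and merging can shift outputs slightly.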
Store adapter files alongside: the base model ID and hash, training data version, hyperparameters used, and the eval metrics achieved. Use a model registry (Hugging Face Hub, MLflow, S3 with versioning). When a new base model version ships, flag all adapters trained on the previous version as incompatible — they are not safe to serve on the new base.
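The metadata listed above can live in a small record stored next to the adapter file. A sketch, assuming a plain dataclass serialized to JSON; the class name, field names, and all example values are hypothetical, and the registry itself is out of scope.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class AdapterRecord:
    adapter_name: str
    base_model_id: str
    base_model_hash: str
    training_data_version: str
    hyperparameters: dict = field(default_factory=dict)
    eval_metrics: dict = field(default_factory=dict)

    def is_compatible_with(self, served_base_hash):
        """An adapter is only safe on the exact base weights it was trained on."""
        return self.base_model_hash == served_base_hash

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True, indent=2)

record = AdapterRecord(
    adapter_name="support-tone-v3",               # hypothetical values throughout
    base_model_id="example-org/base-model-8b",
    base_model_hash="abc123",
    training_data_version="2026-01-15",
    hyperparameters={"r": 16, "lora_alpha": 32, "use_dora": True},
    eval_metrics={"judge_mean": 4.2, "regression_ratio": 0.97},
)
```

A serving layer can call `is_compatible_with` at load time and refuse any adapter whose recorded base hash differs from the deployed base.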
Best Practices
Do
- ✓Start with QLoRA + Unsloth for cost-effective experimentation: an 8B fine-tune on 10,000 examples takes roughly 2–3 hours ($3–5) on a rented A100 80GB
- ✓Use target_modules='all-linear' instead of listing layer names: it's architecture-agnostic and covers every linear layer except the output head automatically
- ✓Use use_dora=True for better quality at the same rank — DoRA consistently outperforms vanilla LoRA with no inference overhead
- ✓Set rank 16 as default and increase only with evidence from your eval set — lower rank generalizes better and causes less forgetting
- ✓Evaluate on a regression set (general reasoning, math, code) alongside your task metric to catch catastrophic forgetting early
- ✓Run red-team prompts from your baseline safety evaluation after every fine-tune before shipping
- ✓Version adapters with the base model ID, training data hash, and eval metrics in your model registry
- ✓Merge adapters into the base model for production inference when you have a single adapter — eliminates the extra matrix multiply
Don’t
- ✗Don't start with LoRA before trying prompt engineering and RAG — they're faster to iterate and often sufficient
- ✗Don't use rank > 64 without first testing lower ranks — higher rank causes more catastrophic forgetting per the power-law scaling result
- ✗Don't skip prepare_model_for_kbit_training with QLoRA — it enables gradient checkpointing and fixes embedding dtypes; skipping it causes incorrect gradients
- ✗Don't assume LoRA adapters transfer between base model versions — they are coupled to exact weight values; a base model update requires retraining
- ✗Don't use training loss as your primary eval signal — validate on a held-out test set with a task-specific metric; loss and task quality can diverge
- ✗Don't fine-tune to add knowledge the model will need to update — use RAG for volatile facts, LoRA for stable behaviors
- ✗Don't train for more epochs than needed — overfitting on small datasets and catastrophic forgetting both worsen with more update steps
- ✗Don't deploy without red-teaming safety — LoRA can remove safety guardrails as efficiently as it improves task quality
Key Takeaways
- ✓LoRA trains two tiny matrices (B: d×r, A: r×d) per targeted layer; for d=4096, r=16, that's 131K trainable params vs 16.7M frozen, or 0.78% per layer.
- ✓QLoRA fits 70B fine-tuning on a single A100 80GB by quantizing base weights to 4-bit NF4 (~35 GB) and training LoRA adapters in BF16.
- ✓DoRA (use_dora=True in PEFT) decomposes updates into magnitude + direction, consistently outperforming vanilla LoRA at the same rank with no inference overhead.
- ✓Catastrophic forgetting scales as a power law with rank and update steps — track regression on general benchmarks alongside task metrics.
- ✓LoRA can remove safety guardrails as effectively as it improves task quality — red-team every fine-tuned model before deployment.
- ✓For production, merge single adapters into the base model to eliminate runtime overhead; use vLLM multi-LoRA only when hot-swapping between multiple adapters at runtime.