LoRA & QLoRA
Parameter-efficient fine-tuning with LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA). How they work intuitively, why they need only 1-10% of the memory of full fine-tuning, how to choose hyperparameters (rank, alpha, target modules), and a complete configuration example with Hugging Face PEFT.
Quick Reference
- →LoRA: freeze base model weights, train small low-rank adapter matrices alongside them
- →Memory savings: LoRA trains well under 1% of the parameters of full fine-tuning (often ~0.1%), fitting large models on far fewer GPUs
- →QLoRA: quantize base model to 4-bit, add LoRA adapters in 16-bit -- fine-tune 70B models on a single 48GB GPU
- →Key hyperparameters: rank (r=8-64), alpha (2x rank), target modules (q_proj, v_proj minimum)
- →LoRA adapters are small files (10-100 MB) that can be swapped without reloading the base model
- →Quality is 95-99% of full fine-tuning for most tasks at a fraction of the cost
Why Parameter-Efficient Fine-Tuning?
Full fine-tuning updates every parameter in the model. For a 70B model, that means computing gradients for 70 billion parameters, requiring multiple A100 GPUs and hundreds of gigabytes of memory. LoRA dramatically reduces this by freezing the original weights and only training small adapter matrices. The key insight is that the weight changes during fine-tuning are low-rank -- they can be approximated by much smaller matrices without significant quality loss.
| Method | Trainable params (70B model) | GPU memory | GPUs needed | Quality vs full |
|---|---|---|---|---|
| Full fine-tuning | 70B (100%) | ~600 GB | 8x A100 80GB | 100% (baseline) |
| LoRA (r=16) | ~80M (0.1%) | ~160 GB | 2x A100 80GB | ~97-99% |
| QLoRA (r=16, 4-bit) | ~80M (0.1%) | ~48 GB | 1x A100 48GB | ~95-98% |
| Full fine-tuning (8B model) | 8B (100%) | ~80 GB | 1x A100 80GB | 100% (baseline) |
| QLoRA (8B, r=16, 4-bit) | ~20M (0.25%) | ~10 GB | 1x RTX 4090 | ~95-98% |
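The adapter counts in the table can be sanity-checked with back-of-envelope arithmetic. Each adapted d_out x d_in weight matrix gains r * (d_in + d_out) trainable parameters (A is r x d_in, B is d_out x r). The sketch below uses approximate Llama-3-70B shapes (hidden size 8192, 80 layers, 1024-dim grouped-query k/v projections) and targets the four attention projections; exact totals depend on the architecture and module list:

```python
# Back-of-envelope LoRA parameter count for a 70B-class model,
# targeting q/k/v/o projections with rank r=16.
r, layers, hidden, kv_dim = 16, 80, 8192, 1024

# (d_out, d_in) for each targeted projection in one decoder layer.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (kv_dim, hidden),   # grouped-query attention: smaller k/v
    "v_proj": (kv_dim, hidden),
    "o_proj": (hidden, hidden),
}

# A LoRA pair adds r * (d_in + d_out) params per adapted matrix.
per_layer = sum(r * (d_in + d_out) for d_out, d_in in shapes.values())
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable params")  # prints: 65.5M trainable params
```

Adding the FFN projections (gate_proj, up_proj, down_proj) roughly triples this, which is why the table's "~80M" figure is the right order of magnitude.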
QLoRA makes it possible to fine-tune a 70B parameter model on a single 48GB GPU, and an 8B model on a consumer card like a 24GB RTX 4090. The base model is quantized to 4-bit (NF4 format), cutting its memory footprint roughly 4x relative to 16-bit, while LoRA adapters are trained in 16-bit for quality. This democratized fine-tuning -- you no longer need a GPU cluster.
How LoRA Works (Intuitive Explanation)
In a transformer, most of the parameters are in large weight matrices (e.g., 4096x4096 for attention projections). During fine-tuning, the change to these weights (delta_W) tends to be low-rank -- it can be decomposed into two much smaller matrices. LoRA exploits this by representing the fine-tuned weight as W + delta_W = W + B * A, where B is (d x r) and A is (r x d), with r << d.
- ▸B is initialized to zero, so the LoRA adaptation starts as an identity function (no change to base model)
- ▸During training, only A and B are updated. The original weights W are completely frozen
- ▸The scaling factor (alpha/rank) controls how much the adapter influences the output
- ▸After training, LoRA weights can be merged into the base model for zero-overhead inference: W_new = W + (alpha/r) * B @ A
- ▸Multiple LoRA adapters can be trained for different tasks and swapped at runtime without reloading the base model
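The properties above can be verified numerically with a toy example (dimensions are illustrative; real models use d around 4096 and r around 16):

```python
import numpy as np

d, r = 8, 2
alpha = 4  # effective scale is alpha / r

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trained, small random init
B = np.zeros((d, r))                 # trained, zero init

x = rng.normal(size=(d,))

# With B = 0, the adapted layer is identical to the base layer.
y_base = W @ x
y_lora = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(y_base, y_lora)

# After training, B is nonzero; merging the adapter into W gives the
# same outputs with zero extra cost at inference time.
B = rng.normal(size=(d, r))
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

Note the parameter savings even in this toy case: A and B together hold 2 * r * d = 32 values versus d * d = 64 in W; at d=4096, r=16 the ratio is far more dramatic.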
Choosing LoRA Hyperparameters
| Hyperparameter | Typical range | Default recommendation | Effect of increasing |
|---|---|---|---|
| rank (r) | 4-256 | 16-32 | More expressive but more memory and risk of overfitting |
| alpha | 8-128 | 2x rank (r=16 -> alpha=32) | Scales the LoRA contribution. Higher = stronger adaptation |
| target modules | Varies by model | q_proj, v_proj, k_proj, o_proj | More modules = more expressive but more memory |
| dropout | 0.0-0.1 | 0.05 | Regularization against overfitting |
| learning rate | 1e-5 to 5e-4 | 2e-4 for QLoRA | Higher = faster learning but risk of instability |
For simple tasks (classification, format adaptation): r=8 is usually sufficient. For moderate tasks (style transfer, domain adaptation): r=16-32 works well. For complex tasks (learning new capabilities): r=64-128 may be needed. Start low and increase only if quality is insufficient -- lower rank trains faster and generalizes better.
Targeting only q_proj and v_proj (the minimum) saves memory but limits expressiveness. For most tasks, also include k_proj, o_proj, and the FFN projections (gate_proj, up_proj, down_proj). The difference between targeting 2 modules and 7 modules is typically 2-5% quality improvement for 3x more trainable parameters -- usually a worthwhile trade-off.
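These recommendations translate into a PEFT configuration along the following lines (a sketch: the model name is illustrative, and target_modules should match your architecture's layer names):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                         # rank: start at 16-32
    lora_alpha=32,                # convention: 2x rank
    lora_dropout=0.05,
    target_modules=[              # all attention + FFN projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs total params
</imports>
```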
QLoRA: Quantized Base + LoRA Adapters
QLoRA combines two ideas: (1) quantize the base model to 4-bit precision (NF4 format), reducing memory by 4x, and (2) add LoRA adapters in higher precision (16-bit) for training quality. The result is fine-tuning a 70B model on a single 48GB GPU -- previously impossible without a GPU cluster.
4-bit quantization reduces the base model quality by roughly 1-3% on most benchmarks. The LoRA fine-tuning typically recovers this gap and then some, because it is optimizing for your specific task. In practice, QLoRA fine-tuned models are within 1-2% of full fine-tuning quality on most tasks -- a negligible difference for the massive memory savings.
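A typical QLoRA setup wires these two ideas together via bitsandbytes quantization plus a PEFT adapter (a sketch, assuming a CUDA GPU with enough memory for the 4-bit weights; the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# (1) Quantize the frozen base model to 4-bit NF4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Required for k-bit training: gradient checkpointing, layer norm
# casting, and input embedding gradients.
model = prepare_model_for_kbit_training(model)

# (2) Add 16-bit LoRA adapters on top of the 4-bit base.
model = get_peft_model(model, LoraConfig(
    task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))
```

From here the model trains like any other with the Hugging Face Trainer, keeping in mind the higher learning rate (around 2e-4) that QLoRA typically needs.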
Managing LoRA Adapters in Production
One of LoRA's practical advantages is that adapters are small, independent files that can be managed separately from the base model. This enables powerful patterns in production.
- ▸Adapter files are typically 10-100 MB, compared to 4-16 GB for a full model. Easy to version, store, and deploy
- ▸Multiple adapters can share one base model: customer-support-v1, legal-v1, medical-v1 all on the same Llama 3 instance
- ▸Hot-swapping: load a different adapter without restarting the server or reloading the base model (supported by vLLM, TGI)
- ▸A/B testing: serve different adapter versions to different user segments
- ▸Merge for production: for maximum inference speed, merge the adapter into the base model (eliminates the extra matrix multiply)
Treat LoRA adapters like model artifacts: version them with the training data version, hyperparameters, and evaluation metrics. Store them in a model registry (MLflow, Hugging Face Hub, S3 with versioning). This lets you roll back to a previous adapter version if a new one degrades quality.
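The swap-and-merge patterns above look roughly like this with PEFT (a sketch: the adapter repository names are hypothetical placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Attach one adapter, then load more onto the same base model.
model = PeftModel.from_pretrained(
    base, "my-org/customer-support-v1", adapter_name="support"
)
model.load_adapter("my-org/legal-v1", adapter_name="legal")

# Hot-swap the active adapter without reloading the base model.
model.set_adapter("legal")

# For production: fold the active adapter into the base weights,
# eliminating the extra matrix multiply at inference time.
merged = model.merge_and_unload()
merged.save_pretrained("llama3-8b-legal-merged")
```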
Best Practices
Do
- ✓Start with QLoRA for cost-effective experimentation -- fine-tune 70B models on a single GPU
- ✓Use rank 16-32 as a default and only increase if quality is insufficient
- ✓Target all attention projections plus FFN projections for best quality
- ✓Version and store adapters separately from base models in a model registry
- ✓Merge adapters for production deployment when you need maximum inference speed
Don’t
- ✗Don't start with full fine-tuning -- LoRA achieves 95-99% of the quality at a fraction of the cost
- ✗Don't use very high rank (r>128) without evidence that lower rank is insufficient -- it wastes memory and risks overfitting
- ✗Don't skip the prepare_model_for_kbit_training step with QLoRA -- it handles critical gradient checkpointing setup
- ✗Don't assume LoRA adapters trained on one base model version work with another -- they are tied to the specific base model
- ✗Don't ignore learning rate -- QLoRA typically needs higher learning rates (2e-4) than full fine-tuning (2e-5)
Key Takeaways
- ✓LoRA trains small adapter matrices (0.1-1% of parameters) while freezing the base model, achieving 95-99% of full fine-tuning quality.
- ✓QLoRA combines 4-bit base model quantization with 16-bit LoRA adapters, enabling 70B model fine-tuning on a single GPU.
- ✓Key hyperparameters: rank 16-32, alpha 2x rank, target all attention + FFN projections for best results.
- ✓LoRA adapters are small (10-100 MB), versionable, and hot-swappable -- enabling multi-task deployment from one base model.
- ✓For production inference, merge adapters into the base model to eliminate the LoRA overhead.