LLM Fine-Tuning with LoRA and QLoRA: A Practical Guide for Engineers

April 27, 2026

What LoRA and QLoRA Actually Do

Full fine-tuning updates every weight in the model. For a 7 billion parameter model, that means 7 billion values change during training, and you need enough GPU memory to hold the model, the gradients, and the optimizer states. That comes out to roughly 120 GB of VRAM for a 7B model with full fine-tuning. Most teams do not have that hardware sitting around.
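As a rough sanity check on that figure, the sketch below adds up per-parameter memory under one common accounting for mixed-precision training with Adam: 16-bit weights and gradients, plus 32-bit master weights and two 32-bit optimizer moments. The byte counts and the function name are assumptions for illustration, and activations are left out, so the result is a floor rather than a full estimate.

```python
# Assumed per-parameter costs for mixed-precision training with Adam.
BYTES_PER_PARAM = {
    "weights_16bit": 2,
    "gradients_16bit": 2,
    "master_weights_32bit": 4,
    "adam_first_moment_32bit": 4,
    "adam_second_moment_32bit": 4,
}

def full_finetune_memory_gb(num_params: float) -> float:
    """Memory for weights, gradients, and optimizer state, excluding activations."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

print(f"{full_finetune_memory_gb(7e9):.0f} GB before activations")
```

Under these assumptions a 7B model needs about 112 GB before any activations are stored, which is consistent with the rough 120 GB total quoted above.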

LoRA (Low-Rank Adaptation) sidesteps the problem. Instead of updating all 7 billion weights, it freezes the original model and injects pairs of small trainable matrices into specific layers. These matrices typically have rank 8 to 64, so each pair holds only a small fraction of the parameters in the weight matrix it adapts. A LoRA adapter for a 7B model might add only 10 to 50 million trainable parameters. The rest stay frozen.
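To make the mechanism concrete, here is a minimal pure-Python sketch of one LoRA-augmented linear layer. It assumes the usual formulation in which the frozen weight W is supplemented by a low-rank product B·A scaled by alpha / r, with B initialized to zero so the adapter starts as a no-op. The helper names and the tiny dimensions are illustrative only.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x). Only A and B would be trained."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapter contributes nothing, so the output matches the base layer.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_out * d_in       # parameters in the frozen matrix
lora_params = r * (d_in + d_out) # parameters in A and B combined
print(full_params, lora_params)
```

At toy sizes the saving is small, but at realistic sizes the gap is large: a 4096 x 4096 projection has about 16.8 million weights, while a rank-16 adapter on it has 16 x (4096 + 4096) = 131,072.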

QLoRA goes one step further. It loads the frozen base model in 4-bit quantized format, cutting the memory needed for the base weights by roughly 4x compared with 16-bit. The LoRA adapters still train in 16-bit precision, so training quality stays close to 16-bit LoRA. The result: you can fine-tune a 7B model on a single 24 GB GPU (like an RTX 4090 or an A10G).
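The sketch below illustrates the general idea of block-wise 4-bit quantization: each block of weights is stored as small integer codes plus one shared scale. It is a deliberate simplification that uses evenly spaced levels and absmax scaling; the NF4 format used by QLoRA relies on a different, non-uniform set of levels, and the function names here are hypothetical.

```python
def quantize_block(values):
    """Map floats to integer codes in [-7, 7] (fits in 4 bits) plus one scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats for use in the forward pass."""
    return [c * scale for c in codes]

weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

# Rounding to the nearest level bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"max round-trip error: {max_error:.4f}")
```

Storing 4-bit codes instead of 16-bit values is where the roughly 4x saving on base weights comes from; the per-block scales add a small overhead on top.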

When to Use LoRA vs QLoRA

Use LoRA when you have access to 48 GB+ GPUs (A100 80GB, RTX A6000) and want the highest quality fine-tune. LoRA in 16-bit is still the gold standard for adapter quality.

Use QLoRA when GPU memory is tight. If you are training on consumer hardware, a single cloud GPU, or need to keep costs low, QLoRA is the practical choice. The quality difference between LoRA and QLoRA is measurable but small. In most benchmarks, QLoRA fine-tuned models score within 1 to 3 percent of 16-bit LoRA models.

Skip both and use full fine-tuning only if you have a multi-GPU cluster and need to modify the model's behavior at a deep level, such as changing its language or fundamentally altering its reasoning patterns.

Hardware Requirements at a Glance

Method	7B Model	13B Model	70B Model
Full fine-tune (16-bit)	~120 GB	~240 GB	Not practical on single node
LoRA (16-bit base)	~28 GB	~52 GB	~280 GB (multi-GPU)
QLoRA (4-bit base)	~10 GB	~18 GB	~48 GB

These numbers are approximate and assume a per-device batch size of 1 with gradient accumulation. Larger per-device batch sizes and longer sequences increase activation memory roughly linearly.
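The table can be turned into a quick lookup. The sketch below hard-codes the approximate figures from the table and reports which methods fit a given amount of VRAM; the function name and the 10 percent headroom are assumptions, not measured values.

```python
# Approximate VRAM needs in GB, copied from the table above.
# Full fine-tuning of 70B is omitted because the table lists no single-node figure.
REQUIREMENTS_GB = {
    "full fine-tune (16-bit)": {"7B": 120, "13B": 240},
    "LoRA (16-bit base)": {"7B": 28, "13B": 52, "70B": 280},
    "QLoRA (4-bit base)": {"7B": 10, "13B": 18, "70B": 48},
}

def methods_that_fit(model_size: str, vram_gb: float, headroom: float = 0.10):
    """Return methods whose estimate fits in vram_gb with some headroom left."""
    usable = vram_gb * (1 - headroom)
    return [
        method
        for method, sizes in REQUIREMENTS_GB.items()
        if model_size in sizes and sizes[model_size] <= usable
    ]

print(methods_that_fit("7B", 24))   # a single 24 GB card
print(methods_that_fit("13B", 80))  # a single 80 GB card
```

With these figures, a 24 GB card fits only QLoRA for a 7B model, while an 80 GB card fits both LoRA and QLoRA for a 13B model.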

Step-by-Step: Fine-Tuning Llama 3 8B with QLoRA

This example uses Hugging Face's PEFT library together with the SFTTrainer from TRL, which builds on the transformers Trainer.

Install dependencies

pip install transformers peft bitsandbytes datasets accelerate trl

Load the model in 4-bit

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16 after dequantization
    bnb_4bit_use_double_quant=True          # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-8B',
    quantization_config=bnb_config,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
tokenizer.pad_token = tokenizer.eos_token

Attach LoRA adapters

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

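lora_config = LoraConfig(
    r=16,            # rank of the adapter matrices
    lora_alpha=32,   # the adapter update is scaled by lora_alpha / r
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],  # attention projections
    lora_dropout=0.05,
    bias='none',     # leave bias terms frozen
    task_type='CAUSAL_LM'
)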

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (exact totals can vary with library version):
# trainable params: 13,631,488 || all params: 8,043,631,616 || trainable%: 0.1695

Prepare your dataset

from datasets import load_dataset

dataset = load_dataset('json', data_files='training_data.jsonl', split='train')

def format_instruction(example):
    # Build one prompt/response string per example for supervised fine-tuning.
    return {
        'text': (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    }

dataset = dataset.map(format_instruction)

Your training data should be a JSONL file with instruction/output pairs. Quality matters far more than volume. 500 well-written, diverse examples often outperform 5,000 noisy ones.
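Before training, it can help to check that every line parses and carries both fields. This is a minimal sketch using only the standard library; the function name is hypothetical, and it assumes the instruction and output keys used by format_instruction above.

```python
import json

REQUIRED_KEYS = ("instruction", "output")

def find_bad_records(path):
    """Return (line_number, reason) pairs for lines that fail basic checks."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines rather than flag them
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((line_number, f"invalid JSON: {err.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            for key in REQUIRED_KEYS:
                value = record.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((line_number, f"missing or empty '{key}'"))
    return problems
```

Calling find_bad_records('training_data.jsonl') returns an empty list when every record passes. A check like this only catches structural problems; it says nothing about whether the examples are well written or diverse.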

Train

from trl import SFTTrainer
from transformers import TrainingArguments

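training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8 per device
    learning_rate=2e-4,
    warmup_steps=50,
    logging_steps=10,
    save_strategy='epoch',
    bf16=True
)

# Note: the keyword arguments below match older TRL releases. Newer releases
# expect dataset_text_field and the sequence-length limit on trl.SFTConfig and
# take the tokenizer as processing_class, so check your installed version.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field='text',
    max_seq_length=2048
)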

trainer.train()

Training an 8B model with QLoRA on 1,000 examples takes about 30 to 60 minutes on a single A100. On an RTX 4090, expect 1 to 2 hours. Actual times depend on sequence length and data loading.
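For planning, the step count implied by the settings above can be worked out directly. The sketch assumes a single GPU and one optimizer update per full gradient-accumulation cycle, with a partial final cycle rounded up; exact rounding can differ between library versions, so treat the numbers as approximate. The function name is hypothetical.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Approximate effective batch size and optimizer updates for a run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective_batch, total_steps = optimizer_steps(
    num_examples=1000, per_device_batch=1, grad_accum=8, epochs=3
)
warmup_share = 50 / total_steps  # warmup_steps=50 from the training arguments
print(effective_batch, total_steps, f"{warmup_share:.0%}")
```

With 1,000 examples this gives an effective batch of 8 and about 375 updates, so 50 warmup steps cover roughly the first 13 percent of training. With only a few hundred examples, the same 50 steps would be a much larger share and may be worth reducing.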

Key Hyperparameters to Tune

Rank (r): Start with 16. Higher ranks (32, 64) capture more complex adaptations but use more memory. Going above 64 rarely helps.
Alpha: Scales the adapter's contribution; the update is multiplied by alpha / r. It is usually set to 2x the rank. Alpha 32 with rank 16 is a common starting point.
Learning rate: 1e-4 to 3e-4 for QLoRA. Too high and training destabilizes. Too low and the adapter barely changes.
Target modules: Attention projections (q_proj, k_proj, v_proj, o_proj) are the standard targets. Adding MLP layers (gate_proj, up_proj, down_proj) increases capacity at the cost of more trainable parameters.
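To see how rank and target modules affect adapter size, the sketch below counts LoRA parameters as r x (d_in + d_out) per adapted matrix. The layer shapes are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, 1024-dimensional key and value projections, MLP width 14336, 32 layers); check the config of the model you actually use.

```python
# (d_in, d_out) for each projection in one decoder layer; assumed Llama 3 8B shapes.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32

ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]

def lora_param_count(r, target_modules):
    """Each adapted matrix adds A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * sum(SHAPES[name]) for name in target_modules)
    return per_layer * NUM_LAYERS

for r in (8, 16, 64):
    attn = lora_param_count(r, ATTENTION)
    both = lora_param_count(r, ATTENTION + MLP)
    print(f"r={r:<2}  attention only: {attn:>11,}  attention + MLP: {both:>11,}")
```

Under these assumed shapes, the attention-only count at rank 16 is 13,631,488, which matches the trainable-parameter figure printed in the walkthrough. Adding the MLP projections roughly triples the adapter size at any rank.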

After Training: Merge and Deploy

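You can serve the adapter separately or merge it back into the base model for simpler deployment. Keep in mind that the model trained above has a 4-bit base, and merging into quantized weights can introduce rounding error. A common approach is to reload the base model in 16-bit, attach the saved adapter with PeftModel.from_pretrained, and merge there. The merge call itself is the same: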

merged_model = model.merge_and_unload()
merged_model.save_pretrained('./merged-model')
tokenizer.save_pretrained('./merged-model')

The merged model behaves like any other Hugging Face model. You can load it with vLLM, TGI, or Ollama for serving.

Frequently Asked Questions

Does QLoRA quality suffer compared to full fine-tuning?

Slightly. The original QLoRA paper showed it matched full 16-bit fine-tuning on several benchmarks, but real-world results vary by task. For most business applications, the difference is negligible.

Can I stack multiple LoRA adapters?

Yes. PEFT supports loading multiple adapters onto one base model (load_adapter) and switching between them at inference time (set_adapter). This lets you fine-tune for different tasks or clients without duplicating the base model.

What if my training data has fewer than 100 examples?

You can still fine-tune, but expect limited generalization. Below 100 examples, consider few-shot prompting or RAG instead. If you must fine-tune with little data, increase the number of epochs to 5 to 10 and watch for overfitting using a held-out validation set.
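One way to watch for overfitting is to hold out a small validation split and stop once validation loss has not improved for a few evaluations. The helper names, the 10 percent split, the patience value, and the loss numbers below are illustrative assumptions; the transformers library also ships an EarlyStoppingCallback that serves a similar purpose.

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def should_stop(val_losses, patience=2):
    """True when the last `patience` evaluations failed to beat the earlier best."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])

train, val = train_val_split(range(80))
print(len(train), len(val))

# Made-up validation losses: improvement for three evaluations, then a rise.
history = [1.90, 1.42, 1.31, 1.35, 1.40]
print(should_stop(history[:3]), should_stop(history))
```

In this made-up history the check stays False while validation loss is still falling and turns True after two evaluations without improvement, which is the point at which extra epochs are likely memorizing the training set.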