A complete guide to teaching AI models new skills: supervised fine-tuning (SFT), LoRA/QLoRA, RLHF, DPO, GRPO, model distillation, model merging, and evaluation. From concept to production — with working code at every step.
Pretraining gives a model broad knowledge of the world, but only one skill: predicting the next token. The model has seen Wikipedia, code, books, and the web — but it doesn't know to be helpful, to follow instructions, or to refuse dangerous requests. Fine-tuning is the process of teaching these behaviors after pretraining.
The industry has converged on a standard training ladder that the major frontier models (GPT-4o, Claude Opus 4.6, Llama 4, Gemini 2.5) broadly follow. Each stage builds on the previous — you cannot skip SFT and jump straight to RLHF.
graph LR
    A[Raw Text Corpus] -->|Pretraining cross-entropy| B[Base Model]
    B -->|Supervised Fine-Tuning| C[Instruction-Following Model]
    C -->|RLHF / DPO / GRPO| D[Aligned Model]
    D -->|Evaluation & Red-teaming| E[Production Model]
Self-supervised next-token prediction on massive corpora. Encodes world knowledge.
Supervised fine-tuning on instruction-response pairs. Teaches the model to be helpful.
RLHF, DPO, or GRPO on human preference data. Makes outputs safe and preferred.
Automated benchmarks + red-teaming. Catch regressions before shipping.
SFT trains the model to predict assistant tokens given a conversation context. The key detail is loss masking: the cross-entropy loss is computed only on assistant tokens, not on the system prompt or user turns. This prevents the model from “learning” the user's side of the conversation.
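Loss masking boils down to copying the input token IDs into the label sequence and replacing every non-assistant position with the ignore index. A minimal sketch (hypothetical token IDs and roles; -100 is the ignore index PyTorch's cross-entropy skips):

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def mask_labels(token_ids, roles):
    """Copy token_ids into labels, masking every non-assistant token
    so cross-entropy is computed only on assistant tokens."""
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]

# Toy 6-token conversation: system prompt, user turn, assistant reply
token_ids = [101, 102, 201, 202, 301, 302]
roles = ["system", "system", "user", "user", "assistant", "assistant"]
print(mask_labels(token_ids, roles))  # → [-100, -100, -100, -100, 301, 302]
```

TRL's SFTTrainer handles this automatically when the dataset uses a chat template; the sketch just makes the mechanism explicit.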
Three formats dominate the SFT landscape. ChatML has become the most widely adopted due to its unambiguous special tokens.
<|im_start|>system
You are a helpful AI assistant specialized in European AI regulation.<|im_end|>
<|im_start|>user
What are the key obligations under the EU AI Act for high-risk systems?<|im_end|>
<|im_start|>assistant
High-risk AI systems under the EU AI Act (in force August 2024) must comply with...<|im_end|>
| Parameter | Typical Value | Notes |
|---|---|---|
| Learning rate | 2e-5 | Lower than pretraining; cosine decay |
| Epochs | 2–3 | More epochs → overfitting on small datasets |
| Batch size (effective) | 64–128 | Use gradient accumulation for small GPU memory |
| Warmup ratio | 0.1 | 10% of steps for LR warmup |
| Max sequence length | 2048–8192 | Match your inference context window |
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
import torch
model_name = "meta-llama/Llama-4-Scout-17B-16E-Instruct" # 2026: Llama 4 Scout replaces Llama 3.1 8B
model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
sft_config = SFTConfig(
output_dir="./sft-llama-4-scout",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=2e-5,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
logging_steps=10,
save_strategy="epoch",
bf16=True,
)
trainer = SFTTrainer(
model=model,
args=sft_config,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
trainer.save_model()

Full fine-tuning modifies all ~8 billion parameters of an 8B model. At bfloat16 that's 16 GB just for parameter storage, plus gradients and optimizer states. LoRA (Low-Rank Adaptation, Hu et al. 2021) exploits a key empirical observation: weight changes during fine-tuning are low-rank.
Instead of learning a full weight update ΔW ∈ ℝ^(d×k), LoRA learns two small matrices A ∈ ℝ^(d×r) and B ∈ ℝ^(r×k) with r ≪ min(d, k), so ΔW = AB. At inference, the adapter is folded back into the base weight: W′ = W + (α/r)·AB. Once merged, there is zero inference overhead.
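The shapes and the merge step can be sketched from scratch in a few lines of numpy (illustrative dimensions only — real implementations like peft wrap nn.Linear modules):

```python
import numpy as np

d, k, r, alpha = 64, 64, 8, 16           # hypothetical dims; r << min(d, k)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01   # trainable adapter matrix
B = np.zeros((r, k))                     # zero init, so ΔW starts at 0
B += rng.standard_normal((r, k)) * 0.01  # pretend training has updated B

def lora_forward(x):
    # Adapter path adds (alpha / r) * x @ A @ B on top of the frozen weight
    return x @ W + (alpha / r) * (x @ A) @ B

# Merging folds the adapter into W — afterwards inference cost is unchanged
W_merged = W + (alpha / r) * A @ B

x = rng.standard_normal((2, d))
assert np.allclose(lora_forward(x), x @ W_merged)
print("merged weight reproduces the adapter forward pass")
```

The zero initialization of B is what makes training stable: at step 0 the model is exactly the pretrained one.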
from peft import LoraConfig, TaskType, get_peft_model
config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 83,886,080 || all params: 8,030,261,248 || trainable%: 1.044
# After training, merge adapter back into the base weights
merged = model.merge_and_unload()
merged.save_pretrained("./my-lora-merged")

| Method | Trainable Params | GPU RAM (8B) | Quality | Training Speed |
|---|---|---|---|---|
| Full Fine-Tuning | 8B (100%) | ~80 GB | Best | Slowest |
| LoRA r=4 | ~21M (0.3%) | ~16 GB | Good | Fast |
| LoRA r=16 | ~83M (1.0%) | ~18 GB | Very Good | Fast |
| LoRA r=64 | ~335M (4.1%) | ~24 GB | Near Full FT | Moderate |
DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes the update into magnitude and direction components; enable it with use_dora=True in LoraConfig.

Even with LoRA, the base model loaded at bfloat16 requires 16 GB for an 8B model — beyond most consumer GPU budgets. QLoRA (Dettmers et al. 2023) solves this by quantizing the frozen base model to 4-bit NormalFloat (NF4) while training LoRA adapters at bfloat16 precision.
NormalFloat4 is information-theoretically optimal for normally-distributed neural network weights. Less error than int4 or fp4.
Optimizer states automatically page to CPU RAM when GPU memory fills, preventing OOM crashes during training.
Quantizes the quantization constants themselves, saving an extra ~0.5 bits per parameter.
| Model | FP16 VRAM | QLoRA VRAM | Min GPU |
|---|---|---|---|
| Llama 4 Scout (17B) | 34 GB | 10 GB | RTX 4090 24GB |
| Llama 4 Maverick (70B-class) | 140 GB | 40 GB | 2× A100 40GB |
| Llama 4 Behemoth (frontier) | 800+ GB | ~200 GB | 8× H100 80GB |
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-4-Maverick-17B-128E-Instruct", # 2026: Llama 4 Maverick replaces Llama 3.1 70B
quantization_config=bnb_config,
device_map="auto",
)
# Now apply LoRA to the 4-bit model — same LoraConfig + get_peft_model as before

Reinforcement Learning from Human Feedback (RLHF) was the breakthrough that turned GPT-3 into InstructGPT and eventually GPT-4o. It aligns model behavior with human preferences — not just instruction following, but making outputs genuinely preferred, safe, and helpful.
Fine-tune the base model on a curated set of high-quality instruction-following demos. This creates the starting policy that RLHF will improve.
Train a classifier on pairwise human preferences: given two completions (y_w, y_l) to the same prompt, which is better? Loss: log σ(r(x, y_w) − r(x, y_l)).
Use Proximal Policy Optimization to maximize the reward model score while staying close to the SFT policy (KL divergence penalty prevents reward hacking).
graph LR
    A[Base Model] -->|SFT on demos| B[SFT Model]
    B -->|Sample completions| C[Completion Pairs]
    C -->|Human labelers rank| D[Preference Dataset]
    D -->|Train| E[Reward Model]
    B -->|Initialize policy| F[Policy Model]
    F -->|Rollout + PPO| G[RL Optimization]
    E -->|Score rollouts| G
    G -->|Converged| H[RLHF Model]
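The reward-model objective from step 2 is the Bradley-Terry pairwise loss. A scalar sketch with hypothetical reward values (a real reward model produces r(x, y) from a transformer head):

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigma(r_w - r_l): shrinks toward 0 as the margin between
    the chosen and rejected rewards grows."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(pairwise_rm_loss(2.0, -1.0), 4))  # large margin → small loss
print(round(pairwise_rm_loss(0.0, 0.0), 4))   # no margin → log 2 ≈ 0.6931
```

Training the reward model is therefore just binary classification over completion pairs, with the margin acting as the logit.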
DPO (Direct Preference Optimization, Rafailov et al. 2023) eliminates the reward model entirely. The authors showed that the optimal RLHF policy can be expressed in closed form as a function of the preference data, collapsing the three-stage pipeline into a single fine-tuning step.
The DPO loss directly optimizes the policy on preference pairs (prompt, chosen, rejected) using the SFT model as a frozen reference. No PPO, no reward model, no separate RM training data collection.
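The core of the objective fits in a few lines. A scalar sketch with hypothetical sequence log-probabilities (a real implementation like TRL's DPOTrainer sums per-token log-probs under the policy and the frozen reference):

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """-log sigma(beta * [(pi_w - ref_w) - (pi_l - ref_l)]).
    pi_*  : sequence log-prob under the policy being trained
    ref_* : sequence log-prob under the frozen SFT reference
    beta  : KL-penalty coefficient (same role as in the config below)"""
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy favors the chosen response more than the reference does → low loss
print(round(dpo_loss(pi_w=-12.0, pi_l=-20.0, ref_w=-15.0, ref_l=-16.0), 4))
```

The implicit reward is the log-probability ratio against the reference, which is why no separate reward model is needed.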
from trl import DPOConfig, DPOTrainer
from datasets import load_dataset
# Dataset needs: prompt, chosen, rejected columns
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
dpo_config = DPOConfig(
output_dir="./dpo-output",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=5e-7, # much smaller than SFT lr
beta=0.1, # KL penalty coefficient
bf16=True,
)
trainer = DPOTrainer(
model=sft_model, # your SFT fine-tuned model
ref_model=sft_ref_model, # frozen reference
args=dpo_config,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()

Group Relative Policy Optimization (GRPO, used in DeepSeek-R1) eliminates the learned value model. For each prompt it samples a group of outputs and uses the group's mean reward as the baseline for advantage estimation. This is cheaper than PPO (no value network to train) and well suited to reasoning tasks where correctness can be verified programmatically.
| Method | Compute | Stability | Data Requirements | Notes |
|---|---|---|---|---|
| RLHF (PPO) | Very High | Low | Human rankings | 4 models in memory; reward hacking risk |
| DPO | Low | High | Preference pairs | No reward model; simpler pipeline |
| GRPO | Medium | Medium | Rollout samples | No value model; good for reasoning |
| SimPO | Low | High | Preference pairs | No reference model; avg log prob reward |
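The group-relative baseline GRPO uses instead of a value network can be sketched directly (hypothetical rewards for one prompt's sampled group):

```python
import statistics

def group_advantages(rewards):
    """Advantage of each rollout = (reward - group mean) / group std.
    The group statistics replace PPO's learned value baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: all-equal group
    return [(r - mean) / std for r in rewards]

# 4 sampled answers to one math prompt, scored 1.0 if correct else 0.0
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # correct rollouts → positive advantage
```

With a programmatic verifier producing the rewards, this is the entire advantage-estimation machinery — no extra model in memory.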
Knowledge distillation trains a small “student” model to mimic a large “teacher” model. The key insight is that the teacher provides soft probability distributions over the vocabulary (logits) rather than one-hot labels. These soft targets encode far more information — they reveal which tokens are semantically similar to the correct answer, giving the student a richer training signal.
The combined loss: L = α · L_CE(student, hard labels) + (1 − α) · T² · L_KL(p_teacher ∥ p_student), with both distributions computed at temperature T. Temperature T > 1 softens the teacher distribution, spreading probability mass across more tokens and making the soft labels more informative; the T² factor keeps the soft-loss gradients comparable in magnitude across temperatures.
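The combined loss on a toy 3-token vocabulary (hypothetical logits; a real pipeline applies this per token position over the full vocabulary):

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, hard_label, alpha=0.5, T=2.0):
    # Hard loss: cross-entropy against the one-hot ground-truth label
    ce = -math.log(softmax(student_logits)[hard_label])
    # Soft loss: KL(teacher_T || student_T), scaled by T^2
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(p * math.log(p / q) for p, q in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * (T * T) * kl

print(distill_loss([2.0, 0.5, -1.0], [2.5, 1.0, -2.0], hard_label=0))
```

When student and teacher logits agree exactly, the KL term vanishes and only the hard cross-entropy remains.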
graph TB
    A["Large Teacher (70B)"] -->|"Generate on training data"| B[Soft Logits]
    C[Input Prompt] --> A
    C --> D["Small Student (7B)"]
    B -->|KL Loss| D
    E[Ground Truth] -->|CE Loss| D
    D -->|Both losses| F[Distilled Student]
Student imitates teacher outputs — generate teacher completions, train student to reproduce them. Used by DeepSeek-R1-Distill to transfer reasoning traces.
Match intermediate representations (hidden states, attention patterns) between teacher and student layers. Transfers structural knowledge, not just surface outputs.
A small draft model proposes token sequences; the large model verifies them in parallel. Achieves 2–4x inference speedup with no quality loss.
The student generates tokens; the teacher scores them. Avoids exposure bias (train-test distribution mismatch) common in offline distillation.
Model merging combines multiple fine-tuned checkpoints into a single model without any additional training. It's cheap, fast, and surprisingly effective for combining specialized skills — code, math, instruction following — into one deployable model. Merged models frequently appear at the top of the HuggingFace Open LLM Leaderboard.
Smooth interpolation between two model checkpoints in weight space. Treats weights as points on a hypersphere. Best for blending two closely-related models.
Compute ΔW = W_FT − W_base for each fine-tuned model, then add deltas together. Lets you compose capabilities or subtract undesirable behaviors.
Resolves conflicts between models: trim small-magnitude parameters, elect the dominant sign for each weight, then merge. Handles 3+ models cleanly.
Randomly drops fine-tuning weight deltas (with probability p) and rescales the survivors to preserve the norm. Reduces interference between models.
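Task arithmetic, the simplest of these methods, reduces to tensor algebra on weight deltas. A numpy sketch on toy 2×2 "weights" (hypothetical checkpoints standing in for full state dicts):

```python
import numpy as np

base = np.array([[1.0, 0.0], [0.0, 1.0]])     # pretrained weight
code_ft = np.array([[1.2, 0.1], [0.0, 1.0]])  # fine-tuned for code
math_ft = np.array([[1.0, 0.0], [0.3, 1.1]])  # fine-tuned for math

# Task vectors: what each fine-tune changed relative to the base
tau_code = code_ft - base
tau_math = math_ft - base

# Compose both skills (the 0.5 scaling factors are tunable hyperparameters)
merged = base + 0.5 * tau_code + 0.5 * tau_math

# Negation: subtracting a task vector *removes* a learned behavior
without_code = base - 0.5 * tau_code
print(merged)
```

TIES and DARE refine exactly this operation by trimming, sign-electing, or randomly dropping the delta entries before summation.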
# mergekit config.yaml
models:
  - model: meta-llama/Llama-4-Scout-17B-16E
    parameters:
      weight: 0.4
  - model: ./llama-4-scout-code-finetuned
    parameters:
      weight: 0.3
  - model: ./llama-4-scout-math-finetuned
    parameters:
      weight: 0.3
merge_method: ties
base_model: meta-llama/Llama-4-Scout-17B-16E
parameters:
  density: 0.7
  normalize: true

Run the merge with: mergekit-yaml config.yaml ./merged-model --cuda
Layer-stacking "frankenmerges" are built with the passthrough merge method.

Dataset quality is the single most important factor in fine-tuning success — more important than model architecture, training duration, or optimizer choice. A poorly curated dataset guarantees poor results regardless of everything else.
Expert-authored examples; highest signal-to-noise ratio. Used for critical behaviors.
Synthetic generation with frontier models. Good for bootstrapping domain coverage at scale.
Evolve seed instructions into harder, more diverse variants. Used in WizardLM and OpenHermes.
Requires aggressive quality filtering: deduplication, length filter, perplexity filter, safety filter.
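A minimal filtering pass implementing the deduplication and length filters above (hypothetical thresholds; the perplexity and safety filters would additionally need a scoring model):

```python
import hashlib

def filter_examples(examples, min_len=20, max_len=4000):
    """Keep examples that are unique and whose responses fall in a sane
    length band. Exact dedup via content hash; production pipelines also
    use MinHash or embedding similarity for near-duplicates."""
    seen, kept = set(), []
    for ex in examples:
        text = ex["instruction"] + "\n" + ex["response"]
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        if not (min_len <= len(ex["response"]) <= max_len):
            continue  # degenerate or runaway response
        seen.add(digest)
        kept.append(ex)
    return kept

data = [
    {"instruction": "Q1", "response": "A sufficiently detailed answer..."},
    {"instruction": "Q1", "response": "A sufficiently detailed answer..."},  # duplicate
    {"instruction": "Q2", "response": "short"},                              # too short
]
print(len(filter_examples(data)))  # → 1
```

Even this crude pass typically removes a meaningful fraction of synthetic datasets before training.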
{
"conversations": [
{"from": "system", "value": "You are an expert in EU AI regulation."},
{"from": "human", "value": "Explain the risk categories in the EU AI Act."},
{"from": "gpt", "value": "The EU AI Act categorizes AI systems into four risk levels..."}
]
}

from openai import OpenAI  # or use Mistral/Llama locally
client = OpenAI()
def generate_training_example(topic: str, difficulty: str) -> dict:
prompt = (
f"Generate a challenging {difficulty}-level question about {topic} "
"and a comprehensive expert answer."
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
temperature=0.8,
)
content = response.choices[0].message.content
# Parse and structure output (question/answer split)...
    return {"instruction": topic, "response": content}

The fine-tuning loop is: train → evaluate on holdout → diagnose failure modes → improve data → retrain. Good evaluation is what transforms trial-and-error into systematic improvement.
80-question multi-turn benchmark across 8 categories (writing, math, coding, etc.). GPT-4 scores each response 1–10.
Win rate of your model vs. a reference model (GPT-4o) as judged by GPT-4o. Fast automated evaluation of instruction-following quality.
Instruction-following accuracy on verifiable constraints (e.g., 'respond in fewer than 100 words'). Strict and loose scoring variants.
Code generation benchmarks. Pass@k metric: fraction of problems solved in k attempts. Ground-truth executable test cases.
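Pass@k is usually computed with the unbiased estimator from the HumanEval paper: sample n completions per problem, count the c correct ones, and estimate the chance that at least one of k draws succeeds.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n: total samples per problem, c: number that pass the tests."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))             # → 0.25
print(round(pass_at_k(n=20, c=5, k=10), 4))  # more attempts → higher pass rate
```

Averaging this quantity over all benchmark problems gives the reported pass@k score.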
import json
from openai import OpenAI
client = OpenAI()
def evaluate_response(question: str, answer: str, judge_model: str = "gpt-4o") -> dict:
prompt = f"""Rate the following AI assistant response on a scale of 1-10.
Question: {question}
Answer: {answer}
Evaluate: helpfulness (1-10), factuality (1-10), safety (1-10).
Return JSON: {{"helpfulness": N, "factuality": N, "safety": N, "rationale": "..."}}"""
response = client.chat.completions.create(
model=judge_model,
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
)
    return json.loads(response.choices[0].message.content)

| Run | Base Model | Method | Dataset | MT-Bench | AlpacaEval Win% | Notes |
|---|---|---|---|---|---|---|
| v1 | Llama-4-Scout | SFT | UltraChat 200K | 7.4 | 70% | Baseline |
| v2 | Llama-4-Scout | SFT+DPO | + UltraFeedback | 8.0 | 76% | +DPO improved safety |
| v3 | Llama-4-Scout | SFT+DPO (r=16) | + UltraFeedback | 8.1 | 77% | LoRA r=16 vs full FT |
Fine-tuning is powerful but not always the right tool. The decision depends on what you're trying to change: knowledge, behavior, format, or preferences. Choosing wrong costs weeks of engineering and compute.
| Scenario | Best Approach | Why |
|---|---|---|
| Need to ground answers in company docs | RAG | Knowledge can change; FT can't update easily |
| Want consistent tone/style | SFT | Tone is format, not knowledge |
| Domain-specific terminology usage | SFT + small data | Change default behavior cheaply |
| Need to handle specific output formats | SFT | Schema adherence is a learned skill |
| Reduce harmful outputs | DPO / RLHF | Preference alignment directly targets this |
| Need reasoning capabilities | GRPO or distill from R1 | Reasoning patterns are trainable |
| Add new factual knowledge | RAG (not FT) | FT memorizes, can't cite sources |
| Reduce API costs at scale | Fine-tune small model | Match big-model quality on narrow task |
| Prototype / quick experiment | Prompt engineering first | Zero training cost; validate concept first |
Start at the bottom. Only climb when the current level is genuinely insufficient — each step adds cost, complexity, and latency.
Whether you need a domain-specific assistant, aligned preference models, or distilled production deployments — our team has built and shipped them. Let's talk about your use case.