Family: Preference · Reference model: yes · Judge: no · QAT: yes
Train directly on human preference pairs with a closed-form loss — no reward model, no sampling loop.
Direct Preference Optimisation (Rafailov et al., 2023) trains the model directly on human preference pairs (chosen, rejected) for the same prompt. There is no reward model and no sampling loop — just a closed-form loss that pulls the policy’s log-probability of the chosen completion up and the rejected completion down, regularised by a frozen reference model so the policy does not drift.
The reference is normally the base model you started from. Loading a different one (e.g. an instruction-tuned SFT checkpoint) shifts the implicit reward baseline and is a common knob for steering the resulting behaviour.
The implicit reward margin is β · log(π_θ(chosen) / π_ref(chosen)) − β · log(π_θ(rejected) / π_ref(rejected)). A larger β makes the loss more aggressive, a smaller one softer.

loss_type = "sigmoid" is the original DPO; hinge is a margin-style loss; ipo regularises toward a constant target (more robust to noise); dpop adds an explicit reference-drift penalty scaled by delta.

The loss for a single preference pair (y_c, y_r) and prompt x is given below. β is the temperature: higher β pushes the policy harder toward the implicit reward. delta (δ in the formula) is only used in dpop and controls how much drift from the reference is penalised.
    logits = ( log π_θ(y_c|x) − log π_θ(y_r|x) )
           − ( log π_ref(y_c|x) − log π_ref(y_r|x) )

    ℒ_DPO = − log σ( β · logits )                                  (sigmoid)
          = max( 0, 1 − β · logits )                               (hinge)
          = ( logits − 1/(2β) )²                                   (ipo)
          = − log σ( β · logits )
            + δ · max( 0, log π_ref(y_c|x) − log π_θ(y_c|x) )      (dpop)
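The four variants above can be sketched in plain Python for a single pair. This is a minimal illustration of the formulas, not the trainer's code: the function name `dpo_loss` and its argument names are hypothetical, and the inputs are assumed to be sequence log-probabilities that have already been summed over tokens.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_c, pol_r, ref_c, ref_r, beta=0.1, loss_type="sigmoid", delta=50.0):
    """Loss for one preference pair (hypothetical helper, for illustration).

    pol_c / pol_r: log π_θ of the chosen / rejected completion.
    ref_c / ref_r: log π_ref of the chosen / rejected completion.
    """
    logits = (pol_c - pol_r) - (ref_c - ref_r)
    if loss_type == "sigmoid":
        return -log_sigmoid(beta * logits)
    if loss_type == "hinge":
        return max(0.0, 1.0 - beta * logits)
    if loss_type == "ipo":
        return (logits - 1.0 / (2.0 * beta)) ** 2
    if loss_type == "dpop":
        # Penalise only when the policy rates the chosen answer below the reference.
        penalty = max(0.0, ref_c - pol_c)
        return -log_sigmoid(beta * logits) + delta * penalty
    raise ValueError(f"unknown loss_type: {loss_type!r}")


# At initialisation the policy equals the reference, so logits = 0.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # sigmoid loss equals log 2 here
```

When the policy and reference agree, the sigmoid loss starts at log 2 regardless of β, while the ipo loss starts at (1/(2β))², which is why the two variants report very different absolute loss values early in training.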
DPO expects a preference dataset with one row per preference, containing at minimum chosen and rejected. The prompt column is optional but recommended; if present it is prepended to both completions before tokenisation.
The bundled default is mlx-community/Human-Like-DPO, which has the prompt/chosen/rejected shape.
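As a rough illustration of the prompt/chosen/rejected shape, the sketch below parses one JSONL row and prepends the prompt to both completions. The row contents are made up, `to_pair` is a hypothetical helper, and plain string concatenation is an assumption; the trainer may apply a chat template or a different separator.

```python
import json

# One JSONL line in the prompt/chosen/rejected shape (contents are invented).
line = json.dumps({
    "prompt": "What is the capital of France?\n",
    "chosen": "Paris.",
    "rejected": "France has many cities.",
})


def to_pair(row: dict) -> tuple[str, str]:
    """Return (chosen, rejected) texts, with the prompt prepended if present."""
    if "chosen" not in row or "rejected" not in row:
        raise KeyError("a preference row needs both 'chosen' and 'rejected'")
    prompt = row.get("prompt") or ""  # the prompt column is optional
    return prompt + row["chosen"], prompt + row["rejected"]


chosen_text, rejected_text = to_pair(json.loads(line))
```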
Use DPO after SFT, when you have a static preference dataset (UltraFeedback, HelpSteer, Anthropic HH). DPO is the cheapest preference algorithm: one forward pass per completion on the policy and one on the reference.
Use loss_type=ipo if you have noisy or contradictory labels, loss_type=dpop if the policy starts drifting from the reference in ways the SFT loss did not catch.
In addition to the shared SFT substrate:
| Setting | Default | What it actually changes |
|---|---|---|
| beta | 0.1 | Temperature inside the DPO sigmoid. Lower = softer updates, higher = more aggressive divergence from the reference. |
| dpo_cpo_loss_type | sigmoid | sigmoid (vanilla DPO), hinge (margin), ipo (squared error around 1/(2β)), dpop (reference-drift penalty). |
| delta | 50.0 | Coefficient on the dpop reference-drift penalty. Ignored unless dpo_cpo_loss_type = dpop. |
| reference_model_path | — | Path/HF id of the frozen reference. If empty, the trainer instantiates a second copy of the base model (doubles GPU memory). |
| prompt_feature / chosen_feature / rejected_feature | auto | Column-name overrides for the preference schema. |
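The defaults in the table can be collected into a small settings helper, sketched below. The dictionary and `resolve_dpo_settings` are illustrative only and are not the trainer's configuration API; the key names and default values come from the table, while the check that beta is positive is an added assumption (the ipo target 1/(2β) is undefined at β = 0).

```python
# Defaults as listed in the settings table; the dict itself is illustrative.
DPO_DEFAULTS = {
    "beta": 0.1,
    "dpo_cpo_loss_type": "sigmoid",
    "delta": 50.0,
    "reference_model_path": None,  # empty: a second copy of the base model is used
}
LOSS_TYPES = ("sigmoid", "hinge", "ipo", "dpop")


def resolve_dpo_settings(overrides: dict) -> dict:
    """Merge overrides onto the defaults and run basic sanity checks."""
    unknown = set(overrides) - set(DPO_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown DPO settings: {sorted(unknown)}")
    settings = {**DPO_DEFAULTS, **overrides}
    if settings["dpo_cpo_loss_type"] not in LOSS_TYPES:
        raise ValueError(f"dpo_cpo_loss_type must be one of {LOSS_TYPES}")
    if settings["beta"] <= 0:  # assumption: beta is expected to be positive
        raise ValueError("beta must be positive")
    return settings


# Example: switch to dpop with a smaller drift penalty.
settings = resolve_dpo_settings({"dpo_cpo_loss_type": "dpop", "delta": 10.0})
```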
On the Train tab, DPO shows a Preference And Judge block on top of the shared form:
Preference And Judge: Beta → beta; Delta → delta (used only for dpop); Loss picker (Sigmoid / Hinge / IPO / DPOP) → dpo_cpo_loss_type; Reference model path → reference_model_path (empty ⇒ second copy of the base model in memory).

Shared form: Model & Data; Fine-tune (LoRA/DoRA/Full + Quantization); Training Settings (LR, Optimizer, schedule, batch, iters, grad accumulation, etc.); Output; QAT (DPO supports QAT).
Set reference_model_path to the SFT checkpoint, not the original base model; the implicit reward is then Δ log-prob vs the SFT model, which is what your labels reflect.

Watch accuracies and margins in the live metrics. accuracies > 0.7 with margins > 0 means the loss is doing real work; if margins plateaus, raise beta slightly.

If training goes wrong with loss_type=dpop, delta may be too large for the current lr. Drop delta to 10 first.

efficient_long_context applies here too: preference datasets with long answers benefit from chunked forward passes.

Implementation: vendor/mlx-lm-lora/mlx_lm_lora/trainer/dpo_trainer.py.
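To make the accuracies and margins metrics concrete, the sketch below computes them from per-pair log-probabilities. It assumes the usual DPO definitions (accuracy is the fraction of pairs whose chosen implicit reward exceeds the rejected one; margin is the mean difference between the two rewards); the trainer's exact computation may differ, and `reward_metrics` is a hypothetical name.

```python
def reward_metrics(pairs, beta=0.1):
    """Return (accuracy, mean_margin) for a batch of preference pairs.

    pairs: list of (pol_c, pol_r, ref_c, ref_r) sequence log-probabilities.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    margins = []
    for pol_c, pol_r, ref_c, ref_r in pairs:
        chosen_reward = beta * (pol_c - ref_c)      # implicit reward of chosen
        rejected_reward = beta * (pol_r - ref_r)    # implicit reward of rejected
        margins.append(chosen_reward - rejected_reward)
    accuracy = sum(m > 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin


batch = [
    (-10.0, -12.0, -11.0, -11.0),  # policy already prefers chosen
    (-8.0, -9.0, -8.0, -9.0),      # identical to the reference: zero margin
    (-5.0, -4.0, -5.0, -5.0),      # policy prefers rejected
    (-7.0, -9.0, -8.0, -8.0),      # policy prefers chosen
]
accuracy, mean_margin = reward_metrics(batch)
```

A pair where the policy matches the reference contributes a zero margin and does not count toward accuracy, so both metrics sit near their floor at the start of training and should rise as the policy separates chosen from rejected.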