Family: Preference · Reference model: no · Judge: no · QAT: yes
DPO with the reference term dropped: faster, lighter, no second model in memory, at the cost of more sensitive beta/delta tuning.
CPO (Contrastive Preference Optimisation) is DPO with the reference term dropped. The chosen-rejected log-prob gap is compared against an absolute target instead of a relative one — which means the policy can move further from the base model without a reference forward pass. It is faster to train and uses less memory, at the cost of being more sensitive to the beta/delta knobs.
loss_type = "dpop" adds a hinge penalty of the form max(0, ref − π), scaled by delta, to keep that drift bounded. Because the reference is not in the forward pass, the CPO variant substitutes the policy log-prob for the reference in the drift penalty. Otherwise the loss has the same shape as DPO without the reference term in the main loss.
logits = log π_θ(y_c|x) − log π_θ(y_r|x)
ℒ_CPO = − log σ( β · logits ) (sigmoid)
= max( 0, 1 − β · logits ) (hinge)
= ( logits − 1/(2β) )² (ipo)
= − log σ( β · logits ) + δ · max( 0, log π_θ(y_r|x) − log π_θ(y_c|x) ) (dpop)
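The four variants above can be sketched for a single preference pair in plain Python. This is a minimal illustration of the formulas as written here, not the trainer's code: the names `cpo_loss` and `log_sigmoid` are hypothetical, and the inputs are assumed to be summed sequence log-probs under the policy.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def cpo_loss(logp_chosen: float, logp_rejected: float,
             beta: float = 0.1, delta: float = 50.0,
             loss_type: str = "sigmoid") -> float:
    """Per-pair CPO loss for one of: sigmoid, hinge, ipo, dpop."""
    # Chosen-minus-rejected log-prob gap under the policy (no reference term).
    logits = logp_chosen - logp_rejected
    if loss_type == "sigmoid":
        return -log_sigmoid(beta * logits)
    if loss_type == "hinge":
        return max(0.0, 1.0 - beta * logits)
    if loss_type == "ipo":
        return (logits - 1.0 / (2.0 * beta)) ** 2
    if loss_type == "dpop":
        # Policy-side drift penalty: active only when the rejected
        # response is more likely than the chosen one.
        penalty = max(0.0, logp_rejected - logp_chosen)
        return -log_sigmoid(beta * logits) + delta * penalty
    raise ValueError(f"unknown loss_type: {loss_type!r}")
```

When the chosen response is already more likely than the rejected one, the dpop penalty is zero and the value matches the sigmoid variant.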
Same as DPO: one row per preference with chosen and rejected. CPO does not use the prompt column.
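As an illustration of that row shape, the sketch below parses one preference row. Several things here are assumptions rather than documented behaviour: the JSON Lines storage format, the default column names `chosen` and `rejected`, the helper name `parse_preference_row`, and the idea that each text carries its own prompt because no separate prompt column is read.

```python
import json

# Assumed default column names; the settings table allows overriding them.
REQUIRED_COLUMNS = ("chosen", "rejected")


def parse_preference_row(line: str) -> tuple[str, str]:
    """Parse one JSON Lines row and return (chosen, rejected)."""
    row = json.loads(line)
    missing = [col for col in REQUIRED_COLUMNS if col not in row]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return row["chosen"], row["rejected"]


# Hypothetical row: the prompt text is embedded in both completions.
example_line = json.dumps({
    "chosen": "Q: What is 2 + 2?\nA: 4",
    "rejected": "Q: What is 2 + 2?\nA: 5",
})
chosen, rejected = parse_preference_row(example_line)
```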
Same use case as DPO when you cannot afford a second reference model on the GPU, or when you want a more aggressive update. Pair with a slightly smaller beta than DPO would use.
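A toy calculation shows why beta controls how hard each update pulls. For the sigmoid variant, the derivative of −log σ(β · logits) with respect to logits is −β · σ(−β · logits), so at logits = 0 its magnitude is β/2. The function name `sigmoid_loss_grad` is hypothetical, and this is only a sketch of the per-pair gradient scale, not a description of the optimiser's behaviour.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_loss_grad(logits: float, beta: float) -> float:
    """d/d(logits) of -log(sigmoid(beta * logits))."""
    return -beta * sigmoid(-beta * logits)


# Gradient at the start of training, when chosen and rejected are tied.
grad_default = sigmoid_loss_grad(0.0, beta=0.1)
grad_smaller = sigmoid_loss_grad(0.0, beta=0.05)
```

Halving beta halves the initial gradient magnitude, which is the sense in which a smaller beta gives a gentler update.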
In addition to the shared SFT substrate:
| Setting | Default | What it actually changes |
|---|---|---|
| beta | 0.1 | Temperature inside the CPO sigmoid. |
| dpo_cpo_loss_type | sigmoid | Loss variant. CPO accepts the same four options as DPO; dpop here is the policy-side drift penalty. |
| delta | 50.0 | Coefficient for the CPO drift penalty. Only used with loss_type = dpop. |
| chosen_feature / rejected_feature | auto | Column-name overrides. CPO does not need a separate prompt column. |
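The settings and defaults in the table can be collected into a small validated dictionary. The keys and default values come from the table; everything else is an assumption for illustration, including the helper names `resolve_cpo_settings` and `delta_is_used` and the way settings would actually be passed to the trainer.

```python
# Defaults copied from the settings table.
CPO_DEFAULTS = {
    "beta": 0.1,
    "dpo_cpo_loss_type": "sigmoid",
    "delta": 50.0,
    "chosen_feature": "auto",
    "rejected_feature": "auto",
}
LOSS_TYPES = {"sigmoid", "hinge", "ipo", "dpop"}


def resolve_cpo_settings(overrides=None):
    """Merge user overrides onto the defaults and sanity-check them."""
    overrides = overrides or {}
    unknown = set(overrides) - set(CPO_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    settings = {**CPO_DEFAULTS, **overrides}
    if settings["dpo_cpo_loss_type"] not in LOSS_TYPES:
        raise ValueError(f"unsupported loss type: {settings['dpo_cpo_loss_type']!r}")
    if settings["beta"] <= 0:
        raise ValueError("beta must be positive")
    return settings


def delta_is_used(settings) -> bool:
    """delta only affects the loss when the dpop variant is selected."""
    return settings["dpo_cpo_loss_type"] == "dpop"


# Example: a smaller beta with the drift penalty enabled.
settings = resolve_cpo_settings({"beta": 0.05, "dpo_cpo_loss_type": "dpop"})
```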
On the Train tab, CPO shows a Preference And Judge block on top of the shared form:
Preference And Judge: Beta → beta; Delta → delta (used only for dpop); Loss picker (Sigmoid / Hinge / IPO / DPOP) → dpo_cpo_loss_type.
Shared form: Model & Data; Fine-tune (LoRA/DoRA/Full + Quantization); Training Settings; Output; QAT (CPO supports QAT).
Use a smaller beta than DPO would use (start at 0.05): without the reference term the gradient is unanchored, and a large beta overshoots.
If the policy drifts too far, switch to dpop and add a delta to keep drift bounded.
Trainer: vendor/mlx-lm-lora/mlx_lm_lora/trainer/cpo_trainer.py.