MLX-LoRA-Studio

Contrastive Preference Optimisation (CPO)

Family: Preference · Reference model: no · Judge: no · QAT: yes

DPO with the reference term dropped — faster, lighter, no second model in memory, at the cost of more sensitive beta/delta tuning.

Overview

CPO (Contrastive Preference Optimisation) is DPO with the reference term dropped. The chosen-rejected log-prob gap is compared against an absolute target instead of one measured relative to a reference model, so training needs no reference forward pass and nothing anchors the policy to the base model. It is faster to train and uses less memory than DPO, at the cost of being more sensitive to the beta and delta settings.

Intuition

Objective (math)

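The main loss has the same shape as DPO's, with the reference term removed. Because the reference model is not in the forward pass, the dpop variant of CPO substitutes the policy log-prob for the reference log-prob in its drift penalty. In the formulas below, y_c and y_r are the chosen and rejected responses for input x, σ is the logistic sigmoid, β is the beta setting, and δ is the delta setting.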

logits  =  log π_θ(y_c|x)  −  log π_θ(y_r|x)

ℒ_CPO   =  − log σ( β · logits )                                  (sigmoid)
         =  max( 0,  1 − β · logits )                             (hinge)
         =  ( logits − 1/(2β) )²                                  (ipo)
         =  − log σ( β · logits )
            + δ · max( 0,  log π_θ(y_r|x) − log π_θ(y_c|x) )      (dpop)
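
The four variants above can be written out for a single preference pair as follows. This is a minimal sketch in plain Python, not the app's implementation: the function names cpo_loss and log_sigmoid are hypothetical, and the inputs are assumed to be summed sequence log-probs that the policy has already computed.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def cpo_loss(logp_chosen: float, logp_rejected: float,
             beta: float = 0.1, delta: float = 50.0,
             loss_type: str = "sigmoid") -> float:
    """Per-example CPO loss, transcribing the formulas above.

    logp_chosen / logp_rejected are log pi_theta(y_c|x) and
    log pi_theta(y_r|x); no reference-model term appears anywhere.
    """
    logits = logp_chosen - logp_rejected
    if loss_type == "sigmoid":
        return -log_sigmoid(beta * logits)
    if loss_type == "hinge":
        return max(0.0, 1.0 - beta * logits)
    if loss_type == "ipo":
        return (logits - 1.0 / (2.0 * beta)) ** 2
    if loss_type == "dpop":
        # Sigmoid loss plus the policy-side drift penalty, scaled by delta.
        penalty = max(0.0, logp_rejected - logp_chosen)
        return -log_sigmoid(beta * logits) + delta * penalty
    raise ValueError(f"unknown loss_type: {loss_type!r}")
```

With the default delta of 50.0, the dpop penalty dominates the loss as soon as the rejected response is more likely than the chosen one, which is one reason the delta setting needs care.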

Dataset format

Same as DPO: one row per preference pair, with a chosen and a rejected column. CPO does not use the prompt column.
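
As an illustration of the row shape, the sketch below parses one preference row and keeps only the two columns CPO reads. It assumes the data is stored as JSON Lines, which the text above does not state; the helper name parse_preference_row is hypothetical, and its two keyword arguments merely mimic the column-name overrides.

```python
import json


def parse_preference_row(line: str,
                         chosen_feature: str = "chosen",
                         rejected_feature: str = "rejected") -> dict:
    """Parse one JSON line and return its chosen/rejected pair.

    Any other columns (for example a prompt column) are ignored.
    """
    row = json.loads(line)
    missing = [k for k in (chosen_feature, rejected_feature) if k not in row]
    if missing:
        raise ValueError(f"row is missing columns: {missing}")
    return {"chosen": row[chosen_feature], "rejected": row[rejected_feature]}


# Example row; the text is made up for illustration.
example = '{"chosen": "Paris is the capital of France.", "rejected": "Lyon."}'
pair = parse_preference_row(example)
```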

When to use it

Use CPO for the same tasks as DPO when you cannot afford to keep a reference model on the GPU alongside the policy, or when you want a more aggressive update. Pair it with a slightly smaller beta than you would use for DPO.

CPO-specific settings

In addition to the shared SFT substrate:

| Setting | Default | What it actually changes |
| --- | --- | --- |
| beta | 0.1 | Temperature inside the CPO sigmoid. |
| dpo_cpo_loss_type | sigmoid | Loss variant. CPO accepts the same four options as DPO; dpop here is the policy-side drift penalty. |
| delta | 50.0 | Coefficient for the CPO drift penalty. Only used with dpo_cpo_loss_type = dpop. |
| chosen_feature / rejected_feature | auto | Column-name overrides. CPO does not need a separate prompt column. |
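
The defaults above can be gathered into a small settings object for scripting or sanity checks. This is a sketch under assumptions: the setting names and defaults come from the table, but the dictionary layout and the helpers resolve_cpo_settings and delta_is_used are hypothetical and do not describe how the app stores its configuration.

```python
CPO_DEFAULTS = {
    "beta": 0.1,
    "dpo_cpo_loss_type": "sigmoid",
    "delta": 50.0,
    "chosen_feature": "auto",
    "rejected_feature": "auto",
}

# The four loss variants listed in the Objective section.
LOSS_TYPES = ("sigmoid", "hinge", "ipo", "dpop")


def resolve_cpo_settings(overrides=None) -> dict:
    """Merge user overrides onto the defaults and validate them."""
    overrides = overrides or {}
    unknown = set(overrides) - set(CPO_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown CPO settings: {sorted(unknown)}")
    settings = {**CPO_DEFAULTS, **overrides}
    if settings["dpo_cpo_loss_type"] not in LOSS_TYPES:
        raise ValueError(f"loss type must be one of {LOSS_TYPES}")
    if settings["beta"] <= 0:
        # The ipo target 1/(2*beta) is undefined at beta = 0.
        raise ValueError("beta must be positive")
    return settings


def delta_is_used(settings: dict) -> bool:
    """delta only enters the loss for the dpop variant."""
    return settings["dpo_cpo_loss_type"] == "dpop"
```

For example, resolve_cpo_settings({"dpo_cpo_loss_type": "dpop", "beta": 0.05}) keeps the default delta of 50.0, and delta_is_used reports that it will take effect.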

In the app

On the Train tab, CPO shows a Preference And Judge block on top of the shared form:

Shared form: Model & Data; Fine-tune (LoRA/DoRA/Full + Quantization); Training Settings; Output; QAT (CPO supports QAT).

Tips & gotchas

References

See also