MLX-LoRA-Studio

XPO (Exploratory Preference Optimisation)

Family: Reinforcement / online · Reference model: yes · Judge: LLM or human · QAT: no

Online DPO plus an explicit exploration bonus proportional to the KL from the reference — rewards trying completions that differ from the reference.

Overview

XPO (Exploratory Preference Optimisation) is the online-DPO family plus an explicit exploration bonus proportional to the KL divergence between the current policy and the reference. Setting alpha > 0 rewards the model for trying completions that are different from the reference, which mitigates the “stuck on the reference completion” pathology of plain online DPO. alpha can be a single float or a list (one per epoch) so the bonus can decay over training.

Intuition

Objective (math)

XPO loss is DPO with an additive exploration bonus proportional to the KL. A positive alpha means reward me for being different from the reference. When alpha is a list, get_current_alpha(step, total, schedule) returns the schedule element for the current step (one entry per epoch, with the last entry held to the end).

bonus    =  α · ( KL(π_θ ∥ π_ref) on chosen  +  KL(π_θ ∥ π_ref) on rejected )

ℒ_XPO    =  ℒ_DPO  −  bonus

Dataset format

Same as Online DPO: only a prompt field at training time; completions are sampled and judged on the fly.

When to use it

When online DPO collapses toward the reference completion and you want a controlled amount of exploration. Try alpha=1e-4 first, then either hold it constant or schedule it down across epochs.

XPO-specific settings

In addition to the shared SFT substrate:

Setting Default What it actually changes
beta 0.1 DPO temperature.
alpha 1e-5 Exploration-bonus coefficient. Single float = constant. A list (e.g. [1e-4, 1e-5, 1e-6]) gives a per-epoch schedule that decays.
dpo_cpo_loss_type sigmoid Base DPO loss variant.
delta 50.0 Drift-penalty coefficient for dpop.
judge Qwen/Qwen3-0.6B LLM or human for the pairwise judge.
judge_system Rubric system prompt for the judge.
max_completion_length 512 Maximum sampled completion length.
temperature 0.8 Sampling temperature for the policy.
reference_model_path Path/HF id of the frozen reference model.

In the app

On the Train tab, XPO shows an Online Preference block on top of the shared form:

Not exposed in the UI for online modes (use app defaults; edit via YAML): beta (0.1), dpo_cpo_loss_type (sigmoid), delta (50.0), reference_model_path (empty ⇒ second copy of the base model).

Shared form: Model & Data; Fine-tune (LoRA/DoRA/Full + Quantization); Training Settings; Output. (QAT is not applicable to online loops.)

Tips & gotchas

References

See also