MLX-LoRA-Studio

Online DPO

Family: Reinforcement / online · Reference model: yes · Judge: LLM or human · QAT: no

DPO where the chosen and rejected completions are sampled at training time and labelled by a judge — no stale off-policy dataset.

Overview

Online DPO is DPO where the chosen and rejected completions are sampled at training time instead of read from a static dataset. For every prompt the trainer draws two completions, asks a judge (an LLM or a human) which one is better, treats that as the preference pair, and runs the DPO loss on the fly.

The judge can be human (you label pairs interactively), or a Hugging Face model identifier / local path that the runner loads and prompts with a pairwise system prompt.

Intuition

Objective (math)

For each prompt x, sample two completions (y_1, y_2), judge picks the winner w ∈ {0, 1}. The temperature and judge system prompt together control how informative the labels are; beta and loss_type are identical to DPO.

y_chosen   =  y_w
y_rejected =  y_{1−w}

ℒ  =  DPO loss on ( y_chosen, y_rejected )    — see DPO math

Dataset format

Online loops only need a prompt field at training time. The completions are sampled from the policy itself, and the (chosen, rejected) pair comes from the judge, not from the dataset.

The bundled default is mlx-community/Human-Like-DPO for its small size and standard prompt column.

When to use it

When you have a strong LLM judge (or a human in the loop) and a base prompt distribution you can keep sampling from. Online DPO avoids the off-policy gap of static DPO and works well with iterative refinement of the same model.

Online-DPO-specific settings

In addition to the shared SFT substrate:

Setting Default What it actually changes
beta 0.1 DPO temperature.
dpo_cpo_loss_type sigmoid Loss variant. Same four options as DPO.
delta 50.0 Drift-penalty coefficient for dpop loss.
judge Qwen/Qwen3-0.6B HF id / local path of the judge LLM, or the literal string human. With human the runner pauses and asks you to label each pair.
judge_system System prompt sent to the LLM judge. Treat this as the rubric — short, specific, concrete.
max_completion_length 512 Max tokens sampled per completion in the in-loop generation.
temperature 0.8 Sampling temperature for the policy. Lower ⇒ both completions look more similar ⇒ harder comparisons for the judge.
reference_model_path Path/HF id of the frozen reference. Empty ⇒ second copy of the base model.

In the app

On the Train tab, Online DPO shows an Online Preference block on top of the shared form:

Not exposed in the UI for online modes (they use app defaults; edit via YAML if needed): beta (default 0.1), dpo_cpo_loss_type (default sigmoid), delta (default 50.0), reference_model_path (default empty ⇒ second copy of the base model).

Shared form: Model & Data; Fine-tune (LoRA/DoRA/Full + Quantization); Training Settings; Output. (QAT is not applicable to online loops.)

Tips & gotchas

References

See also