MLX-LoRA-Studio

ORPO

Family: Preference · Reference model: no · Judge: no · QAT: yes

Fold SFT and preference tuning into a single loss — no reference model, no SFT warm-up stage.

Overview

ORPO (Hong et al., 2024) combines the SFT cross-entropy and an odds-ratio preference term in one objective. There is no reference model and no SFT warm-up step: the same gradient that improves the model's next-token log-likelihood also pushes the chosen response up and the rejected one down. In mlx_lm_lora, each example may carry an optional preference_score, so heterogeneous preference strengths (e.g. UltraFeedback-style 0..10 scores) can be used as a soft target.

Intuition

The same loss reduces NLL on the chosen completion because the chosen log-prob term appears in both halves of the gradient. The optional preference_score rescales the chosen log-prob before the subtraction, so a row with preference_score = 0.3 contributes roughly a third of the gradient of a row with score 1.0.

Objective (math)

log_odds  =  log π_θ(y_c|x)  −  log π_θ(y_r|x)        (mean over tokens)

ℒ_ORPO    =  − β · log σ( log_odds )
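The objective above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the per-sequence log-probs have already been averaged over tokens and that preference_score multiplies the chosen log-prob before the subtraction, as described on this page. The function names are hypothetical and are not taken from mlx_lm_lora.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def orpo_loss(chosen_logp: float, rejected_logp: float,
              beta: float = 0.1, preference_score: float = 1.0) -> float:
    """Loss for one example: -beta * log sigmoid(log_odds).

    Assumption: chosen_logp and rejected_logp are token-mean log-probs,
    and preference_score rescales the chosen term before the subtraction.
    """
    log_odds = preference_score * chosen_logp - rejected_logp
    return -beta * log_sigmoid(log_odds)


def grad_wrt_chosen(chosen_logp: float, rejected_logp: float,
                    beta: float = 0.1, preference_score: float = 1.0) -> float:
    """Analytic d(loss)/d(chosen_logp) = -beta * score * (1 - sigmoid(log_odds))."""
    log_odds = preference_score * chosen_logp - rejected_logp
    sig = math.exp(log_sigmoid(log_odds))
    return -beta * preference_score * (1.0 - sig)


# The loss is lower when the chosen completion is the more likely one.
good = orpo_loss(chosen_logp=-0.5, rejected_logp=-2.0)
bad = orpo_loss(chosen_logp=-2.0, rejected_logp=-0.5)
```

The derivative is always negative, so a gradient step raises the chosen log-prob. The preference_score factor appears in it directly, which is where the per-example scaling of the gradient comes from.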

Dataset format

ORPO requires chosen and rejected columns and does not take a prompt column: the chosen and rejected strings are used verbatim, and the model is expected to learn the prompt–completion split on its own. An optional preference_score (float) column scales the per-example gradient.

The bundled default is mlx-community/Josiefied-Qwen3-dpo-v1-flat, a flattened DPO dataset.
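As an illustration of the column layout, the sketch below checks rows against the requirements stated in this section and serializes them as JSONL. The validate_row helper, the sample rows, and the choice of JSONL are assumptions made for the example; they are not part of mlx_lm_lora.

```python
import json


def validate_row(row: dict) -> dict:
    """Return a cleaned row or raise ValueError (hypothetical helper)."""
    cleaned = {}
    for key in ("chosen", "rejected"):
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"row needs a non-empty string in '{key}'")
        cleaned[key] = value
    if "preference_score" in row:
        score = row["preference_score"]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError("preference_score must be a number")
        cleaned["preference_score"] = float(score)
    return cleaned


# Each string already contains the prompt; there is no separate prompt column.
rows = [
    {"chosen": "Q: What is 2 + 2?\nA: 4",
     "rejected": "Q: What is 2 + 2?\nA: 5",
     "preference_score": 0.8},
    {"chosen": "Q: Capital of France?\nA: Paris",
     "rejected": "Q: Capital of France?\nA: Lyon"},
]

# One JSON object per line.
jsonl = "\n".join(json.dumps(validate_row(r)) for r in rows)
```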

When to use it

Use ORPO when you want a single-stage alternative to “SFT then DPO”. ORPO has been shown to work well on small chat models and is the natural choice for datasets that ship a per-example preference score (UltraFeedback-style score_chosen - score_rejected).
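For datasets that carry separate ratings, a preference_score can be derived from the rating margin. The sketch below shows one possible normalization, assuming ratings on a 0..10 scale and a target range of 0 to 1. The function name, the division by max_score, and the clipping are illustrative choices, not documented behavior of mlx_lm_lora.

```python
def preference_score_from_ratings(score_chosen: float,
                                  score_rejected: float,
                                  max_score: float = 10.0) -> float:
    """Map a rating margin to [0, 1] (hypothetical normalization)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    margin = (score_chosen - score_rejected) / max_score
    # Clip so inverted or out-of-range ratings cannot produce a score outside [0, 1].
    return min(1.0, max(0.0, margin))


# A chosen rating of 9 against a rejected rating of 4 gives a margin of 5 out of 10.
example_score = preference_score_from_ratings(9.0, 4.0)
```

Under this normalization, pairs with near-equal ratings get a score near zero, so it may be worth filtering such pairs out instead of training on them.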

ORPO-specific settings

In addition to the shared SFT substrate:

Setting | Default | What it actually changes
beta | 0.1 | Multiplier on the ORPO log-odds term.
reward_scaling | 1.0 | Reserved for future variants; the current ORPO loss does not use it as a separate multiplier.
chosen_feature / rejected_feature / preference_score_feature | auto | Column-name overrides. The ORPO trainer expects chosen, rejected, and optionally preference_score.
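The column-name overrides amount to renaming dataset columns to the names the trainer expects. The sketch below does this renaming by hand for a single row. The remap_columns helper and the sample column names (good, bad, margin) are hypothetical; the app applies the overrides itself, and its internal mechanism is not shown here.

```python
def remap_columns(row: dict,
                  chosen_feature: str = "chosen",
                  rejected_feature: str = "rejected",
                  preference_score_feature: str = "preference_score") -> dict:
    """Rename custom columns to chosen / rejected / preference_score."""
    out = {
        "chosen": row[chosen_feature],
        "rejected": row[rejected_feature],
    }
    # The score column is optional, so copy it only when present.
    if preference_score_feature in row:
        out["preference_score"] = float(row[preference_score_feature])
    return out


source_row = {"good": "Q: 1 + 1?\nA: 2", "bad": "Q: 1 + 1?\nA: 3", "margin": 0.4}
remapped = remap_columns(source_row, "good", "bad", "margin")
```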

In the app

On the Train tab, ORPO shows a Preference And Judge block on top of the shared form:

Shared form: Model & Data; Fine-tune (LoRA/DoRA/Full + Quantization); Training Settings; Output; QAT (ORPO supports QAT).

Tips & gotchas

References

See also