Family: Preference · Reference model: no · Judge: no · QAT: yes
Fold SFT and preference tuning into a single loss — no reference model, no SFT warm-up stage.
ORPO (Hong et al., 2024) folds the SFT cross-entropy and the odds-ratio preference term into a single loss. There is no reference model and no SFT warm-up step — the chosen response is pushed up and the rejected one pulled down by the same gradient that improves the model’s next-token log-likelihood. In mlx_lm_lora it accepts an optional preference_score per example so heterogeneous preference strengths (e.g. UltraFeedback-style 0..10 scores) can be used as a soft target.
log σ( log π(chosen) − log π(rejected) ). The same gradient that lowers NLL on the chosen completion also increases that gap.preference_score (default 1.0) lets the trainer scale the chosen log-prob per example, so a soft preference of 0.3 still gets a smaller push than a hard preference of 1.0.reward_scaling is accepted but the upstream implementation does not actually use it as a separate multiplier — it is reserved for future variants.The same loss reduces NLL on the chosen completion because the chosen log-prob term appears in both halves of the gradient. The optional preference_score rescales the chosen log-prob before the subtraction, so a row with preference_score = 0.3 contributes roughly a third of the gradient of a row with score 1.0.
log_odds = log π_θ(y_c|x) − log π_θ(y_r|x) (mean over tokens)
ℒ_ORPO = − β · log σ( log_odds )
ORPO requires chosen and rejected and no prompt column (the chosen and rejected strings are used verbatim, and the model is expected to learn the prompt–completion split on its own). Optionally a preference_score (float) column scales the per-example gradient.
The bundled default is mlx-community/Josiefied-Qwen3-dpo-v1-flat, a flattened DPO dataset.
When you want a single-stage alternative to “SFT then DPO”. ORPO has been shown to work well on small chat models and is the natural choice for datasets that ship a per-example preference score (UltraFeedback-style score_chosen - score_rejected).
In addition to the shared SFT substrate:
| Setting | Default | What it actually changes |
|---|---|---|
beta |
0.1 |
Multiplier on the ORPO log-odds term. |
reward_scaling |
1.0 |
Reserved for future variants — the current ORPO loss does not use it as a separate multiplier. |
chosen_feature / rejected_feature / preference_score_feature |
auto |
Column-name overrides. The ORPO trainer expects chosen, rejected, and optionally preference_score. |
On the Train tab, ORPO shows a Preference And Judge block on top of the shared form:
beta; Delta → delta; Reward Scale → reward_scaling (reserved — the current ORPO loss does not use it as a separate multiplier); Loss picker (Sigmoid / Hinge / IPO / DPOP) → dpo_cpo_loss_type.preference_score column).Shared form: Model & Data; Fine-tune (LoRA/DoRA/Full + Quantization); Training Settings; Output; QAT (ORPO supports QAT).
prompt column — ORPO silently concatenates the prompt into the chosen/rejected if you let it, which is rarely what you want.preference_score should be normalised to roughly 0..1 before training; very large scores push the loss into saturated regions of log_sigmoid.vendor/mlx-lm-lora/mlx_lm_lora/trainer/orpo_trainer.py.