Family: Reinforcement / online · Reference model: yes · Judge: LLM or human · QAT: no
The textbook clipped policy-gradient objective applied to LMs — the most expressive and the most finicky loop.
PPO (Schulman et al., 2017) as applied to LMs is the textbook clipped policy optimisation. For each prompt the trainer samples two completions, asks the judge which is better, and treats them as (chosen, rejected). It then computes log-ratios against the reference model and minimises the clipped surrogate objective on both sequences, plus a KL penalty.
It is the most powerful and the most finicky of the loops: the epsilon clip range, the beta KL weight, and the judge quality all matter a lot.
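The sample-then-judge step described above can be sketched as follows. This is a minimal illustration, not the trainer's code: `collect_pair`, `sample`, and `judge` are hypothetical names, and the judge is assumed to return 0 or 1 to mark the better completion.

```python
import random
from typing import Callable, Tuple


def collect_pair(
    prompt: str,
    sample: Callable[[str], str],
    judge: Callable[[str, str, str], int],
) -> Tuple[str, str]:
    """Sample two completions and order them as (chosen, rejected).

    `sample(prompt)` is assumed to return one completion from the current
    policy; `judge(prompt, a, b)` is assumed to return 0 if `a` is better
    and 1 if `b` is better. Both are hypothetical stand-ins.
    """
    first, second = sample(prompt), sample(prompt)
    winner = judge(prompt, first, second)
    if winner not in (0, 1):
        raise ValueError("judge must return 0 or 1")
    return (first, second) if winner == 0 else (second, first)


# Toy stand-ins so the sketch runs on its own.
_rng = random.Random(0)


def toy_sample(prompt: str) -> str:
    return prompt + " " + "ok " * _rng.randint(1, 5)


def toy_judge(prompt: str, a: str, b: str) -> int:
    # Prefers the longer completion; a real judge would apply a rubric.
    return 0 if len(a) >= len(b) else 1


chosen, rejected = collect_pair("Say hi.", toy_sample, toy_judge)
```

Whatever the judge rewards is what the loss will push towards, which is why judge quality appears alongside epsilon and beta as a tuning concern.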
−min(ρ · A, clip(ρ, 1−ε, 1+ε) · A), with ρ = π_θ / π_ref and A derived from the chosen-rejected log-prob gap.
epsilon = 0.2 is the classic Schulman default; tightening it (e.g. 0.1) makes updates more conservative, loosening it (e.g. 0.3) lets the policy move further per step.
For a prompt, two completions, and a winner w, the advantage is not a reward-model output: it is derived from the log-prob gap between chosen and rejected, after the judge has decided which is which.
A = log π_θ(y_c) − log π_θ(y_r) (per-sequence advantage)
A_norm = ( A − mean ) / ( std + 1e−8 )
ρ_c = exp( log π_θ(y_c) − log π_ref(y_c) )
ρ_r = exp( log π_θ(y_r) − log π_ref(y_r) )
ℒ_surr = − min( ρ_c · A_norm, clip( ρ_c, 1−ε, 1+ε ) · A_norm )
         − min( ρ_r · (−A_norm), clip( ρ_r, 1−ε, 1+ε ) · (−A_norm) )
ℒ_PPO = ℒ_surr + β · ( mean( log π_θ − log π_ref ) )
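The equations above can be traced in plain Python. This is a sketch under stated assumptions, not the trainer's implementation: the function name `ppo_pair_loss` is hypothetical, inputs are sequence-level log-probs, the chosen and rejected surrogate terms are summed, the batch reduction is a mean, the standard deviation is the population one, and the KL term averages the log-ratio over both sequences.

```python
import math
from statistics import fmean, pstdev


def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_pair_loss(pol_c, pol_r, ref_c, ref_r, epsilon=0.2, beta=0.1):
    """Batch loss from lists of sequence-level log-probs (hypothetical helper)."""
    # A = log π_θ(y_c) − log π_θ(y_r), normalised across the batch.
    adv = [c - r for c, r in zip(pol_c, pol_r)]
    mean, std = fmean(adv), pstdev(adv)  # population std is an assumption
    adv_norm = [(a - mean) / (std + 1e-8) for a in adv]

    def surrogate(logp, ref_logp, a):
        # ρ = exp(log π_θ − log π_ref), then the negated clipped minimum.
        rho = math.exp(logp - ref_logp)
        return -min(rho * a, clip(rho, 1 - epsilon, 1 + epsilon) * a)

    # Chosen uses +A_norm, rejected uses −A_norm; summing them is an assumption.
    surr = [
        surrogate(pc, rc, a) + surrogate(pr, rr, -a)
        for pc, pr, rc, rr, a in zip(pol_c, pol_r, ref_c, ref_r, adv_norm)
    ]
    # KL penalty: mean( log π_θ − log π_ref ) over both sequences (assumption).
    log_ratios = [p - r for p, r in zip(pol_c + pol_r, ref_c + ref_r)]
    return fmean(surr) + beta * fmean(log_ratios)


loss = ppo_pair_loss(
    pol_c=[-1.0, -2.0], pol_r=[-3.0, -2.5],
    ref_c=[-1.2, -2.1], ref_r=[-2.8, -2.5],
)
```

When the policy equals the reference, every ρ is 1, the chosen and rejected terms cancel, and the KL term vanishes, so the loss is zero. That is a useful sanity check on the first step of a run.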
Same as the other online loops: only a prompt field at training time. Completions are sampled from the policy and the chosen/rejected split comes from the judge.
The bundled default is mlx-community/Human-Like-DPO.
The classic, the most expressive, and the most finicky. Reach for PPO when the other loops are under-performing on a metric the judge captures well, and you have time to tune epsilon and beta together. Always log clip_fraction — if it sits at 0% the policy is not moving, if it sits at >30% the policy is moving too far per step.
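One plausible way to compute the logged quantity is the share of sequences whose importance ratio falls outside the clip range. The sketch below assumes that definition; the trainer may compute clip_fraction differently, and both function names are hypothetical.

```python
import math


def clip_fraction(policy_logps, ref_logps, epsilon=0.2):
    """Fraction of sequences whose ratio ρ lies outside [1−ε, 1+ε]."""
    ratios = [math.exp(p - r) for p, r in zip(policy_logps, ref_logps)]
    clipped = [abs(rho - 1.0) > epsilon for rho in ratios]
    return sum(clipped) / len(clipped)


def diagnose(frac):
    """Map a clip fraction to the rule of thumb stated in the text."""
    if frac == 0.0:
        return "policy is not moving"
    if frac > 0.3:
        return "policy is moving too far per step"
    return "ok"


frac = clip_fraction(
    policy_logps=[-1.0, -1.0, -1.0, -1.0],
    ref_logps=[-1.0, -1.1, -1.5, -0.5],
)
print(frac, diagnose(frac))  # 0.5 policy is moving too far per step
```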
In addition to the shared SFT substrate:
| Setting | Default | What it actually changes |
|---|---|---|
| beta | 0.1 | KL regulariser weight, added to the clipped surrogate. |
| epsilon | 0.2 | PPO clip range for the importance ratio. The classic value; lower it for more conservative updates. |
| dpo_cpo_loss_type | sigmoid | Loss variant (the chosen/rejected split is the same; only the inner objective changes, and the runner passes it through). |
| delta | 50.0 | Drift-penalty coefficient for dpop. |
| judge | Qwen/Qwen3-0.6B | Pairwise judge (LLM or human). |
| judge_system | — | Rubric system prompt for the judge. |
| max_completion_length | 512 | Maximum sampled completion length. |
| temperature | 0.8 | Sampling temperature for the policy completions. |
| reference_model_path | — | Frozen reference used in the importance ratio. |
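For scripting or sanity checks, the defaults in the table can be mirrored in a small settings object. This is only an illustrative sketch: `PPOSettings` and its `validate` method are hypothetical and not part of the app, and the range checks are assumptions rather than rules the trainer enforces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PPOSettings:
    """Hypothetical container mirroring the defaults listed in the table."""

    beta: float = 0.1
    epsilon: float = 0.2
    dpo_cpo_loss_type: str = "sigmoid"
    delta: float = 50.0
    judge: str = "Qwen/Qwen3-0.6B"
    judge_system: Optional[str] = None
    max_completion_length: int = 512
    temperature: float = 0.8
    reference_model_path: Optional[str] = None

    def validate(self) -> None:
        # Assumed sanity ranges, not limits documented by the app.
        if not 0.0 < self.epsilon < 1.0:
            raise ValueError("epsilon should lie strictly between 0 and 1")
        if self.beta < 0.0:
            raise ValueError("beta should be non-negative")
        if self.temperature <= 0.0:
            raise ValueError("temperature should be positive")
        if self.max_completion_length <= 0:
            raise ValueError("max_completion_length should be positive")


settings = PPOSettings(epsilon=0.1)  # a more conservative clip range
settings.validate()
```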
On the Train tab, PPO shows an Online Preference block on top of the shared form:
- max_completion_length; Temp → temperature; Epsilon → epsilon (the PPO clip range; classic default 0.2).
- judge (model id / local path); User → judge = "human".
- judge_system (shown for the LLM judge; this is the rubric PPO will amplify, so make it precise).

Not exposed in the UI for online modes (use app defaults; edit via YAML): beta (0.1, KL regulariser weight), dpo_cpo_loss_type (sigmoid), delta (50.0), reference_model_path (empty ⇒ second copy of the base model, used in the importance ratio).
Shared form: Model & Data; Fine-tune (LoRA/DoRA/Full + Quantization); Training Settings; Output. (QAT is not applicable to online loops.)
- Monitor clip_fraction. If it is consistently > 0.3, the policy is moving too far per step: either lower lr or tighten epsilon to 0.1.
- epsilon = 0.2 is the original PPO default and is a fine starting point; do not lower it until you have seen the policy train for at least one full pass.
- A KL weight (beta) that is too small lets the policy drift far from the reference; a beta that is too large suppresses the policy before it learns anything. Start at 0.1 and adjust based on the kl_penalty log.
- Trainer source: vendor/mlx-lm-lora/mlx_lm_lora/trainer/ppo_trainer.py.