MLX-LoRA-Studio

RLHF-REINFORCE

Family: Reinforcement / online · Reference model: optional · Judge: LLM · QAT: no

The classic policy-gradient RLHF loop — conceptually simplest, highest variance, no clipping, no value head.

Overview

RLHF-REINFORCE is the classic policy-gradient RLHF loop. The trainer samples completions from the policy, scores them with a scalar reward (an LLM judge, normally configured via the judge and judge_system settings), and applies a per-token REINFORCE objective regularised by an optional KL penalty against a reference model.

It is conceptually the simplest of the online algorithms (no clipping, no value head) but it has the highest variance, so it benefits from smaller learning rates and longer KL warm-up than DPO/PPO.
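The sample-then-score half of the loop described above can be sketched as follows. This is a minimal illustration, not the trainer's actual code: `collect_rollouts`, `sample_fn`, and `judge_fn` are hypothetical names standing in for the policy sampler and the LLM judge.

```python
from typing import Callable, List, Sequence, Tuple

# One rollout is (prompt, completion, scalar reward).
Rollout = Tuple[str, str, float]


def collect_rollouts(
    prompts: Sequence[str],
    sample_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], float],
) -> List[Rollout]:
    """Sample one completion per prompt and score it with the judge.

    sample_fn and judge_fn are hypothetical stand-ins for the policy
    sampler and the LLM judge; the judge is called exactly once per
    (prompt, completion) pair.
    """
    rollouts: List[Rollout] = []
    for prompt in prompts:
        completion = sample_fn(prompt)          # drawn from the current policy
        reward = judge_fn(prompt, completion)   # scalar reward from the judge
        rollouts.append((prompt, completion, float(reward)))
    return rollouts


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a model.
    demo = collect_rollouts(
        ["Say hi."],
        sample_fn=lambda p: "Hi there!",
        judge_fn=lambda p, c: 1.0 if "hi" in c.lower() else 0.0,
    )
    print(demo)
```

Each rollout then feeds the REINFORCE update; because there is no value head, the scalar reward is used directly rather than a learned baseline.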

Intuition

Objective (math)

For a prompt x and a sampled completion y with judge reward R, the per-token advantage and the loss are defined as follows. There is no clipping and no value head. β is the KL weight inside the advantage, not the weight of a separate regulariser.

A_t            =  R  −  β · KL_t                              (per-token advantage)

ℒ_REINFORCE    =  − ∑_t  A_t · log π_θ(y_t | x, y_<t)
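The two formulas above can be checked numerically with a small sketch. It makes several assumptions that the page does not confirm: KL_t is estimated per token as log π_θ(y_t) − log π_ref(y_t), the advantage is treated as a constant (no gradient flows through it), and plain Python floats stand in for MLX arrays. The function name `reinforce_loss` is hypothetical.

```python
from typing import Sequence


def reinforce_loss(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    reward: float,
    beta: float = 0.1,
) -> float:
    """Return -sum_t A_t * log pi(y_t | x, y_<t), with A_t = R - beta * KL_t."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference log-probs must align per token")

    loss = 0.0
    for logp, ref_logp in zip(policy_logprobs, ref_logprobs):
        # Assumption: single-sample per-token KL estimate.
        kl_t = logp - ref_logp
        # Per-token advantage; in a real trainer this would be a
        # stop-gradient quantity.
        advantage = reward - beta * kl_t
        loss -= advantage * logp
    return loss


if __name__ == "__main__":
    # Policy equals reference, so KL_t = 0 and A_t = R for every token.
    print(reinforce_loss([-1.0, -2.0], [-1.0, -2.0], reward=1.0))
```

Setting `beta=0` removes the KL term entirely, which matches the documented behaviour of the `beta` setting.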

Dataset format

Same as the other online loops: only a prompt field at training time. The completion is sampled from the policy and the scalar reward comes from the judge.

The bundled default dataset is mlx-community/Human-Like-DPO.
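A prompt-only record set can be validated with a short sketch like the one below. It assumes the data is stored as JSON Lines, which this page does not state; only the `prompt` field name comes from the text above. The helper `load_prompts` is hypothetical, and any extra fields in a record are simply ignored here.

```python
import json
from typing import Iterable, List


def load_prompts(lines: Iterable[str]) -> List[str]:
    """Parse JSONL records and return their non-empty prompt strings."""
    prompts: List[str] = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        prompt = record.get("prompt") if isinstance(record, dict) else None
        if not isinstance(prompt, str) or not prompt.strip():
            raise ValueError(f"line {number}: missing or empty 'prompt' field")
        prompts.append(prompt)
    return prompts


if __name__ == "__main__":
    sample = [
        '{"prompt": "Explain LoRA in one sentence."}',
        '{"prompt": "What is a KL penalty?", "note": "extra fields are ignored"}',
    ]
    print(load_prompts(sample))
```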

When to use it

Use it for educational or minimal RLHF setups. With a modern LLM judge, REINFORCE is competitive with PPO for short completions and is much simpler to debug. It is also the algorithm most sensitive to judge quality and learning rate, so start with lr=5e-6 and beta=0.05 (half the default beta of 0.1).

RLHF-REINFORCE-specific settings

In addition to the shared SFT substrate:

| Setting | Default | What it actually changes |
| --- | --- | --- |
| beta | 0.1 | Coefficient on the KL term in the per-token advantage (A = R − β · KL). 0 disables KL regularisation. |
| judge | Qwen/Qwen3-0.6B | LLM judge that produces the scalar reward. Loaded once, called once per (prompt, completion). |
| max_completion_length | 128 | Max tokens sampled per completion. REINFORCE is variance-sensitive, so shorter completions usually help. |
| reference_model_path | (empty) | Frozen reference used to compute the per-token KL. Empty ⇒ second copy of the base model. |
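For scripting or sanity checks, the settings above can be mirrored in a small container. This is an illustrative sketch only: the class `ReinforceSettings` and its validation rules are assumptions, not part of the app. The field names and defaults are taken from the settings listed above.

```python
from dataclasses import dataclass


@dataclass
class ReinforceSettings:
    """Hypothetical mirror of the RLHF-REINFORCE-specific settings."""

    beta: float = 0.1                  # KL weight in the per-token advantage
    judge: str = "Qwen/Qwen3-0.6B"     # LLM judge producing the scalar reward
    max_completion_length: int = 128   # max tokens sampled per completion
    reference_model_path: str = ""     # empty => second copy of the base model

    def validate(self) -> None:
        # Assumed constraints; the app may enforce different ones.
        if self.beta < 0:
            raise ValueError("beta must be >= 0 (0 disables KL regularisation)")
        if self.max_completion_length < 1:
            raise ValueError("max_completion_length must be at least 1")

    @property
    def kl_enabled(self) -> bool:
        """KL regularisation is active only when beta is non-zero."""
        return self.beta > 0


if __name__ == "__main__":
    starter = ReinforceSettings(beta=0.05)  # a lower beta than the default
    starter.validate()
    print(starter)
```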

In the app

On the Train tab, RLHF-REINFORCE shows an Online Preference block on top of the shared form:

Not exposed in the UI for this mode (the app defaults apply; edit via YAML to change them): beta (0.1, the KL weight in the per-token advantage), judge_system (the judge rubric), and reference_model_path (empty ⇒ second copy of the base model). There is no temperature field; REINFORCE samples at the policy’s default temperature.

Shared form: Model & Data; Fine-tune (LoRA/DoRA/Full + Quantization); Training Settings; Output. (QAT is not applicable to online loops.)

Tips & gotchas

References

See also