MLX-LoRA-Studio

GRPO

Family: Reinforcement / online · Reference model: optional · Judge: reward functions · QAT: no

RL without a learned reward model: sample a group, score with functions you write, optimise a group-relative advantage.

Overview

GRPO (DeepSeekMath, 2024) is an RL loop that does not need a learned reward model. For every prompt it samples group_size completions from the current policy, scores each completion with one or more user-supplied reward functions (format checks, accuracy checks, int-format, etc.), and computes a group-relative advantage: A_i = (r_i − mean(r)) / (std(r) + ε). The PPO-style clipped objective is then applied at the token level.

The KL term against the reference model is optional but recommended when the policy starts to drift in a way the rewards do not penalise.
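The group-relative advantage can be worked through by hand for a single prompt. The sketch below is a minimal illustration, not the trainer's code: the function name group_advantages is hypothetical, ε is taken as 1e-4, and the standard deviation is assumed to be the population form (dividing by the group size).

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Return (r_i - mean) / (std + eps) for one prompt's group of rewards."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population std (divide by group size)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt (group_size = 4), already scored.
rewards = [2.0, 0.0, 0.5, 0.5]
advantages = group_advantages(rewards)

# If every completion scores the same, all advantages are zero,
# so that prompt contributes no learning signal through the advantage.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])
```

The best completion in the group gets a positive advantage, the worst a negative one, and the advantages sum to roughly zero.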

Intuition

Objective (math)

Consider a prompt with G sampled completions y_1 … y_G, scored by reward functions {r_k} with weights {w_k}. ε_low and ε_high (the epsilon and epsilon_high settings) are the asymmetric clip bounds from DAPO. importance_sampling_level decides whether the ratio is computed per token or averaged across the sequence.

R_i        =  ∑_k  w_k · r_k(prompt, y_i)                          (total reward)

A_i        =  ( R_i  −  mean_j R_j )  /  ( std_j R_j + 1e−4 )      (group-normalised advantage)

ratio_i,t  =  π_θ(y_i,t)  /  π_ref(y_i,t)                          (importance ratio)

ℒ_clip     =  − min( ratio · A,  clip( ratio, 1−ε_low, 1+ε_high ) · A )

ℒ_KL       =  β · ( ratio · (π_ref / π_θ)  −  log(π_ref / π_θ)  −  1 )    (unbiased KL)

ℒ_GRPO     =  ( ℒ_clip  +  ℒ_KL )   averaged over valid tokens
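To make the clipped term concrete, here is a small pure-Python sketch of the importance ratio and ℒ_clip for one completion. It is an illustration under assumptions rather than the trainer's implementation: the function names are hypothetical, inputs are per-token log-probabilities, the KL term and the averaging over valid tokens are left out, and the sequence level is modelled as the mean log-ratio applied to every token.

```python
import math


def importance_ratios(logp_policy, logp_ref, level="token"):
    """exp(log pi_theta - log pi_ref) per token, or one shared value per sequence."""
    log_ratios = [p - r for p, r in zip(logp_policy, logp_ref)]
    if level == "sequence":
        # Assumption: average the log-ratio over the sequence, then reuse it for each token.
        mean_log_ratio = sum(log_ratios) / len(log_ratios)
        log_ratios = [mean_log_ratio] * len(log_ratios)
    elif level != "token":
        raise ValueError(f"unknown level: {level!r}")
    return [math.exp(lr) for lr in log_ratios]


def clipped_token_losses(ratios, advantage, eps_low, eps_high):
    """-min(ratio * A, clip(ratio, 1 - eps_low, 1 + eps_high) * A) for each token."""
    losses = []
    for ratio in ratios:
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        losses.append(-min(ratio * advantage, clipped * advantage))
    return losses


# Two tokens: the first became more likely than under the reference, the second less likely.
ratios = importance_ratios([-1.0, -2.0], [-1.5, -1.5], level="token")
losses = clipped_token_losses(ratios, advantage=1.0, eps_low=0.2, eps_high=0.3)
seq_ratios = importance_ratios([-1.0, -2.0], [-1.5, -1.5], level="sequence")
```

With a positive advantage, the first token's ratio is capped at 1 + ε_high, so pushing it further earns nothing extra. The second token's ratio is below 1 − ε_low but is not clipped, because the min keeps the smaller of the two products.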

Dataset format

GRPO needs at minimum a prompt field and, for the default r1_* reward functions, an answer field. The 4-tuple the trainer produces per row is (prompt_tokens, answer_tokens, prompt_text, answer_text); an optional 5th element is the type used to switch reward functions per category.
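As an illustration of the row shape, the sketch below builds two rows and checks them before writing. The field names prompt, answer and type come from the description above; the JSONL serialisation and the validate_row helper are assumptions made for this example, not part of the trainer.

```python
import json

REQUIRED_FIELDS = ("prompt", "answer")  # "type" is optional


def validate_row(row):
    """Return the names of required fields that are missing or blank."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not isinstance(row.get(field), str) or not row[field].strip()
    ]


rows = [
    {"prompt": "What is 12 * 7?", "answer": "84", "type": "math"},
    {"prompt": "Return an empty JSON object.", "answer": "{}", "type": "json"},
]

bad_rows = [row for row in rows if validate_row(row)]

# One JSON object per line (assumed on-disk format for a local dataset).
jsonl = "\n".join(json.dumps(row) for row in rows)
```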

The bundled default is mlx-community/Dolci-Think-RL-7B-2k, a reasoning dataset.

When to use it

Use GRPO when you have a verifiable reward (math correctness, code execution, format compliance) rather than a labelled preference dataset. GRPO is the workhorse behind recent reasoning models (DeepSeek-R1, Qwen3-Instruct reasoning mode). Expect completions to look very different from SFT outputs; that is the point.

GRPO-specific settings

In addition to the shared SFT substrate:

Setting | Default | What it actually changes
group_size | 4 | Completions sampled per prompt. Higher = lower-variance advantage, more compute per step.
beta | 0.1 | KL penalty coefficient against the reference. 0 disables KL.
epsilon / epsilon_high | 1e-4 / (unset) | Asymmetric PPO clip (ε_low, ε_high). If epsilon_high is empty, both bounds default to epsilon.
max_completion_length | 512 | Max tokens sampled per completion. Drives time-per-step.
temperature / top_p / top_k / min_p | 0.8 / 0.95 / 20 / 0.0 | Sampler settings for in-loop generation. temperature=0 is invalid.
reward_functions | | Comma-separated reward function names. Empty ⇒ default r1_* family (r1_accuracy, r1_int, r1_strict_format, r1_soft_format, r1_count_xml).
reward_functions_file | | Path to a Python file that registers functions with @register_reward_function(), loaded via load_reward_functions_from_file (see Custom reward functions).
reward_weights | | Comma-separated weights matching the reward function list. Empty = all 1.0.
importance_sampling_level | token | token, sequence, or empty. sequence averages the log-ratio per sequence (GSPO).
grpo_loss_type | grpo | grpo (mean over all tokens), bnpo (normalised by actual token count), dr_grpo (divided by batch_size · max_tokens).
reference_model_path | | Reference model used for KL and (when importance_sampling_level is not empty) the importance ratio.

In the app

On the Train tab, GRPO exposes two algorithm-specific blocks on top of the shared form.

Preference And Judge (shared with DPO/CPO/ORPO):

GRPO Generation And Rewards:

Shared form (every algorithm): Model & Data (base model, dataset, LM Studio export name); Fine-tune (train_type LoRA/DoRA/Full, with LoRA Settings — Layers, Rank, Scale, Dropout — for LoRA/DoRA, and Quantization None/4/6/8/MXFP4); Training Settings (Iterations, Epochs, Batch, Max Seq, Seed, Learning Rate, Optimizer, LR Schedule, Report/Eval/Save, Val Batches, Gradient accumulation, Sequence step size, Test Batches; grad-checkpoint / mask-prompt / fuse toggles); Dataset Columns (Prompt, Answer, Type — the columns the reward functions read); Output (run folder name); QAT (not applicable to GRPO).

Custom reward functions

How it works

GRPO scores each sampled completion with one or more reward functions you supply. The backend keeps a global registry (REWARD_REGISTRY) in mlx_lm_lora/trainer/grpo_reward_functions.py. Functions register themselves with the @register_reward_function() decorator; the trainer resolves the names you list in reward_functions via get_reward_function(name) and sums their weighted outputs into the per-completion reward R_i.

To use your own:

  1. Write a .py file that imports register_reward_function and decorates your functions.
  2. Point Reward functions Python file at it (or set reward_functions_file in YAML). The loader (load_reward_functions_from_file in train.py) execs the file via importlib, so the decorators run and populate the registry at startup.
  3. List the registered names in Custom reward function names (comma-separated) → reward_functions. Leave it empty to use the five built-in r1_* defaults.
  4. Optionally set Reward weights (same length as the function list) → reward_weights. Empty ⇒ all weights 1.0.
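The register-then-resolve flow in the steps above can be mimicked with a toy registry. This is a stand-in written for illustration only: TOY_REGISTRY, toy_register and total_rewards are hypothetical names, and the real REWARD_REGISTRY, register_reward_function and get_reward_function may behave differently in detail.

```python
from typing import Callable, Dict, List, Optional

RewardFn = Callable[[List[str], List[str], List[str], Optional[List[str]]], List[float]]

TOY_REGISTRY: Dict[str, RewardFn] = {}  # stand-in for the backend's global registry


def toy_register(name: str):
    """Decorator that stores the function under `name` when the module is executed."""
    def decorator(fn: RewardFn) -> RewardFn:
        TOY_REGISTRY[name] = fn
        return fn
    return decorator


@toy_register("non_empty")
def non_empty(prompts, completions, answer, types=None):
    return [1.0 if c.strip() else 0.0 for c in completions]


@toy_register("exact")
def exact(prompts, completions, answer, types=None):
    return [1.0 if c.strip() == a.strip() else 0.0 for c, a in zip(completions, answer)]


def total_rewards(names, weights, prompts, completions, answer, types=None):
    """Resolve each name, then sum the weighted scores into one R_i per completion."""
    if len(names) != len(weights):
        raise ValueError("need exactly one weight per reward function")
    totals = [0.0] * len(completions)
    for name, weight in zip(names, weights):
        scores = TOY_REGISTRY[name](prompts, completions, answer, types)
        totals = [t + weight * s for t, s in zip(totals, scores)]
    return totals


totals = total_rewards(
    ["non_empty", "exact"], [0.5, 2.0],
    prompts=["2 + 2?"] * 3, completions=["4", "5", ""], answer=["4"] * 3,
)
```

Because registration happens as a side effect of executing the file, a function that is defined but not decorated never becomes resolvable by name.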

Function signature

Every reward function has the same signature:

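from typing import Callable, List, Optional

RewardFunctions = Callable[[List[str], List[str], List[str], Optional[List[str]]], List[float]]

def my_reward(prompts, completions, answer, types=None) -> list[float]:
    # Inputs are parallel lists (types is optional); return one score per completion.
    ...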

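The absolute scale of the total reward matters little: GRPO normalises rewards within each prompt group into advantages (r − mean) / (std + ε), so adding a constant to every reward in a group, or multiplying them all by the same positive factor, leaves the advantages essentially unchanged. What matters is how completions compare with each other inside a group. When you combine several reward functions, their relative magnitudes and weights still decide which one dominates that comparison.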
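Before starting a run, it can help to call a reward function directly and confirm that it honours this signature. The helper below is a hypothetical convenience written for this page, not something the backend provides. It checks only the return shape: a list with one numeric score per completion.

```python
def check_reward_function(fn, prompts, completions, answer, types=None):
    """Call fn once and verify it returns one numeric score per completion."""
    scores = fn(prompts, completions, answer, types)
    if not isinstance(scores, list):
        raise TypeError(f"expected a list, got {type(scores).__name__}")
    if len(scores) != len(completions):
        raise ValueError(f"got {len(scores)} scores for {len(completions)} completions")
    for score in scores:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise TypeError(f"non-numeric score: {score!r}")
    return [float(score) for score in scores]


def ends_with_period(prompts, completions, answer, types=None):
    """Example function under test: 1.0 when the completion ends with a period."""
    return [1.0 if c.rstrip().endswith(".") else 0.0 for c in completions]


scores = check_reward_function(
    ends_with_period,
    prompts=["Say hi."] * 2,
    completions=["Hello.", "Hello"],
    answer=["Hello."] * 2,
)
```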

Built-in defaults

Registered in grpo_reward_functions.py, shown in the app’s Default Reward Functions list:

App label | Function | Reward
Accuracy | r1_accuracy_reward_func | 2.0 when the extracted <answer> exactly matches the dataset answer
Integer Answer | r1_int_reward_func | 0.5 when the extracted <answer> is digit-only
Strict Format | r1_strict_format_reward_func | 0.5 for strict <think>…</think><answer>…</answer> output
Soft Format | r1_soft_format_reward_func | 0.5 when think/answer tags appear in the right order with content
XML Count | r1_count_xml | small score for exactly one set of tags, with a trailing-text penalty

These expect <answer>…</answer> (and reasoning/think tags) in the completion — the DeepSeek-R1 style. If your data does not use that structure, write a custom function.
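As one example of a different structure, suppose completions end with a line such as "Final answer: 42" instead of an <answer> tag. The sketch below scores that format. The marker, the regular expression and the function names are all assumptions for illustration. In a real reward file the scoring function would also be decorated with @register_reward_function("final_answer_match") so the trainer can resolve it by name.

```python
import re

# Assumed output convention: a line of the form "Final answer: <value>".
_FINAL_ANSWER = re.compile(r"^final answer:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def extract_final_answer(completion):
    """Return the value from the last 'Final answer:' line, or None if there is none."""
    matches = _FINAL_ANSWER.findall(completion)
    return matches[-1] if matches else None


def final_answer_match(prompts, completions, answer, types=None):
    """2.0 when the extracted final answer equals the reference answer, else 0.0."""
    scores = []
    for completion, reference in zip(completions, answer):
        found = extract_final_answer(completion)
        scores.append(2.0 if found is not None and found == reference.strip() else 0.0)
    return scores


scores = final_answer_match(
    prompts=["6 * 7?"] * 3,
    completions=["6 * 7 = 42\nFinal answer: 42", "Final answer: 41", "It is 42."],
    answer=["42"] * 3,
)
```

The third completion contains the right number but no marker line, so it scores 0.0. Whether that strictness is desirable depends on how much the format itself should be rewarded.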

Example custom reward file

Save as e.g. my_rewards.py and point Reward functions Python file at it:

import json
from mlx_lm_lora.trainer.grpo_reward_functions import register_reward_function


@register_reward_function("exact_match")
def exact_match(prompts, completions, answer, types=None):
    """2.0 when the completion text exactly matches the reference answer."""
    return [2.0 if c.strip() == a.strip() else 0.0
            for c, a in zip(completions, answer)]


@register_reward_function("valid_json")
def valid_json(prompts, completions, answer, types=None):
    """1.0 for a JSON object, 0.5 for any valid JSON, 0.0 otherwise."""
    scores = []
    for c in completions:
        try:
            obj = json.loads(c.strip())
            scores.append(1.0 if isinstance(obj, dict) else 0.5)
        except Exception:
            scores.append(0.0)
    return scores


@register_reward_function("concise")
def concise(prompts, completions, answer, types=None):
    """1.0 for short answers, decaying to 0 as word count passes ~200."""
    return [max(0.0, 1.0 - len(c.split()) / 200.0) for c in completions]


@register_reward_function("keyword_bonus")
def keyword_bonus(prompts, completions, answer, types=None):
    """1.0 when the completion contains 'therefore', else 0.0."""
    return [1.0 if "therefore" in c.lower() else 0.0 for c in completions]

Then in the app set Custom reward function names to e.g. exact_match,valid_json,concise,keyword_bonus and Reward weights to e.g. [2.0, 1.0, 0.5, 0.5].

A category-aware example

Use the types argument to switch reward logic per row (the dataset type column maps to Type in Dataset Columns):

import json
from mlx_lm_lora.trainer.grpo_reward_functions import register_reward_function


@register_reward_function("by_type")
def by_type(prompts, completions, answer, types=None):
    scores = []
    types = types or [None] * len(completions)
    for c, a, t in zip(completions, answer, types):
        if t == "math":
            scores.append(2.0 if c.strip() == a.strip() else 0.0)
        elif t == "json":
            try:
                json.loads(c.strip())
                scores.append(1.0)
            except Exception:
                scores.append(0.0)
        else:
            scores.append(0.5 if c.strip() else 0.0)
    return scores

Example YAML

train_mode: grpo
model: mlx-community/Qwen3-0.6B-4bit
dataset:
  - mlx-community/Dolci-Think-RL-7B-2k
group_size: 4
max_completion_length: 512
temperature: 0.8
beta: 0.1                  # KL weight (0 disables KL)
epsilon: 1.0e-4
grpo_loss_type: grpo
importance_sampling_level: sequence
reward_functions: exact_match,by_type,concise
reward_weights: [2.0, 1.5, 0.5]
reward_functions_file: ~/my_rewards.py
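A few of the constraints described on this page can be checked before launching a run. The sketch below works on a plain dict holding the same keys as the YAML above. The preflight function is hypothetical, and the fallback values it uses for missing keys are the defaults listed under GRPO-specific settings.

```python
def preflight(config):
    """Return a list of human-readable problems found in a GRPO config dict."""
    problems = []

    # reward_functions is a comma-separated string; reward_weights is a list.
    raw_names = config.get("reward_functions") or ""
    names = [name.strip() for name in raw_names.split(",") if name.strip()]
    weights = config.get("reward_weights") or []
    if names and weights and len(names) != len(weights):
        problems.append(f"{len(names)} reward functions but {len(weights)} reward weights")

    if config.get("temperature", 0.8) <= 0:
        problems.append("temperature must be positive")

    # With a single completion per prompt, r - mean is always 0.
    if config.get("group_size", 4) < 2:
        problems.append("group_size below 2 leaves every advantage at zero")

    return problems


config = {
    "group_size": 4,
    "temperature": 0.8,
    "reward_functions": "exact_match,by_type,concise",
    "reward_weights": [2.0, 1.5, 0.5],
}
problems = preflight(config)
```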

Tips & gotchas

References

See also