Family: Supervised · Reference model: no · Judge: no · QAT: yes
The canonical “teach the model to imitate a dataset” loop, and the substrate every other algorithm in MLX LoRA Studio is built on.
Supervised fine-tuning (SFT) trains the model to maximise the log-probability it assigns to a completion, given a prompt. Every example is a (prompt, completion) pair (or a chat-style messages list, or a raw text blob). Prompt tokens are masked out so the loss only counts the answer.
In mlx-lm-lora the SFT trainer is the substrate that every other algorithm reuses. The cross-entropy / NLL loss, gradient checkpointing, KV-cache handling, QAT hooks, and the sequence-chunked efficient-long-context forward pass all live in sft_trainer.py. Every preference and RL algorithm inherits those mechanics and only swaps the loss function and (sometimes) the dataset format.
grad_checkpoint, efficient_long_context, seq_step_size, and qat_* live.Per-example next-token cross-entropy, masked so only the completion contributes. Implemented in sft_trainer.default_loss: it shifts the input by one to form (inputs, targets), optionally restricts the loss to a [length_start, length_end] range via the cache offset (used for efficient_long_context), and averages over the masked token count.
ℒ_sft(θ) = − ∑_t log p_θ(y_t | prompt, y_<t) for t ∈ completion tokens
SFT accepts three shapes, in priority order:
messages list of {"role", "content"} dicts (with an optional system field). The trainer applies the tokenizer’s chat template.(prompt, completion) strings; the prompt is masked out of the loss when mask_prompt = true.text string for raw next-token pretraining-style fine-tuning.The bundled default for SFT is mlx-community/JOSIE-v2-Instruct-5K, a 5K-row instruction-tuning set in the messages format.
Minimal SFT YAML (the same file the GUI writes, runnable on the CLI with python -m mlx_lm_lora -c run.yaml):
train_mode: sft
model: mlx-community/Meta-Llama-3-8B-Instruct-4bit
dataset:
- mlx-community/JOSIE-v2-Instruct-5K
adapter:
type: lora
rank: 16
alpha: 32
dropout: 0.05
target_modules: [q_proj, k_proj, v_proj, o_proj]
optim:
learning_rate: 2.0e-4
schedule: cosine
warmup_steps: 20
weight_decay: 0.0
grad_accumulation_steps: 8
train:
max_steps: 1000
batch_size: 2
max_seq_len: 2048
save_every: 200
output:
adapter_path: ~/Library/Application Support/MLXLoRAStudio/runs/<id>/adapter
Almost always first. SFT establishes the model’s behaviour — tone, format, refusal style, domain knowledge. Every other algorithm assumes you have an SFT checkpoint that is already in the right ballpark for the task, and only fine-tunes alignment on top.
Reasonable defaults: lr=1e-5, batch_size=1, grad_accum=8, iters=1000, rank=8, scale=20.0.
These settings are the SFT substrate — every other algorithm inherits them. Each algorithm page lists only its additional settings on top of this table.
| Setting | Default | What it actually changes |
|---|---|---|
model |
Qwen/Qwen3-0.6B |
HF id or local path of the base model. Loaded with the chosen quantisation and wrapped with LoRA/DoRA/full adapters. |
data |
auto |
HF dataset id, local folder, or JSONL. The trainer’s create_dataset picks column names per train_mode. |
train_type |
lora |
lora, dora, or full. See LoRA, DoRA, Full fine-tuning. |
lora_parameters.rank |
8 |
Inner rank of the LoRA adapters. Higher rank = more capacity, more parameters. 4–16 typical on a laptop. |
lora_parameters.scale |
20.0 |
LoRA scaling factor (α). Effective update is (α/r) · B·A, so the ratio of scale to rank is what matters. |
lora_parameters.dropout |
0.0 |
Dropout inside the LoRA branch. Only > 0 for tiny datasets or visible overfitting. |
num_layers |
16 |
How many transformer layers (from the top) get LoRA adapters. |
batch_size |
1 |
Per-device minibatch. Real batch on disk is batch_size · gradient_accumulation_steps. |
gradient_accumulation_steps |
1 |
Micro-batches to accumulate before stepping the optimiser. |
iters / epochs |
iters=1000 |
Fixed iterations or full passes. epochs > 0 switches to epoch-counted training. |
learning_rate |
1e-5 |
Peak LR. 1e-5 to 5e-5 for LoRA; 1e-6 to 5e-6 for full fine-tuning. |
steps_per_report |
10 |
Print loss/reward every N optimiser steps. Drives the live-metrics tick. |
steps_per_eval |
200 |
Validation pass every N steps. -1 disables. |
val_batches |
25 |
Validation minibatches per eval. -1 uses the entire validation set. |
save_every |
100 |
Checkpoint adapter weights every N steps under <adapter_path>/<iter>_adapters.safetensors. |
max_seq_length |
2048 |
Hard cap on token length per example. Drives the KV-cache memory budget. |
grad_checkpoint |
true |
Rewrites every layer’s forward to use mx.checkpoint. ~halves activation memory for one extra recompute per layer. |
efficient_long_context |
false |
Splits long sequences into seq_step_size chunks and reuses the KV-cache. Enable when max_seq_length is large and you OOM. |
seq_step_size |
512 |
Chunk size for efficient_long_context. 256–1024; smaller = less memory, more recompute. |
mask_prompt |
false |
If true, loss is computed only over the completion. SFT and DPO enforce this internally; the flag is for custom datasets. |
fuse |
true |
Merge LoRA weights back into the base model after training and save a standalone checkpoint. |
lm_studio_name |
— |
If set, write the merged model into the LM Studio models directory under this name. |
resume_adapter_file |
— |
Path to a saved adapters.safetensors to warm-start from. Useful for SFT-then-DPO pipelines. |
seed |
0 |
Seed for numpy and mlx.core. MLX GPU kernels are not bit-exact across driver versions. |
load_in_4bits / 6 / 8 / mxfp4 |
4-bit |
Load-time quantisation. 4-bit/MXFP4 most aggressive; 8-bit safe default for preference training. See QLoRA. |
qat_enable |
false |
Install a symmetric fake-quantise hook on every nn.Linear after the first optimiser step. See QAT. |
qat_bits / group_size / start_step / interval |
8 / 64 / 1 / 1 |
Bit-width, group size (0 = per-tensor), first step to enable, re-project interval. |
test / test_batches |
false / 100 |
If test is true, evaluate a held-out test split after training and write the result into the run record. |
| Setting | Default | What it actually changes |
|---|---|---|
prompt_feature / completion_feature |
auto |
Column-name overrides for the (prompt, completion) schema. |
messages_feature / system_feature |
auto |
Column-name overrides for the chat-template messages schema (multi-turn). |
text_feature |
auto |
Column-name override for the raw-pretraining text schema. |
On the Train tab, SFT uses only the shared form — there is no algorithm-specific block:
train_type (LORA / DORA / FULL); when LoRA/DoRA, LoRA Settings (Layers → num_layers, Rank, Scale, Dropout); Quantization (None / 4-bit / 6-bit / 8-bit / MXFP4).efficient_long_context), Test Batches; toggles for grad checkpoint, mask prompt, fuse.loss drop within 100 steps. If it does not, the dataset probably has the wrong column names for SFT (the trainer raises Unsupported data format).grad_checkpoint first, then drop max_seq_length, then try efficient_long_context with a seq_step_size of 256.mask_prompt=true is the default behaviour for SFT in the trainer; the flag is exposed for custom datasets that mix masked and unmasked rows.vendor/mlx-lm-lora/mlx_lm_lora/trainer/sft_trainer.py (default_loss).