MLX-LoRA-Studio

QLoRA (Load-Time Quantization)

Quantization · load_in_4bits / load_in_6bits / load_in_8bits / load_in_mxfp4

Load the base model in N-bit integers with a per-group scale, then train full-precision LoRA/DoRA adapters on top. The standard “QLoRA” recipe for fitting 7B–13B models on a 16 GB laptop.

Overview

mlx-lm-lora supports two distinct kinds of quantisation that are easy to confuse. This page covers load-time quantisation. (See QAT for the in-training variant.)

When the base model is read from disk, mlx.nn.quantize replaces each nn.Linear with a QuantizedLinear that stores the weights as N-bit signed integers plus a per-group scale. The base model stays in memory in that form for the rest of the run, and the LoRA / DoRA adapters are kept in full precision on top of it. Quantisation happens at load time and is irreversible for the loaded model object: the only way to change the base precision is to re-load the model (restart the runner after changing the load_in_* flags).
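To make the "frozen quantised base, full-precision adapter on top" structure concrete, here is a minimal pure-Python sketch. It is a toy, not the mlx or mlx-lm-lora implementation: ToyQuantizedLinear, ToyLoRALinear and matvec are hypothetical names, each row is treated as a single group, and the LoRA shapes, alpha and initial values are illustrative assumptions.

```python
# Toy illustration only: these names are hypothetical, not the mlx / mlx-lm-lora API.

def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


class ToyQuantizedLinear:
    """Frozen base layer: integer codes plus one scale per row.

    For brevity each row is treated as a single group.
    """

    def __init__(self, weight, bits=4):
        q_max = 2 ** (bits - 1) - 1
        # One scale per row; fall back to 1.0 for an all-zero row.
        self.scales = [(max(abs(w) for w in row) / q_max) or 1.0 for row in weight]
        self.codes = [
            [round(w / s) for w in row] for row, s in zip(weight, self.scales)
        ]

    def __call__(self, x):
        # Dequantise on the fly; the stored integer codes are never modified.
        dequant = [[q * s for q in row] for row, s in zip(self.codes, self.scales)]
        return matvec(dequant, x)


class ToyLoRALinear:
    """Quantised base plus a full-precision low-rank update: base(x) + scale * B(Ax)."""

    def __init__(self, base, in_dim, out_dim, rank=2, alpha=4.0):
        self.base = base
        # A starts small and non-zero, B starts at zero, so the adapter is
        # initially a no-op (the usual LoRA initialisation).
        self.A = [[0.01 * (i + j + 1) for j in range(in_dim)] for i in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def __call__(self, x):
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(self.base(x), delta)]


base = ToyQuantizedLinear([[0.875, -0.5, 0.25], [0.1, 0.3, -0.7]])
layer = ToyLoRALinear(base, in_dim=3, out_dim=2)
x = [1.0, 2.0, -1.0]

before = layer(x)                               # equals base(x) while B is all zeros
codes_snapshot = [row[:] for row in base.codes]
layer.B[0][0] = 0.5                             # stand-in for one optimiser step on the adapter
after = layer(x)
```

Only the adapter matrices A and B change during training in this sketch. The integer codes of the base layer stay exactly as they were at load time, which is why changing the base precision requires re-loading the model.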

Load-time quantisation composes with QAT: a typical “QLoRA + QAT” run loads the base model in 4-bit and then trains with the QAT hook enabled so the LoRA updates are robust to that 4-bit precision.

Intuition

The matrix W ∈ R^{out×in} is split along the last axis into groups of group_size consecutive columns. Each group W_g is divided by its scale s_g = max(|W_g|) / q_max, where q_max is the largest representable N-bit signed integer, and the result is rounded to the nearest integer. The quantised weight is Q_g = round(W_g / s_g) and the dequantised weight is Q_g · s_g ≈ W_g.
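As a worked example of the steps just described, the snippet below quantises a single group of four made-up values at 4 bits. The values are chosen so the arithmetic is exact in floating point. The variable names are illustrative only.

```python
bits = 4
q_max = 2 ** (bits - 1) - 1            # 7 for 4-bit
group = [0.875, -0.5, 0.3, -0.1]       # one group W_g (made-up values)

scale = max(abs(w) for w in group) / q_max    # s_g = 0.875 / 7
codes = [round(w / scale) for w in group]     # Q_g, integers in [-7, 7]
restored = [q * scale for q in codes]         # Q_g * s_g, the dequantised group

print(scale)     # 0.125
print(codes)     # [7, -4, 2, -1]
print(restored)  # [0.875, -0.5, 0.25, -0.125]
```

The largest element is reproduced exactly, while the smaller ones snap to the nearest multiple of the scale (0.3 becomes 0.25, -0.1 becomes -0.125). The rounding error per element is at most half the scale, which is why smaller groups or more bits give a closer approximation.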

Objective (math)

Affine quantisation (all four modes):

# Per group of group_size consecutive columns of W:
s_g   =  max(| W_g |) / q_max                 (positive scale)
Q_g   =  clip( round( W_g / s_g ),  q_min,  q_max )    (N-bit signed int)
Ŵ_g   =  Q_g  ·  s_g                           (dequantised)

with q_max = 2^(N−1) − 1 and q_min = −q_max − 1. For N = 4, q_max = 7 and q_min = −8. MXFP4 uses the same per-group scheme, but the group size is fixed at 32, the scale is stored as an 8-bit E8M0 value (a power of two, as in microscaling formats), and each element is a 4-bit floating-point (E2M1) value rather than an integer.
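The formulas above can be sketched for a whole row of weights as follows. This is a plain-Python reference of the scale-only scheme written on this page, not the kernel MLX actually runs. The function names int_range, quantize_row and dequantize_row are hypothetical, and the all-zero-group fallback is an assumption added to avoid dividing by zero.

```python
import random


def int_range(bits):
    """Return (q_min, q_max) for N-bit signed integers."""
    q_max = 2 ** (bits - 1) - 1
    return -q_max - 1, q_max


def quantize_row(row, bits, group_size):
    """Quantise one row group by group; returns integer codes and per-group scales."""
    q_min, q_max = int_range(bits)
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        absmax = max(abs(w) for w in group)
        scale = absmax / q_max if absmax > 0 else 1.0   # assumed fallback for all-zero groups
        scales.append(scale)
        for w in group:
            codes.append(max(q_min, min(q_max, round(w / scale))))   # clip(round(.))
    return codes, scales


def dequantize_row(codes, scales, group_size):
    """Rebuild approximate weights: each code times the scale of its group."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


random.seed(0)
row = [random.uniform(-1.0, 1.0) for _ in range(256)]   # synthetic weights
for bits in (4, 6, 8):
    codes, scales = quantize_row(row, bits, group_size=128)
    restored = dequantize_row(codes, scales, group_size=128)
    worst = max(abs(w - r) for w, r in zip(row, restored))
    print(bits, int_range(bits), f"max abs error = {worst:.5f}")
```

Because every value in a group satisfies |W_g / s_g| ≤ q_max, the clip never triggers in this scheme and the per-element error is bounded by s_g / 2.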

What the settings change

Setting          Default   What it actually changes
load_in_4bits    false     Quantise the base model to 4-bit on load (group size 128). Most aggressive; pairs naturally with LoRA/DoRA.
load_in_6bits    false     6-bit quantisation (group size 128). Rarely the right choice; usually 4-bit or 8-bit wins.
load_in_8bits    false     8-bit quantisation (group size 128). Safe default for preference training and for long-context runs where precise numerics matter.
load_in_mxfp4    false     MXFP4 (microscaling 4-bit), group size 32, 8-bit E8M0 scale. Functionally similar to 4-bit on Apple Silicon but with smaller groups.
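One way to compare these settings is the effective storage cost per weight: the element bits plus the per-group scale amortised over the group. The sketch below uses the group sizes from the table. The 16-bit scale for the integer modes is an assumption made for illustration (the page does not state the scale width, and a real implementation may store additional per-group metadata), and bits_per_weight is a hypothetical helper.

```python
SCALE_BITS_ASSUMED = 16  # assumption: one 16-bit scale per group for the integer modes

# setting name -> (element bits, group size, scale bits)
settings = {
    "load_in_4bits": (4, 128, SCALE_BITS_ASSUMED),
    "load_in_6bits": (6, 128, SCALE_BITS_ASSUMED),
    "load_in_8bits": (8, 128, SCALE_BITS_ASSUMED),
    "load_in_mxfp4": (4, 32, 8),   # 8-bit E8M0 scale, as stated in the table
}


def bits_per_weight(element_bits, group_size, scale_bits):
    """Element bits plus the group's scale spread across its weights."""
    return element_bits + scale_bits / group_size


effective = {name: bits_per_weight(*spec) for name, spec in settings.items()}
for name, bpw in effective.items():
    print(f"{name}: {bpw} bits/weight")
```

Under these assumptions the 4-bit mode costs 4.125 bits per weight and MXFP4 costs 4.25, so the smaller MXFP4 groups add only a little overhead.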

Memory expectations

Approximate, single-GPU Apple Silicon, batch_size 2, 2048-token sequences (from the README):

Base model size   QLoRA 4-bit
1B–3B             5 GB
7B–8B             7 GB
13B               10 GB
70B               38 GB
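As a rough sanity check on these figures, the snippet below estimates how much of each total is taken by the quantised base weights alone. It assumes about 4.125 bits per weight (4-bit codes plus one 16-bit scale per 128 weights) and that every parameter is quantised. Both are simplifying assumptions, and weight_gb is a hypothetical helper, not part of the app.

```python
BITS_PER_WEIGHT = 4.125  # assumption: 4-bit codes + one 16-bit scale per 128 weights


def weight_gb(num_params, bits_per_weight=BITS_PER_WEIGHT):
    """Approximate size of the quantised base weights in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# (largest model size in each table row, total memory quoted in the table)
table = [(3e9, 5), (8e9, 7), (13e9, 10), (70e9, 38)]
for num_params, total_gb in table:
    base = weight_gb(num_params)
    print(f"{num_params / 1e9:.0f}B: base weights ~{base:.1f} GB, "
          f"remainder ~{total_gb - base:.1f} GB of {total_gb} GB")
```

Whatever is left after the base weights has to cover the adapters, optimiser state, activations for the stated batch size and sequence length, and framework overhead, so the table values should be read as approximate totals rather than weight sizes.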

Which to pick

In the app

On the Train tab → Fine-tune section, the Quantization dropdown selects the load-time precision, corresponding to the load_in_* settings above.

Pair it with LoRA or DoRA in the same Fine-tune section. Because quantisation is applied at load time and cannot be undone on the loaded model object, restart the run after changing it. For the “QLoRA + QAT” recipe, also enable the QAT section.

Tips & gotchas

References

See also