Adaptation method · train_type: dora
LoRA plus a magnitude–direction decomposition: the combined weight is split into a unit-norm direction and a learnable magnitude, tuned independently. Reported to match full fine-tuning more closely than LoRA on instruction-following.
DoRA keeps the same low-rank LoRA update but factorises the combined weight into a magnitude vector and a direction matrix so the two can be tuned independently. It costs roughly the same memory and time as LoRA; the only downside is a slightly larger adapter file.
DoRA splits the combined weight into a direction V and a learnable magnitude m, then writes W_dora = m · V / ‖V‖. The motivation is empirical: full fine-tuning tends to update direction and magnitude by very different amounts, and DoRA reproduces that behaviour while keeping the parameter count near LoRA’s.
nn.value_and_grad(model, …) changes.
Let W₀ ∈ R^{out×in} be the frozen base weight.
W' = W₀ + ( α / r ) · B · A
V = W' (frozen after each step)
m = ‖W₀‖_c (per-column magnitude; this is the initial value, learnable afterwards)
W_dora = m · V / ‖V‖
y = W_dora · x
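The equations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the wrapper's actual code: the function name dora_weight, the toy shapes, and the choice to start B at zero (the usual LoRA convention) are assumptions made for the example.

```python
import numpy as np

def dora_weight(W0, A, B, m, alpha, r):
    """Return W_dora = m * V / ||V||, with V = W0 + (alpha / r) * B @ A."""
    V = W0 + (alpha / r) * (B @ A)                        # combined weight, shape (out, in)
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)   # per-column norm, shape (1, in)
    return m * V / col_norm                               # unit-norm columns rescaled by m

rng = np.random.default_rng(0)
out_dim, in_dim, r, alpha = 4, 6, 2, 4.0                  # toy sizes (assumed)

W0 = rng.normal(size=(out_dim, in_dim))                   # frozen base weight
A = rng.normal(size=(r, in_dim)) * 0.01                   # low-rank factor A
B = np.zeros((out_dim, r))                                # B starts at zero (assumed)
m = np.linalg.norm(W0, axis=0, keepdims=True)             # m initialised to ||W0||_c

W_dora = dora_weight(W0, A, B, m, alpha, r)
x = rng.normal(size=(in_dim,))
y = W_dora @ x                                            # forward pass: y = W_dora · x
```

With B at zero the update vanishes, so V equals W₀ and W_dora reproduces the base weight exactly; training then moves m and the direction separately.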
The unit-norm rescaling on V is what makes DoRA different from LoRA plus a magnitude multiplier.
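A small numerical check makes the distinction concrete. The sketch below uses made-up shapes and magnitudes and takes α / r = 1 for brevity. It compares the normalised form with a plain multiplier m · V: only the normalised form has column norms equal to m, and only it is unaffected by rescaling V.

```python
import numpy as np

rng = np.random.default_rng(1)
out_dim, in_dim, r = 5, 3, 2                         # toy sizes (assumed)

W0 = rng.normal(size=(out_dim, in_dim))
A = rng.normal(size=(r, in_dim))
B = rng.normal(size=(out_dim, r))
m = np.array([[1.0, 2.0, 3.0]])                      # arbitrary magnitudes, one per column

V = W0 + B @ A                                       # combined weight, alpha / r = 1

def normalise(V):
    # Divide each column by its own Euclidean norm.
    return V / np.linalg.norm(V, axis=0, keepdims=True)

W_dora = m * normalise(V)                            # DoRA: magnitude times unit direction
W_scaled = m * V                                     # LoRA plus a multiplier, no rescaling

dora_norms = np.linalg.norm(W_dora, axis=0)          # equal to m by construction
scaled_norms = np.linalg.norm(W_scaled, axis=0)      # m entangled with the norm of V

# Rescaling V leaves the DoRA weight untouched: only its direction matters.
W_dora_rescaled = m * normalise(10.0 * V)
```

Because the column norms of W_dora are pinned to m, a gradient step on B or A can change only the direction, which is the decoupling the decomposition is after.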
DoRA shares the LoRA settings table (rank, scale, dropout, num_layers, fuse, resume_adapter_file); set train_type: dora to select the DoRA wrapper.
DoRA is worth trying when LoRA plateaus on a metric that tracks style or format compliance (DoRA is reported to match full fine-tuning more closely on instruction-following).
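The slightly larger adapter file presumably comes from the extra magnitude vector stored alongside A and B. A rough per-layer count can be sketched as follows, assuming one magnitude per column as in the definition of m. The 4096×4096 layer and rank 8 are illustrative values, not defaults taken from the tool.

```python
def lora_param_count(in_dim, out_dim, rank):
    # A is (rank x in_dim) and B is (out_dim x rank).
    return rank * (in_dim + out_dim)

def dora_param_count(in_dim, out_dim, rank):
    # Same A and B, plus one learnable magnitude per column of the (out x in) weight.
    return lora_param_count(in_dim, out_dim, rank) + in_dim

lora = lora_param_count(4096, 4096, 8)    # 65536 trainable values per layer
dora = dora_param_count(4096, 4096, 8)    # 69632 trainable values per layer
overhead = dora / lora - 1                # 0.0625, i.e. about 6% more
```

The relative overhead shrinks as the rank grows, since the magnitude vector stays the same size while A and B get larger.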
On the Train tab → Fine-tune section: pick DORA in the segmented Fine-tune picker (LORA / DORA / FULL), which sets train_type: dora. The same LoRA Settings controls apply to DoRA: Layers (num_layers), Rank, Scale and Dropout, alongside the Quantization picker. fuse is a toggle in Training Settings.
If you change train_type between runs, delete the old adapters.safetensors first: the layer name conventions differ and a stale file will silently fail to load.
mlx_lm_lora.utils.from_pretrained → DoRA wrapper.
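The stale-adapter cleanup can be scripted so it is not forgotten when switching train_type. The helper below is a hypothetical sketch, not part of the tool: remove_stale_adapter is an invented name, and the adapter directory is whatever folder your runs write adapters.safetensors into. The demo uses a temporary directory as a stand-in.

```python
from pathlib import Path
import tempfile

def remove_stale_adapter(adapter_dir, filename="adapters.safetensors"):
    """Delete a leftover adapter file; return True if one was removed."""
    path = Path(adapter_dir) / filename
    if path.is_file():
        path.unlink()
        return True
    return False

# Demo against a throwaway directory (stand-in for a real adapter folder).
with tempfile.TemporaryDirectory() as tmp:
    stale = Path(tmp) / "adapters.safetensors"
    stale.write_bytes(b"")                         # pretend this came from an earlier run
    removed_first = remove_stale_adapter(tmp)      # file exists, so it is deleted
    removed_again = remove_stale_adapter(tmp)      # nothing left to delete
```

Returning a boolean rather than raising keeps the helper safe to call before every run, whether or not an old file is present.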