MLX-LoRA-Studio

Optimizers

Foundation · optimizer: adam / adamw / muon

How gradients become weight updates. AdamW is the safe default; Muon can converge faster on LoRA/DoRA hidden weights of 1B+ models at a comparable per-step cost.

Overview

The built-in guide documents three optimizers (Adam, AdamW, and Muon), but the app’s Optimizer picker exposes the full mlx.optimizers set (10 in total: SGD, RMSprop, Adagrad, AdaDelta, Adam, AdamW, Adamax, Lion, Adafactor, Muon). Adam and AdamW are the workhorses; Muon is a more recent addition (Jordan et al., 2024) reported to converge faster on the 2-D hidden weight matrices of transformers. The choice is set through config.optimizer, and the trainer instantiates the matching class from mlx.optimizers at construction time.

Intuition

Objective (math)

Let g_t = ∇ℒ(θ_{t−1}) be the gradient at step t, and lr the learning rate. All three store state in optimizer.state; the trainer seeds it from mx.random.state for determinism.

Adam (Kingma & Ba, 2014):

m_t   =  β₁ · m_{t−1}  +  ( 1 − β₁ ) · g_t               (first moment)
v_t   =  β₂ · v_{t−1}  +  ( 1 − β₂ ) · g_t²              (second moment)
m̂_t  =  m_t / ( 1 − β₁^t )                              (bias correction)
v̂_t  =  v_t / ( 1 − β₂^t )                              (bias correction)
θ_t   =  θ_{t−1}  −  lr · m̂_t / ( √v̂_t + ε )

Default in mlx.optimizers: β₁ = 0.9, β₂ = 0.999, ε = 1e−8. L2 regularisation is not applied — use AdamW if you want weight decay.
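To make the Adam equations concrete, here is a minimal sketch in plain Python (flat lists of floats, no MLX dependency). The function name adam_step and the toy values are illustrative assumptions, not the app's code, and the library's own Adam may skip the bias-correction step depending on version and flags.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on flat lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)             # bias correction
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy example (values are made up): two parameters, very different gradient scales.
theta = [0.5, -0.25]
grad = [0.1, -2.0]
theta, m, v = adam_step(theta, grad, m=[0.0, 0.0], v=[0.0, 0.0], t=1, lr=1e-3)
```

On the first step the bias-corrected ratio m̂ / √v̂ is close to ±1, so both parameters move by roughly lr regardless of how large the raw gradient is.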

AdamW (Loshchilov & Hutter, 2019):

( m_t, v_t, m̂_t, v̂_t )  ←  Adam update as above
θ_t   =  θ_{t−1}  −  lr · ( m̂_t / ( √v̂_t + ε )  +  λ · θ_{t−1} )

The λ · θ term is the decoupled weight decay: it shrinks the weights directly instead of passing through the adaptive √v̂ scaling. Default weight_decay = 0.01; raise it to pull the trained weights (for LoRA/DoRA, the adapter matrices) more strongly toward zero.
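The AdamW update can be sketched the same way, again in plain Python with hypothetical names (adamw_step is not part of mlx.optimizers). Setting the gradient to zero isolates the decay term: each weight shrinks by a factor of (1 − lr · λ).

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on flat lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        adam_term = m_hat / (math.sqrt(v_hat) + eps)
        # Decoupled decay: lambda * theta is added outside the adaptive ratio.
        new_theta.append(p - lr * (adam_term + weight_decay * p))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# With a zero gradient only the decay acts: 2.0 * (1 - 0.1 * 0.01).
decayed, _, _ = adamw_step([2.0], [0.0], m=[0.0], v=[0.0], t=1,
                           lr=0.1, weight_decay=0.01)
```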

Muon (Jordan et al., 2024):

m_t   =  μ · m_{t−1}  +  g_t                       (momentum buffer, μ ≈ 0.95)
O_t   =  NewtonSchulz5( m_t )                      (≈ orthogonalise the momentum)
scale =  √( max( 1, out / in ) )                   (shape-dependent update scale)
θ_t   =  θ_{t−1}  −  lr · scale · O_t

The Newton–Schulz iteration is a short fixed-length loop (5 steps in the paper) that pushes the singular values of the momentum matrix toward 1, giving an approximately semi-orthogonal update. Each step is a handful of matrix multiplies on the parameter-shaped matrix, so the overhead over AdamW is typically small compared with the forward and backward pass.
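The momentum and orthogonalisation steps can be sketched in dependency-free Python on small lists of rows. The quintic coefficients are the ones used in the public Muon reference implementation and are an assumption here; the helper names are hypothetical, and the learning rate and shape-dependent scale are deliberately left to the caller. A real optimizer would run this on mx.array values.

```python
import math

# Quintic coefficients assumed from the public Muon reference implementation.
NS_A, NS_B, NS_C = 3.4445, -4.7750, 2.0315

def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def newton_schulz5(g, steps=5, eps=1e-7):
    """Approximately orthogonalise a 2-D matrix given as a list of rows."""
    tall = len(g) > len(g[0])
    x = transpose(g) if tall else [row[:] for row in g]  # work in the wide orientation
    norm = math.sqrt(sum(val * val for row in x for val in row)) + eps
    x = [[val / norm for val in row] for row in x]       # Frobenius-normalise: spectral norm <= 1
    for _ in range(steps):
        a = matmul(x, transpose(x))                      # X X^T
        aa = matmul(a, a)
        poly = [[NS_B * p + NS_C * q for p, q in zip(ra, raa)]
                for ra, raa in zip(a, aa)]               # b*A + c*A^2
        px = matmul(poly, x)
        x = [[NS_A * val + p for val, p in zip(rx, rp)] for rx, rp in zip(x, px)]
    return transpose(x) if tall else x

def muon_direction(m_prev, grad, mu=0.95):
    """Momentum accumulation, then orthogonalisation (no lr or scale applied)."""
    m = [[mu * mp + g for mp, g in zip(rm, rg)] for rm, rg in zip(m_prev, grad)]
    return m, newton_schulz5(m)

# Toy 2x3 gradient with singular values 3 and 1 (made-up values).
grad = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
m, o = muon_direction([[0.0] * 3 for _ in range(2)], grad)
```

With these coefficients the singular values are not driven exactly to 1; they land in a band around it (roughly 0.7 to 1.2), which is why the result is only approximately orthogonal.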

What the settings change

Setting                  Default   What it actually changes
optimizer                adamw     Pick adam, adamw, or muon. Class loaded from mlx.optimizers, constructed with learning_rate=lr plus the matching optimizer_config dict.
learning_rate            1e-5      Peak LR. LoRA/DoRA: 1e-5 to 5e-5 (AdamW), 5e-4 to 5e-3 (Muon). Full fine-tuning: 1e-6 to 5e-6 (AdamW).
lr_schedule              (empty)   Optional schedule from mlx_lm.tuner.utils.build_schedule. If non-empty, wraps learning_rate; otherwise constant.
optimizer_config.adam    {}        Extra kwargs for optim.Adam: betas, eps.
optimizer_config.adamw   {}        Extra kwargs for optim.AdamW: betas, eps, weight_decay. Default weight_decay=0.01; raise to 0.05 if LoRA magnitudes drift up.
optimizer_config.muon    {}        Extra kwargs for optim.Muon: momentum, nesterov, weight_decay. Defaults momentum=0.95, nesterov=True.
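One way the settings above could be turned into an optimizer instance is sketched below. This is an illustration, not the app's trainer code: resolve_optimizer, build_optimizer, and OPTIMIZER_CLASSES are hypothetical names, and the config is assumed to be a plain dict with the keys from the table. The MLX import is deferred so the kwargs merging can be checked without MLX installed.

```python
# Hypothetical mapping from config values to mlx.optimizers class names.
OPTIMIZER_CLASSES = {"adam": "Adam", "adamw": "AdamW", "muon": "Muon"}

def resolve_optimizer(config):
    """Return (class_name, kwargs) for the optimizer selected in a config dict."""
    name = config.get("optimizer", "adamw")
    if name not in OPTIMIZER_CLASSES:
        raise ValueError(f"unsupported optimizer: {name!r}")
    kwargs = {"learning_rate": config.get("learning_rate", 1e-5)}
    # Per-optimizer extras (betas, eps, weight_decay, ...) override nothing else.
    kwargs.update(config.get("optimizer_config", {}).get(name, {}))
    return OPTIMIZER_CLASSES[name], kwargs

def build_optimizer(config):
    """Instantiate the optimizer; requires MLX, so the import is deferred."""
    import mlx.optimizers as optim
    class_name, kwargs = resolve_optimizer(config)
    return getattr(optim, class_name)(**kwargs)

config = {
    "optimizer": "adamw",
    "learning_rate": 2e-5,
    "optimizer_config": {"adamw": {"weight_decay": 0.05, "betas": [0.9, 0.999]}},
}
class_name, kwargs = resolve_optimizer(config)
```

Only the sub-dict matching the selected optimizer is forwarded, so leftover optimizer_config.muon entries have no effect while optimizer is set to adamw.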

When to use which

In the app

On the Train tab → Training Settings:

Per-optimizer kwargs (betas, eps, weight_decay, momentum, nesterov) are forwarded from the optimizer_config dict — they are not exposed as dedicated fields in the UI. Edit them directly in the run’s YAML (optimizer_config.adamw: {weight_decay: 0.01, betas: [0.9, 0.999]}) if you need to tune them.

Tips & gotchas

References

See also