Foundation · optimizer: adam / adamw / muon
How gradients become weight updates. AdamW is the safe default; Muon is a free-lunch speedup on LoRA/DoRA hidden weights of 1B+ models.
The built-in guide documents three optimizers — Adam, AdamW, and Muon — but the app’s Optimizer picker exposes the full mlx.optimizers set (10 in total: SGD, RMSprop, Adagrad, AdaDelta, Adam, AdamW, Adamax, Lion, Adafactor, Muon). The first two documented ones are the workhorses; Muon is a relatively recent addition (Jordan et al., 2024) that has been shown to converge faster on transformer hidden weights. The selection is exposed through config.optimizer and the trainer picks the matching class in mlx.optimizers at construction time.
m and the second moment v of the gradient, and applies a bias-corrected update. The default betas = (0.9, 0.999) work well across a wide range of LRs.λ · W added directly to the update rather than appearing inside the gradient. This is the correct way to do L2 regularisation on adaptive optimizers and is the standard choice for transformer fine-tuning.opt = OptClass(learning_rate=lr, **optimizer_config[opt_name]), so per-optimizer hyperparameters (betas, eps, weight_decay, momentum) are forwarded verbatim from the optimizer_config dict.Let g_t = ∇ℒ(θ_{t−1}) be the gradient at step t, and lr the learning rate. All three store state in optimizer.state; the trainer seeds it from mx.random.state for determinism.
Adam (Kingma & Ba, 2014):
m_t = β₁ · m_{t−1} + ( 1 − β₁ ) · g_t (first moment)
v_t = β₂ · v_{t−1} + ( 1 − β₂ ) · g_t² (second moment)
m̂_t = m_t / ( 1 − β₁^t ) (bias correction)
v̂_t = v_t / ( 1 − β₂^t ) (bias correction)
θ_t = θ_{t−1} − lr · m̂_t / ( √v̂_t + ε )
Default in mlx.optimizers: β₁ = 0.9, β₂ = 0.999, ε = 1e−8. L2 regularisation is not applied — use AdamW if you want weight decay.
AdamW (Loshchilov & Hutter, 2019):
( m_t, v_t, m̂_t, v̂_t ) ← Adam update as above
θ_t = θ_{t−1} − lr · ( m̂_t / ( √v̂_t + ε ) + λ · θ_{t−1} )
The λ · θ term is the decoupled weight decay. Default weight_decay = 0.01; tune to control how aggressively the model is pulled toward zero (and how much the LoRA/DoRA adapters are encouraged to stay small).
Muon (Jordan et al., 2024):
m_t = μ · m_{t−1} + g_t (momentum buffer, μ ≈ 0.95)
O_t = NewtonSchulz5( m_t ) (≈ orthogonalise the momentum)
scale = √( out · in ) (spectral-norm-preserving scale)
θ_t = θ_{t−1} − lr · scale · O_t
The Newton–Schulz iteration is a small fixed-point loop (5 steps in the paper) that maps a matrix to its nearest semi-orthogonal one. The update is a single matrix multiply per parameter, so wall-clock cost is comparable to AdamW despite the extra iteration.
| Setting | Default | What it actually changes |
|---|---|---|
optimizer |
adamw |
Pick adam, adamw, or muon. Class loaded from mlx.optimizers, constructed with learning_rate=lr plus the matching optimizer_config dict. |
learning_rate |
1e-5 |
Peak LR. LoRA/DoRA: 1e-5 to 5e-5 (AdamW), 5e-4 to 5e-3 (Muon). Full fine-tuning: 1e-6 to 5e-6 (AdamW). |
lr_schedule |
— |
Optional schedule from mlx_lm.tuner.utils.build_schedule. If non-empty, wraps learning_rate; otherwise constant. |
optimizer_config.adam |
{} |
Extra kwargs for optim.Adam: betas, eps. |
optimizer_config.adamw |
{} |
Extra kwargs for optim.AdamW: betas, eps, weight_decay. Default weight_decay=0.01; raise to 0.05 if LoRA magnitudes drift up. |
optimizer_config.muon |
{} |
Extra kwargs for optim.Muon: momentum, nesterov, weight_decay. Defaults momentum=0.95, nesterov=True. |
weight_decay=0.0 unless you specifically want the LoRA magnitudes regularised.On the Train tab → Training Settings:
mlx.optimizers classes: SGD, RMSprop, Adagrad, AdaDelta, Adam, AdamW, Adamax, Lion, Adafactor, Muon → optimizer.learning_rate.lr_schedule, with Warmup, Decay Fraction, and Final LR fields when a schedule is selected.Per-optimizer kwargs (betas, eps, weight_decay, momentum, nesterov) are forwarded from the optimizer_config dict — they are not exposed as dedicated fields in the UI. Edit them directly in the run’s YAML (optimizer_config.adamw: {weight_decay: 0.01, betas: [0.9, 0.999]}) if you need to tune them.
weight_decay for AdamW on a full fine-tuning run, expect a small loss bump in the first 100 steps as the regulariser pulls weights toward zero. Normal; usually recovers within an epoch.lr_schedule is a small DSL from upstream mlx-examples. Common choices: cosine:<iters>:<min_lr> for cosine decay, warmup_cosine:<warmup>:<iters>:<min_lr> for warm-up followed by decay.