Compare optimizers¶
KempnerForge ships four optimizers. Each is registered under a name
that goes in optimizer.name, and each reads a different subset of
[optimizer] fields. This page explains what those subsets are, what
the LR conventions imply, and how to run a fair head-to-head.
| optimizer.name | Class | Extra | Typical LR |
|---|---|---|---|
| "adamw" | AdamW | — | 1e-4 – 3e-4 |
| "lion" | Lion | — (uses only the shared fields; eps is ignored) | ~3–10× smaller than AdamW |
| "muon" | Muon | muon_momentum, muon_ns_steps, muon_adam_lr | ~0.003 (see below) |
| "schedule_free_adamw" | ScheduleFreeAdamW | schedule_free_warmup_steps | ~0.025 (constant, see below) |
Switching between them is a single-line swap:
[optimizer]
name = "muon" # or "adamw", "lion", "schedule_free_adamw"
lr = 0.003
AdamW¶
[optimizer]
name = "adamw"
lr = 3e-4
weight_decay = 0.1
betas = [0.9, 0.95]
eps = 1e-8
fused = true
Standard PyTorch fused AdamW. Two state tensors per parameter (first
and second moment). This is the KempnerForge default;
configs/train/7b_16gpu_adamw.toml
is the reference 7B recipe.
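For a rough sense of how the state counts on this page translate into memory,
a back-of-the-envelope sketch (plain Python; the ~7B parameter count and fp32
state dtype are assumptions, not KempnerForge internals):
# State elements per parameter: AdamW keeps 2 (first + second moment),
# Lion keeps 1 (momentum), Schedule-Free AdamW keeps ~3 (z, second moment, x).
# Muon sits in between: momentum for the Muon group plus AdamW state for the fallback.
n_params = 6.9e9        # assumed ~7B-parameter model
bytes_per_elem = 4      # assumed fp32 optimizer state

for name, tensors_per_param in [("adamw", 2), ("lion", 1), ("schedule_free_adamw", 3)]:
    gb = n_params * tensors_per_param * bytes_per_elem / 2**30
    print(f"{name:>22}: ~{gb:.0f} GB of optimizer state")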
Lion¶
[optimizer]
name = "lion"
lr = 3e-5 # ~10× smaller than the AdamW LR you'd use
weight_decay = 0.1
betas = [0.9, 0.99]
Sign-based momentum — the update direction is
sign(beta1 * m + (1 - beta1) * grad). Only one state tensor per
parameter (vs two for AdamW), so optimizer memory is roughly halved.
The class docstring recommends an LR 3–10× smaller than AdamW’s. If you
port a run from AdamW, scale lr down and leave weight_decay
alone — Lion’s decoupled weight-decay term keeps the same meaning.
eps is unused by Lion; it’s only in the shared config because other
optimizers read it.
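As a sanity check on the update rule, here is a minimal single-tensor sketch in
plain PyTorch (not the KempnerForge implementation; it just restates the rule
above, with the config field names reused as arguments):
import torch

def lion_step(p, grad, m, lr=3e-5, beta1=0.9, beta2=0.99, weight_decay=0.1):
    """One Lion update on parameter p with momentum buffer m, both in place."""
    p.mul_(1 - lr * weight_decay)                       # decoupled weight decay
    update = (beta1 * m + (1 - beta1) * grad).sign_()   # sign of the interpolation
    p.add_(update, alpha=-lr)
    m.mul_(beta2).add_(grad, alpha=1 - beta2)           # momentum uses the other beta
    return p, m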
Muon¶
[optimizer]
name = "muon"
lr = 0.003 # for 2D matrices (Newton-Schulz orthogonalized)
weight_decay = 0.1
muon_momentum = 0.95
muon_ns_steps = 5
muon_adam_lr = 3e-4 # Optional: LR for 1D params. Default = same as `lr`.
Muon orthogonalizes the 2D momentum update via a 5-step Newton-Schulz
iteration, then scales by sqrt(max(1, m/n)). This keeps the update
direction independent of parameter scale, so the effective LR is
much larger than an AdamW LR on the same model.
configs/train/7b_16gpu_muon.toml
uses lr = 0.003 as a working reference.
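For intuition, a standalone sketch of the orthogonalization step. The
polynomial coefficients follow the public reference Muon implementation;
whether KempnerForge uses exactly these is an assumption:
import torch

def orthogonalize(update, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D update via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315        # reference Muon coefficients (assumed here)
    X = update / (update.norm() + eps)       # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                           # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # one Newton-Schulz step
    return X.T if transposed else X

# The orthogonalized update is then rescaled by sqrt(max(1, m / n)), so tall
# matrices keep a comparable per-row update magnitude.
m, n = 4096, 1024
G = torch.randn(m, n)
scaled = orthogonalize(G) * max(1, m / n) ** 0.5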
Split under the hood. build_optimizer routes each parameter
based on _is_muon_eligible:
- Muon group: 2D+ parameters with a reasonable aspect ratio (max(m, n) / min(m, n) <= 10). Transformer blocks’ weight matrices qualify.
- AdamW fallback group: 1D params (biases, norm scales) and highly-rectangular matrices (embeddings, output heads — where X @ X^T would be vocab_size × vocab_size).
The builder logs the two counts:
Muon: 6,390,525,952 params (NS-orthogonalized), 525,336,576 params (AdamW fallback)
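A sketch of that routing rule as described above (the real _is_muon_eligible
lives in kempnerforge/training/optimizer.py; this version only restates the
documented conditions):
def is_muon_eligible(param):
    """Route a parameter to the Muon group only if it is 2D+ and not too rectangular."""
    if param.ndim < 2:                     # biases, norm scales -> AdamW fallback
        return False
    m, n = param.shape[0], param.shape[-1]
    return max(m, n) / min(m, n) <= 10     # embeddings / output heads fail this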
LR coupling. Muon’s scheduler sees only the Muon param groups.
Before every Muon step(), the internal AdamW LR is rescaled by
muon_current_lr / muon_initial_lr so both groups follow the same
warmup/decay curve. If you set muon_adam_lr, it’s the initial LR
for the 1D group, then tracks the Muon schedule shape.
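In effect the coupling is a single ratio; a sketch with hypothetical names (the
real bookkeeping is internal to the Muon class):
def sync_fallback_lr(muon_current_lr, muon_initial_lr, adam_initial_lr):
    # adam_initial_lr is muon_adam_lr if set, otherwise the Muon lr itself.
    return adam_initial_lr * (muon_current_lr / muon_initial_lr)

# Example: while the Muon LR is warming up at 10% of 0.003, the 1D group's LR
# is also at 10% of its own initial value.
print(sync_fallback_lr(0.0003, 0.003, 3e-4))   # ≈ 3e-05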
FSDP2 interaction. Newton-Schulz runs on each rank’s local shard
of the weight matrix rather than the fully gathered matrix — an
approximation documented in the
Muon docstring.
The approximation works in practice at 16+ GPU scale; the reference Muon
config has been validated there.
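A toy illustration of why the per-shard result is only an approximation, using
an exact SVD-based polar factor as a stand-in for Newton-Schulz (plain tensors,
no FSDP2):
import torch

def orthogonalize_exact(M):
    # Exact polar-factor orthogonalization, i.e. what Newton-Schulz approximates.
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

W = torch.randn(4096, 1024)
full = orthogonalize_exact(W)                              # fully gathered matrix
shards = torch.chunk(W, 2, dim=0)                          # what two ranks hold locally
per_shard = torch.cat([orthogonalize_exact(s) for s in shards], dim=0)
print((full - per_shard).norm() / full.norm())             # clearly nonzero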
Schedule-Free AdamW¶
[optimizer]
name = "schedule_free_adamw"
lr = 0.025 # constant — no scheduler decay
weight_decay = 0.1
betas = [0.9, 0.999]
schedule_free_warmup_steps = 200
[scheduler]
name = "none" # required — see below
Schedule-Free AdamW (Defazio & Mishchenko, 2024) maintains two iterates
per parameter — z (descent point) and x (Polyak average) — and
evaluates gradients at an interpolation y. The internal averaging
replaces an LR schedule, so the training-loop scheduler must be
none (the registered scheduler that leaves the LR constant):
# kempnerforge/config/scheduler.py
none = "none" # constant LR (for schedule-free optimizers)
Internal LR warmup is controlled by schedule_free_warmup_steps
(default 0 = no warmup). Use a few hundred to stabilize early training.
Memory. Three state tensors per parameter (z, v = second
moment, x = Polyak average) + the weight_sum scalar — larger
footprint than AdamW’s two.
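To make the moving parts concrete, a simplified single-tensor sketch of one
step (no internal warmup, no bias correction, and uniform averaging instead of
the lr²-weighted weight_sum; only the structure matches the description above):
import torch

def schedule_free_step(x, z, v, grad_at_y, t, lr=0.025, beta1=0.9,
                       beta2=0.999, eps=1e-8, weight_decay=0.1):
    """One simplified schedule-free AdamW step on a single tensor.

    x is the Polyak average used at eval time, z the descent iterate, v the
    second moment. The gradient is assumed to have been evaluated at
    y = (1 - beta1) * z + beta1 * x, which is what the model holds in train mode.
    """
    v.mul_(beta2).addcmul_(grad_at_y, grad_at_y, value=1 - beta2)
    z.mul_(1 - lr * weight_decay)                        # simplified decoupled decay
    z.addcdiv_(grad_at_y, v.sqrt().add_(eps), value=-lr)
    c = 1.0 / (t + 1)                                    # uniform average (simplified)
    x.mul_(1 - c).add_(z, alpha=c)
    return x, z, v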
Eval quirk. Best downstream numbers come from evaluating the
Polyak-averaged x rather than the training point y. The class
exposes eval_params() and train_params() to swap the parameters
in place, but the training loop does not call them automatically.
If you run scripts/eval.py on a schedule-free checkpoint, you’re
evaluating y unless you patch in those calls — usually close, but
not optimal.
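If you do patch it in, the shape is a small wrapper; a hypothetical sketch
(evaluate stands in for whatever your eval loop does; only eval_params() and
train_params() come from the optimizer class):
def eval_with_polyak_average(model, optimizer, evaluate):
    optimizer.eval_params()        # parameters now hold x, the averaged iterate
    try:
        model.eval()
        metrics = evaluate(model)
    finally:
        model.train()
        optimizer.train_params()   # restore y before the next training step
    return metrics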
Running a fair comparison¶
No benchmark utility ships — you run the comparisons yourself. The minimum protocol:
1. Fix everything except the optimizer. Copy configs/train/7b_16gpu_adamw.toml twice and only change [optimizer] (and [scheduler] for schedule-free). Same seed, same data, same max_steps, same batch_size × grad_accum_steps × world_size.
2. Use one LR per optimizer, not the same LR across optimizers. The conventions are different: a 3e-4 LR for AdamW is roughly 3e-5 for Lion and 3e-3 for Muon (see the sketch after this list). Using AdamW’s LR with Muon makes Muon look catastrophic; using Muon’s LR with AdamW NaNs out.
3. Turn on per-optimizer warmup. AdamW / Lion / Muon use the training-loop scheduler’s warmup; Schedule-Free has schedule_free_warmup_steps instead.
4. Measure steady-state loss after warmup. Plot loss vs steps from at least max(warmup_steps × 3, 200). The first few hundred steps are dominated by warmup transients and aren’t diagnostic.
5. Log enough to diagnose. train/grad_norm for instability, gpu/peak_gb for memory footprint (especially for Schedule-Free vs Lion), tok/s for throughput. See Metrics § Metrics tracker for the fields.
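The LR conversion in step 2 amounts to a fixed rule of thumb; a small sketch
(the multipliers are the conventions from this page, not tuned values):
adamw_lr = 3e-4
starting_points = {
    "adamw": adamw_lr,
    "lion": adamw_lr / 10,          # sign-based update: ~3-10x smaller
    "muon": adamw_lr * 10,          # orthogonalized update: ~10x larger
    "schedule_free_adamw": 0.025,   # constant LR, tuned independently
}
for name, lr in starting_points.items():
    print(f"{name:>22}: lr = {lr:g}")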
A realistic short-horizon pilot is ~500–1,000 steps on the
configs/train/hf_wikitext.toml
recipe — small model, no dataset setup. Rankings there don’t
necessarily transfer to a 7B run, but large discrepancies in memory
or divergence behavior will.
See also¶
- Training § Optimizers — class-level reference for each optimizer’s update rule.
- Configuration § [optimizer] — every OptimizerConfig field and its validation rule.
- Training § Schedulers — the "none" scheduler and why Schedule-Free needs it.
- Scaling guide — context for the reference configs this page points at.
- kempnerforge/training/optimizer.py — source for all four optimizers.