Training

The training loop itself and the knobs around it: optimizers, LR schedulers, loss functions, gradient utilities, in-loop evaluation, sampling, and the TrainingHook extension point.

The Data flow page gives the one-slide overview of a training step; this section zooms in on each collaborator.

  • Training loop — a reader’s walkthrough of scripts/train.py: setup, the two step bodies (PP vs non-PP), phase transitions, periodic work.

  • Optimizers — adamw, lion, muon, schedule_free_adamw: the four registered optimizers, when to pick each, decay grouping, DTensor / FSDP2 notes (decay-grouping sketch below).

  • Schedulers — cosine, linear, wsd, constant, rex, none: warmup and decay math, required fields (warmup + cosine sketch below).

  • Losses — cross_entropy, chunked_cross_entropy, and z_loss as a train-config regularizer (train.z_loss_weight); z-loss sketch below.

  • Gradient utilities — maybe_no_sync for accumulation, clip_grad_norm_ for DTensor-aware clipping (accumulation sketch below).

  • Evaluation — run_eval(), EvalConfig, the PP eval path, standalone scripts/eval.py.

  • Generation — generate() from kempnerforge/model/generate.py, top-k / top-p / temperature, KV cache, standalone scripts/generate.py (sampling sketch below).

  • Hooks — TrainingHook, HookRunner, lifecycle events, when to fork train.py vs write a hook (hook skeleton below).
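
To make the optimizer bullet concrete, here is a minimal sketch of decay grouping: weight matrices get weight decay, while biases and norm parameters are exempt. The ndim-based predicate is a common heuristic and an assumption here, not necessarily the exact rule the registered optimizers apply.

```python
import torch
from torch import nn

def param_groups(model: nn.Module, weight_decay: float) -> list[dict]:
    """Split parameters into decay / no-decay groups.

    Heuristic (an assumption, not necessarily kempnerforge's exact rule):
    2-D+ weights decay; 1-D params (biases, norm scales) do not.
    """
    decay = [p for p in model.parameters() if p.requires_grad and p.ndim >= 2]
    no_decay = [p for p in model.parameters() if p.requires_grad and p.ndim < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))
optimizer = torch.optim.AdamW(param_groups(model, weight_decay=0.1), lr=3e-4)
```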
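The warmup and decay math is easiest to see for cosine, the first of the listed shapes: linear warmup to the base LR, then a cosine glide to a floor. Field names here (warmup_steps, total_steps, min_lr) are illustrative; the Schedulers page lists the fields each registered scheduler actually requires.

```python
import math

def cosine_lr(step: int, base_lr: float, warmup_steps: int,
              total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(step=500, base_lr=3e-4, warmup_steps=1000, total_steps=10_000))     # mid-warmup
print(cosine_lr(step=10_000, base_lr=3e-4, warmup_steps=1000, total_steps=10_000))  # fully decayed
```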
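z-loss is usually defined as a penalty on the squared log-partition function of the logits (as in PaLM), which keeps logit magnitudes from drifting. The sketch below assumes that definition for train.z_loss_weight; whether kempnerforge computes it exactly this way, or fuses it into chunked_cross_entropy, is an assumption.

```python
import torch
import torch.nn.functional as F

def loss_with_z(logits: torch.Tensor, targets: torch.Tensor,
                z_loss_weight: float) -> torch.Tensor:
    """Cross-entropy plus z-loss.

    z-loss penalizes (log Z)^2, where Z is the softmax partition function,
    nudging the logits toward a normalized regime.
    """
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)          # (batch,)
    return ce + z_loss_weight * (log_z ** 2).mean()

logits = torch.randn(8, 1000)                        # (batch, vocab)
targets = torch.randint(0, 1000, (8,))
print(loss_with_z(logits, targets, z_loss_weight=1e-4))
```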
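For the gradient utilities, here is what a maybe_no_sync-style helper typically does during accumulation: suppress the gradient all-reduce on every micro-batch except the last. The signature and the DTensor/FSDP2 handling are assumptions, and the stock torch.nn.utils.clip_grad_norm_ stands in for the repo's DTensor-aware clip_grad_norm_.

```python
import contextlib
import torch

def maybe_no_sync(model: torch.nn.Module, is_last_micro_batch: bool):
    """Return model.no_sync() on non-final micro-batches, else a no-op context.

    Sketch only: FSDP2 toggles gradient sync differently than DDP's
    no_sync(), so the real helper's internals may differ.
    """
    if is_last_micro_batch or not hasattr(model, "no_sync"):
        return contextlib.nullcontext()
    return model.no_sync()

# Typical accumulation step (names illustrative):
#
#   for i, micro_batch in enumerate(micro_batches):
#       with maybe_no_sync(model, i == len(micro_batches) - 1):
#           loss = compute_loss(model, micro_batch) / len(micro_batches)
#           loss.backward()
#   grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()
#   optimizer.zero_grad()
```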
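The three sampling knobs compose as filters over the next-token logits: temperature rescales, top-k keeps only the k largest, and top-p keeps the smallest prefix of the sorted distribution whose mass covers p. A generic sketch; generate()'s actual argument names and filter order may differ.

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0,
                top_k: int | None = None, top_p: float | None = None) -> torch.Tensor:
    """Sample one token id from (vocab,)-shaped logits."""
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        sorted_probs, sorted_idx = torch.softmax(logits, dim=-1).sort(descending=True)
        # Drop tokens once the cumulative mass *before* them already exceeds
        # top_p, so the most likely token always survives.
        drop = sorted_probs.cumsum(dim=-1) - sorted_probs > top_p
        logits[sorted_idx[drop]] = float("-inf")
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)

print(sample_next(torch.randn(1000), temperature=0.8, top_k=50, top_p=0.95))
```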
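Finally, the rough shape of a hook. The lifecycle event names (on_train_start, on_step_end) and the state fields below are hypothetical placeholders, not TrainingHook's real interface; in practice you would subclass TrainingHook and implement whichever events HookRunner dispatches.

```python
from dataclasses import dataclass

@dataclass
class StepState:
    """Hypothetical stand-in for the trainer state a real hook receives."""
    step: int
    grad_norm: float

class LogGradNormHook:
    """Skeleton hook; method names are illustrative, not the real interface."""

    def on_train_start(self, state: StepState) -> None:
        print(f"training starting at step {state.step}")

    def on_step_end(self, state: StepState) -> None:
        if state.step % 100 == 0:
            print(f"step {state.step}: grad_norm={state.grad_norm:.3f}")

hook = LogGradNormHook()
hook.on_step_end(StepState(step=100, grad_norm=0.42))
```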
