Training¶
The training loop itself and the knobs around it: optimizers, LR
schedulers, loss functions, gradient utilities, in-loop evaluation,
sampling, and the TrainingHook extension point.
The Data flow page gives the one-page overview of a training step; this section zooms into each collaborator.
- Training loop — a reader's walkthrough of `scripts/train.py`: setup, the two step bodies (PP vs non-PP), phase transitions, periodic work.
- Optimizers — `adamw`, `lion`, `muon`, `schedule_free_adamw`: the four registered optimizers, when to pick each, decay grouping, DTensor / FSDP2 notes.
- Schedulers — `cosine`, `linear`, `wsd`, `constant`, `rex`, `none`: warmup and decay math, required fields.
- Losses — `cross_entropy`, `chunked_cross_entropy`, and `z_loss` as a train-config regularizer (`train.z_loss_weight`).
- Gradient utilities — `maybe_no_sync` for accumulation, `clip_grad_norm_` for DTensor-aware clipping.
- Evaluation — `run_eval()`, `EvalConfig`, the PP eval path, standalone `scripts/eval.py`.
- Generation — `generate()` from `kempnerforge/model/generate.py`, top-k / top-p / temperature, KV cache, standalone `scripts/generate.py`.
- Hooks — `TrainingHook`, `HookRunner`, lifecycle events, when to fork `train.py` vs write a hook.
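To make the gradient-utilities bullet concrete, here is a minimal single-process sketch of an accumulation step. The helper `maybe_no_sync_sketch` is a hypothetical stand-in for the repo's `maybe_no_sync` (the real helper's signature may differ); the idea it illustrates is standard: skip the DDP gradient all-reduce on non-boundary micro-steps, then clip once before the optimizer step.

```python
import contextlib
import torch
from torch import nn

def maybe_no_sync_sketch(model, sync: bool):
    # Hypothetical stand-in for maybe_no_sync: suppress gradient
    # all-reduce on non-final micro-steps; a no-op without DDP.
    if not sync and hasattr(model, "no_sync"):
        return model.no_sync()
    return contextlib.nullcontext()

def accumulation_step(model, opt, batches, max_norm=1.0):
    opt.zero_grad(set_to_none=True)
    n = len(batches)
    for i, (x, y) in enumerate(batches):
        with maybe_no_sync_sketch(model, sync=(i == n - 1)):
            # Scale each micro-batch loss so the summed gradient
            # matches one full-batch step.
            loss = nn.functional.mse_loss(model(x), y) / n
            loss.backward()
    # Clip once, on the accumulated gradient, before stepping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
    return grad_norm

model = nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [(torch.randn(8, 4), torch.randn(8, 4)) for _ in range(4)]
norm = accumulation_step(model, opt, batches)
```

The same shape works under FSDP2/DTensor provided the clipping helper is sharding-aware, which is why the repo wraps `clip_grad_norm_` rather than calling the PyTorch utility directly.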
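The warmup-and-decay math behind the scheduler bullet can be sketched in a few lines. This is a generic warmup-plus-cosine schedule, not the repo's exact implementation; the function name and keyword fields here are illustrative.

```python
import math

def cosine_lr(step, *, base_lr, warmup_steps, total_steps, min_lr=0.0):
    # Linear warmup from ~0 to base_lr over warmup_steps...
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # ...then cosine decay from base_lr down to min_lr at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The other registered schedules vary only the post-warmup shape: `linear` decays on a straight line, `wsd` holds `base_lr` flat and decays only in a final window, and `constant`/`none` skip decay entirely.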
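For the loss bullet: `z_loss` is usually the PaLM-style regularizer that penalizes the squared log-partition of the logits, keeping softmax normalizers near zero. The sketch below assumes that common definition; the repo's exact form behind `train.z_loss_weight` may differ.

```python
import torch

def z_loss(logits: torch.Tensor, weight: float) -> torch.Tensor:
    # Penalize the squared log-partition log(sum(exp(logits))) per position,
    # averaged over the batch (PaLM-style auxiliary loss).
    return weight * torch.logsumexp(logits, dim=-1).pow(2).mean()

# Added to the main loss: total = cross_entropy + z_loss(logits, w)
logits = torch.randn(2, 8, 32)  # (batch, seq, vocab)
aux = z_loss(logits, weight=1e-4)
```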
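The top-k / top-p / temperature knobs named in the generation bullet compose in a standard way; here is a self-contained sketch for a single position. This is the textbook filtering order (temperature, then top-k, then nucleus), not necessarily the exact logic inside `generate()`.

```python
import torch

def sample_next_token(logits, *, temperature=1.0, top_k=None, top_p=None):
    # logits: (vocab,) for one position.
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        # Drop everything below the k-th largest logit.
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        # Nucleus filtering: keep the smallest prefix of sorted tokens
        # whose cumulative probability covers top_p (always keep the top token).
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        drop = cum - probs > top_p
        sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()
```

With a KV cache, only the last position's logits are needed per step, so this function is exactly the per-step work of a cached decode loop.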
See also¶
- Data flow — the one-page training loop overview.
- Configuration § TrainConfig — every field this subsystem reads.