Training

The training loop itself and the knobs around it: optimizers, LR schedulers, loss functions, gradient utilities, in-loop evaluation, sampling, and the TrainingHook extension point.

The Data flow page gives the one-slide overview of a training step; this section zooms in on each collaborator.

  • Training loop — a reader’s walkthrough of scripts/train.py: setup, the two step bodies (PP vs non-PP), phase transitions, periodic work.

  • Optimizers — adamw, lion, muon, schedule_free_adamw: the four registered optimizers, when to pick each, decay grouping, DTensor / FSDP2 notes (decay-grouping sketch below).

  • Schedulers — cosine, linear, wsd, constant, rex, none: warmup and decay math, required fields (warmup + cosine sketch below).

  • Losses — cross_entropy, chunked_cross_entropy, and z_loss as a train-config regularizer (train.z_loss_weight); z-loss sketch below.

  • Gradient utilities — maybe_no_sync for accumulation, clip_grad_norm_ for DTensor-aware clipping (accumulation sketch below).

  • Evaluation — run_eval(), EvalConfig, the PP eval path, standalone scripts/eval.py.

  • Generation — generate() from kempnerforge/model/generate.py, top-k / top-p / temperature, KV cache, standalone scripts/generate.py (sampling sketch below).

  • Hooks — TrainingHook, HookRunner, lifecycle events, when to fork train.py vs write a hook (hook skeleton below).
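
To make the optimizer bullet concrete, here is a minimal sketch of decay grouping: weight matrices get weight decay, while biases and norm parameters are exempt. The ndim-based predicate is a common heuristic and an assumption here, not necessarily the exact rule the registered optimizers apply.

```python
import torch
from torch import nn

def param_groups(model: nn.Module, weight_decay: float) -> list[dict]:
    """Split parameters into decay / no-decay groups.

    Heuristic (an assumption, not necessarily kempnerforge's exact rule):
    2-D+ weights decay; 1-D params (biases, norm scales) do not.
    """
    decay = [p for p in model.parameters() if p.requires_grad and p.ndim >= 2]
    no_decay = [p for p in model.parameters() if p.requires_grad and p.ndim < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))
optimizer = torch.optim.AdamW(param_groups(model, weight_decay=0.1), lr=3e-4)
```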
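The warmup and decay math is easiest to see for cosine, the first of the listed shapes: linear warmup to the base LR, then a cosine glide to a floor. Field names here (warmup_steps, total_steps, min_lr) are illustrative; the Schedulers page lists the fields each registered scheduler actually requires.

```python
import math

def cosine_lr(step: int, base_lr: float, warmup_steps: int,
              total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(step=500, base_lr=3e-4, warmup_steps=1000, total_steps=10_000))     # mid-warmup
print(cosine_lr(step=10_000, base_lr=3e-4, warmup_steps=1000, total_steps=10_000))  # fully decayed
```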
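z-loss is usually defined as a penalty on the squared log-partition function of the logits (as in PaLM), which keeps logit magnitudes from drifting. The sketch below assumes that definition for train.z_loss_weight; whether kempnerforge computes it exactly this way, or fuses it into chunked_cross_entropy, is an assumption.

```python
import torch
import torch.nn.functional as F

def loss_with_z(logits: torch.Tensor, targets: torch.Tensor,
                z_loss_weight: float) -> torch.Tensor:
    """Cross-entropy plus z-loss.

    z-loss penalizes (log Z)^2, where Z is the softmax partition function,
    nudging the logits toward a normalized regime.
    """
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)          # (batch,)
    return ce + z_loss_weight * (log_z ** 2).mean()

logits = torch.randn(8, 1000)                        # (batch, vocab)
targets = torch.randint(0, 1000, (8,))
print(loss_with_z(logits, targets, z_loss_weight=1e-4))
```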
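For the gradient utilities, here is what a maybe_no_sync-style helper typically does during accumulation: suppress the gradient all-reduce on every micro-batch except the last. The signature and the DTensor/FSDP2 handling are assumptions, and the stock torch.nn.utils.clip_grad_norm_ stands in for the repo's DTensor-aware clip_grad_norm_.

```python
import contextlib
import torch

def maybe_no_sync(model: torch.nn.Module, is_last_micro_batch: bool):
    """Return model.no_sync() on non-final micro-batches, else a no-op context.

    Sketch only: FSDP2 toggles gradient sync differently than DDP's
    no_sync(), so the real helper's internals may differ.
    """
    if is_last_micro_batch or not hasattr(model, "no_sync"):
        return contextlib.nullcontext()
    return model.no_sync()

# Typical accumulation step (names illustrative):
#
#   for i, micro_batch in enumerate(micro_batches):
#       with maybe_no_sync(model, i == len(micro_batches) - 1):
#           loss = compute_loss(model, micro_batch) / len(micro_batches)
#           loss.backward()
#   grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()
#   optimizer.zero_grad()
```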
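The three sampling knobs compose as filters over the next-token logits: temperature rescales, top-k keeps only the k largest, and top-p keeps the smallest prefix of the sorted distribution whose mass covers p. A generic sketch; generate()'s actual argument names and filter order may differ.

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0,
                top_k: int | None = None, top_p: float | None = None) -> torch.Tensor:
    """Sample one token id from (vocab,)-shaped logits."""
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        sorted_probs, sorted_idx = torch.softmax(logits, dim=-1).sort(descending=True)
        # Drop tokens once the cumulative mass *before* them already exceeds
        # top_p, so the most likely token always survives.
        drop = sorted_probs.cumsum(dim=-1) - sorted_probs > top_p
        logits[sorted_idx[drop]] = float("-inf")
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)

print(sample_next(torch.randn(1000), temperature=0.8, top_k=50, top_p=0.95))
```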
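Finally, the rough shape of a hook. The lifecycle event names (on_train_start, on_step_end) and the state fields below are hypothetical placeholders, not TrainingHook's real interface; in practice you would subclass TrainingHook and implement whichever events HookRunner dispatches.

```python
from dataclasses import dataclass

@dataclass
class StepState:
    """Hypothetical stand-in for the trainer state a real hook receives."""
    step: int
    grad_norm: float

class LogGradNormHook:
    """Skeleton hook; method names are illustrative, not the real interface."""

    def on_train_start(self, state: StepState) -> None:
        print(f"training starting at step {state.step}")

    def on_step_end(self, state: StepState) -> None:
        if state.step % 100 == 0:
            print(f"step {state.step}: grad_norm={state.grad_norm:.3f}")

hook = LogGradNormHook()
hook.on_step_end(StepState(step=100, grad_norm=0.42))
```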
