# How-to Guides
End-to-end researcher workflows. Each guide is a single coherent narrative with runnable code (or a link to a notebook, config, or script that runs it) — the reason most researchers come to the docs in the first place.
## Core workflow

- **Build a model** — compose a `Transformer` from components; swap MLP for MoE; toggle QK-norm; register a new block variant.
- **Prepare tokenized data** — two supported paths: pre-tokenize offline (e.g. with `tatm`) and validate with `scripts/prepare_data.py`, or stream from HuggingFace.
- **End-to-end training run** — the flagship walkthrough: tokenize → write a config → launch 1 GPU → launch 4 GPUs → resume → generate.
## Scale up

- **Scaling guide** — the 1 → 32 GPU journey: when to add TP / FSDP / EP / PP, batch-size scaling, MFU goals, common pitfalls.
- **SLURM distributed setup** — single-node → multi-node, InfiniBand, NCCL env, preemption, auto-resume.
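For the MFU goals mentioned above, a common back-of-envelope estimate: training costs roughly 6 FLOPs per parameter per token (forward plus backward), so MFU ≈ 6·N·(tokens/s) / aggregate peak FLOPs. A minimal sketch (the function name and the example numbers are illustrative, not part of this library):

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Model FLOPs Utilization via the standard 6*N*D approximation:
    each training token costs ~6 FLOPs per parameter (fwd + bwd)."""
    achieved = 6.0 * params * tokens_per_sec
    return achieved / peak_flops

# e.g. a 1.3B-param model at 100k tok/s on 4 GPUs, 312 TFLOP/s bf16 peak each
u = mfu(1.3e9, 100_000, 4 * 312e12)  # ≈ 0.625
```

Tracking this single number across the 1 → 32 GPU journey is a quick way to see whether a new parallelism axis paid for its communication cost.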
## Operations

- **Run evaluation** — `run_eval()` during training, standalone `scripts/eval.py`, `scripts/eval_harness.py` for lm-eval-harness.
- **Generate from checkpoint** — load a DCP checkpoint, call `generate()` with temperature / top-k / top-p, interact with `KVCache`.
- **Debug training regressions** — NaN detector, profiler, memory monitor, health checks, five failure shapes and how to read them.
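The generation guide's temperature / top-k / top-p knobs compose in a fixed order: scale logits, softmax, truncate the tail, renormalize, then sample. A dependency-free sketch of that filtering chain (the function name and argument layout are assumptions, not this library's `generate()` signature):

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token index after temperature / top-k / top-p filtering."""
    assert temperature > 0, "use top_k=1 for greedy decoding"
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # keep the smallest high-probability set allowed by both top-k and top-p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if (top_k and len(keep) >= top_k) or cum >= top_p:
            break
    # sample from the kept set, renormalized
    mass = sum(probs[i] for i in keep)
    r, acc = rng.random() * mass, 0.0
    for i in keep:
        acc += probs[i]
        if r <= acc:
            return i
    return keep[-1]
```

With `top_k=1` this degenerates to greedy decoding, which is handy for reproducing a regression before re-enabling stochastic sampling.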
## Research knobs

- **Compare optimizers** — AdamW vs Muon vs Lion vs Schedule-Free AdamW, LR conventions, fair-comparison protocol.
- **Mix datasets and anneal data weights** — weighted mixtures, temperature, phase transitions, LR scale on phase boundaries.
- **Turn on FP8 training** — E4M3 / E5M2, bf16 master weights, FSDP2 float8 all-gather, exclusion rules, when FP8 doesn’t help.
- **MoE experiments** — router choice, aux loss tuning, hot/cold expert diagnosis, when to turn on EP, shared experts.
- **Extract activations for interpretability** — `ActivationStore`, `extract_representations()`, save to `.npz`, feed to probing / CKA / SVCCA.
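On the LR conventions in the optimizer comparison: Lion's update is the sign of an interpolated momentum, so every coordinate moves by exactly ±lr (plus weight decay), which is why the Lion paper suggests roughly 10× smaller LR (and correspondingly larger weight decay) than AdamW. A per-element scalar sketch of the update rule (the function signature is illustrative):

```python
def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update on a single scalar parameter.
    Direction = sign of an interpolation of momentum and gradient;
    momentum is then updated with a second, slower beta."""
    c = beta1 * m + (1 - beta1) * grad
    sign = (c > 0) - (c < 0)                     # sign(c) in {-1, 0, 1}
    param = param - lr * (sign + wd * param)     # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad
    return param, m
```

Because the step magnitude is fixed at ±lr regardless of gradient scale, sweeping LR on the same grid as AdamW is not a fair comparison.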
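One common form of the weighted-mixture temperature in the data-mixing guide: sample dataset i with probability proportional to n_i^(1/T), so T = 1 is size-proportional and large T flattens toward uniform. Conventions vary (some codebases use an exponent α in place of 1/T), so treat this as a sketch under that assumption:

```python
def mixture_weights(sizes, temperature=1.0):
    """Sampling weights w_i ∝ n_i**(1/T) over dataset sizes.
    T=1: proportional to size; T→∞: uniform over datasets."""
    powered = [n ** (1.0 / temperature) for n in sizes]
    total = sum(powered)
    return [p / total for p in powered]
```

Annealing T across phase boundaries is one way to shift weight toward small high-quality sets late in training without a hard cutover.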
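The E4M3 / E5M2 choice in the FP8 guide comes down to range versus precision: E4M3's largest finite value is 448 (more mantissa bits, typically weights and activations) while E5M2 reaches 57344 (more exponent range, commonly gradients). Per-tensor scaling before the cast is typically just max-representable over the observed absolute max; a sketch (the function name is illustrative, not this library's API):

```python
E4M3_MAX = 448.0    # largest finite float8 e4m3 value
E5M2_MAX = 57344.0  # largest finite float8 e5m2 value

def fp8_scale(amax: float, fmt_max: float = E4M3_MAX) -> float:
    """Per-tensor scale mapping the observed abs-max onto the
    format's representable range before the FP8 cast."""
    return fmt_max / amax if amax > 0 else 1.0
```

Tensors whose amax history is spiky (e.g. the final logit projection) are the usual candidates for the guide's exclusion rules.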
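On the MoE guide's aux loss tuning: the Switch-Transformer-style load-balancing loss is num_experts · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean router probability for it. It bottoms out at 1.0 under perfect balance, so values well above 1 are a quick hot-expert diagnostic. A pure-Python sketch (not this library's router API):

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style aux loss: num_experts * sum_i f_i * P_i.
    router_probs: per-token softmax over experts (list of lists);
    assignments: chosen expert index per token."""
    n = len(assignments)
    f = [assignments.count(e) / n for e in range(num_experts)]          # routed fraction
    P = [sum(p[e] for p in router_probs) / n for e in range(num_experts)]  # mean prob
    return num_experts * sum(f[e] * P[e] for e in range(num_experts))
```

The aux-loss coefficient trades balance against routing quality, which is why the guide treats it as a knob rather than a constant.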
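For the CKA step at the end of the interpretability pipeline: linear CKA between two activation matrices reduces, after column-centering, to ‖YᵀX‖²_F / (‖XᵀX‖_F · ‖YᵀY‖_F), and is invariant to isotropic scaling and orthogonal transforms. A dependency-free sketch (in practice you would run this with NumPy on the saved `.npz` arrays):

```python
def linear_cka(X, Y):
    """Linear CKA between activation matrices (rows = examples).
    X, Y: lists of rows with the same number of rows."""
    def center(M):
        cols = list(zip(*M))
        means = [sum(c) / len(c) for c in cols]
        return [[v - mu for v, mu in zip(row, means)] for row in M]

    def gram_fro(A, B):
        """||A^T B||_F^2 for row-major matrices with equal row counts."""
        s = 0.0
        for i in range(len(A[0])):
            for j in range(len(B[0])):
                v = sum(A[k][i] * B[k][j] for k in range(len(A)))
                s += v * v
        return s

    Xc, Yc = center(X), center(Y)
    num = gram_fro(Xc, Yc)
    den = (gram_fro(Xc, Xc) * gram_fro(Yc, Yc)) ** 0.5
    return num / den if den else 0.0
```

CKA of a representation with itself (or any rescaling of itself) is 1.0, which makes a cheap sanity check before comparing layers or checkpoints.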