# How-to Guides
End-to-end researcher workflows. Each guide is a single coherent narrative with runnable code (or a link to a notebook, config, or script that runs it) — the reason most researchers come to the docs in the first place.
## Core workflow

- **Build a model** — compose a `Transformer` from components; swap MLP for MoE; toggle QK-norm; register a new block variant.
- **Prepare tokenized data** — two supported paths: pre-tokenize offline (e.g. with `tatm`) and validate with `scripts/prepare_data.py`, or stream from HuggingFace.
- **End-to-end training run** — the flagship walkthrough: tokenize → write a config → launch 1 GPU → launch 4 GPUs → resume → generate.
## Scale up

- **Scaling guide** — the 1 → 32 GPU journey: when to add TP / FSDP / EP / PP, batch-size scaling, MFU goals, common pitfalls.
- **SLURM distributed setup** — single-node → multi-node, InfiniBand, NCCL env, preemption, auto-resume.
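For the MFU goals mentioned above, a common back-of-envelope estimate: training costs roughly 6 FLOPs per parameter per token (forward plus backward), so MFU ≈ 6·N·(tokens/s) / aggregate peak FLOPs. A minimal sketch (the function name and the example numbers are illustrative, not part of this library):

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Model FLOPs Utilization via the standard 6*N*D approximation:
    each training token costs ~6 FLOPs per parameter (fwd + bwd)."""
    achieved = 6.0 * params * tokens_per_sec
    return achieved / peak_flops

# e.g. a 1.3B-param model at 100k tok/s on 4 GPUs, 312 TFLOP/s bf16 peak each
u = mfu(1.3e9, 100_000, 4 * 312e12)  # ≈ 0.625
```

Tracking this single number across the 1 → 32 GPU journey is a quick way to see whether a new parallelism axis paid for its communication cost.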
## Operations

- **Run evaluation** — `run_eval()` during training, standalone `scripts/eval.py`, `scripts/eval_harness.py` for lm-eval-harness.
- **Generate from checkpoint** — load a DCP checkpoint, call `generate()` with temperature / top-k / top-p, interact with `KVCache`.
- **Debug training regressions** — NaN detector, profiler, memory monitor, health checks, five failure shapes and how to read them.
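The generation guide's temperature / top-k / top-p knobs compose in a fixed order: scale logits, softmax, truncate the tail, renormalize, then sample. A dependency-free sketch of that filtering chain (the function name and argument layout are assumptions, not this library's `generate()` signature):

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token index after temperature / top-k / top-p filtering."""
    assert temperature > 0, "use top_k=1 for greedy decoding"
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # keep the smallest high-probability set allowed by both top-k and top-p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if (top_k and len(keep) >= top_k) or cum >= top_p:
            break
    # sample from the kept set, renormalized
    mass = sum(probs[i] for i in keep)
    r, acc = rng.random() * mass, 0.0
    for i in keep:
        acc += probs[i]
        if r <= acc:
            return i
    return keep[-1]
```

With `top_k=1` this degenerates to greedy decoding, which is handy for reproducing a regression before re-enabling stochastic sampling.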
## Research knobs

- **Compare optimizers** — AdamW vs Muon vs Lion vs Schedule-Free AdamW, LR conventions, fair-comparison protocol.
- **Mix datasets and anneal data weights** — weighted mixtures, temperature, phase transitions, LR scale on phase boundaries.
- **Turn on FP8 training** — E4M3 / E5M2, bf16 master weights, FSDP2 float8 all-gather, exclusion rules, when FP8 doesn’t help.
- **MoE experiments** — router choice, aux loss tuning, hot/cold expert diagnosis, when to turn on EP, shared experts.
- **Extract activations for interpretability** — `ActivationStore`, `extract_representations()`, save to `.npz`, feed to probing / CKA / SVCCA.
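On the LR conventions in the optimizer comparison: Lion's update is the sign of an interpolated momentum, so every coordinate moves by exactly ±lr (plus weight decay), which is why the Lion paper suggests roughly 10× smaller LR (and correspondingly larger weight decay) than AdamW. A per-element scalar sketch of the update rule (the function signature is illustrative):

```python
def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update on a single scalar parameter.
    Direction = sign of an interpolation of momentum and gradient;
    momentum is then updated with a second, slower beta."""
    c = beta1 * m + (1 - beta1) * grad
    sign = (c > 0) - (c < 0)                     # sign(c) in {-1, 0, 1}
    param = param - lr * (sign + wd * param)     # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad
    return param, m
```

Because the step magnitude is fixed at ±lr regardless of gradient scale, sweeping LR on the same grid as AdamW is not a fair comparison.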
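One common form of the weighted-mixture temperature in the data-mixing guide: sample dataset i with probability proportional to n_i^(1/T), so T = 1 is size-proportional and large T flattens toward uniform. Conventions vary (some codebases use an exponent α in place of 1/T), so treat this as a sketch under that assumption:

```python
def mixture_weights(sizes, temperature=1.0):
    """Sampling weights w_i ∝ n_i**(1/T) over dataset sizes.
    T=1: proportional to size; T→∞: uniform over datasets."""
    powered = [n ** (1.0 / temperature) for n in sizes]
    total = sum(powered)
    return [p / total for p in powered]
```

Annealing T across phase boundaries is one way to shift weight toward small high-quality sets late in training without a hard cutover.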
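The E4M3 / E5M2 choice in the FP8 guide comes down to range versus precision: E4M3's largest finite value is 448 (more mantissa bits, typically weights and activations) while E5M2 reaches 57344 (more exponent range, commonly gradients). Per-tensor scaling before the cast is typically just max-representable over the observed absolute max; a sketch (the function name is illustrative, not this library's API):

```python
E4M3_MAX = 448.0    # largest finite float8 e4m3 value
E5M2_MAX = 57344.0  # largest finite float8 e5m2 value

def fp8_scale(amax: float, fmt_max: float = E4M3_MAX) -> float:
    """Per-tensor scale mapping the observed abs-max onto the
    format's representable range before the FP8 cast."""
    return fmt_max / amax if amax > 0 else 1.0
```

Tensors whose amax history is spiky (e.g. the final logit projection) are the usual candidates for the guide's exclusion rules.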
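On the MoE guide's aux loss tuning: the Switch-Transformer-style load-balancing loss is num_experts · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean router probability for it. It bottoms out at 1.0 under perfect balance, so values well above 1 are a quick hot-expert diagnostic. A pure-Python sketch (not this library's router API):

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style aux loss: num_experts * sum_i f_i * P_i.
    router_probs: per-token softmax over experts (list of lists);
    assignments: chosen expert index per token."""
    n = len(assignments)
    f = [assignments.count(e) / n for e in range(num_experts)]          # routed fraction
    P = [sum(p[e] for p in router_probs) / n for e in range(num_experts)]  # mean prob
    return num_experts * sum(f[e] * P[e] for e in range(num_experts))
```

The aux-loss coefficient trades balance against routing quality, which is why the guide treats it as a knob rather than a constant.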
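For the CKA step at the end of the interpretability pipeline: linear CKA between two activation matrices reduces, after column-centering, to ‖YᵀX‖²_F / (‖XᵀX‖_F · ‖YᵀY‖_F), and is invariant to isotropic scaling and orthogonal transforms. A dependency-free sketch (in practice you would run this with NumPy on the saved `.npz` arrays):

```python
def linear_cka(X, Y):
    """Linear CKA between activation matrices (rows = examples).
    X, Y: lists of rows with the same number of rows."""
    def center(M):
        cols = list(zip(*M))
        means = [sum(c) / len(c) for c in cols]
        return [[v - mu for v, mu in zip(row, means)] for row in M]

    def gram_fro(A, B):
        """||A^T B||_F^2 for row-major matrices with equal row counts."""
        s = 0.0
        for i in range(len(A[0])):
            for j in range(len(B[0])):
                v = sum(A[k][i] * B[k][j] for k in range(len(A)))
                s += v * v
        return s

    Xc, Yc = center(X), center(Y)
    num = gram_fro(Xc, Yc)
    den = (gram_fro(Xc, Xc) * gram_fro(Yc, Yc)) ** 0.5
    return num / den if den else 0.0
```

CKA of a representation with itself (or any rescaling of itself) is 1.0, which makes a cheap sanity check before comparing layers or checkpoints.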