How-to Guides

End-to-end researcher workflows. Each guide is a single coherent narrative with runnable code (or a link to a notebook, config, or script that runs it) — the reason most researchers come to the docs in the first place.

Core workflow

  • Build a model — compose a Transformer from components; swap MLP for MoE; toggle QK-norm; register a new block variant (QK-norm toggle sketched after this list).

  • Prepare tokenized data — two supported paths: pre-tokenize offline (e.g. with tatm) and validate with scripts/prepare_data.py, or stream from HuggingFace (streaming path sketched after this list).

  • End-to-end training run — the flagship walkthrough: tokenize → write a config → launch 1 GPU → launch 4 GPUs → resume → generate.
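
For the model-building guide, a minimal sketch of what a QK-norm toggle looks like inside a self-attention block, in plain PyTorch (nn.RMSNorm needs PyTorch ≥ 2.4). The class name and constructor arguments here are illustrative assumptions, not the library's actual block API:

```python
import torch
from torch import nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    """Illustrative multi-head attention with an optional QK-norm toggle."""
    def __init__(self, dim: int, n_heads: int, qk_norm: bool = False):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # QK-norm: RMS-normalize queries and keys per head before the dot product.
        self.q_norm = nn.RMSNorm(self.head_dim) if qk_norm else nn.Identity()
        self.k_norm = nn.RMSNorm(self.head_dim) if qk_norm else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, T, C))
```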
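And for the streaming path of the data guide, a minimal sketch using the standard datasets / transformers APIs; the dataset name, tokenizer, and sequence length are placeholders, not the guide's defaults:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

# streaming=True iterates over shards lazily instead of downloading the full dataset.
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = stream.map(tokenize, batched=True, remove_columns=["text"])

# Peek at the first two tokenized examples.
for example in tokenized.take(2):
    print(len(example["input_ids"]))
```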

Scale up

  • Scaling guide — 1 → 32 GPU journey, when to add TP / FSDP / EP / PP, batch-size scaling, MFU goals, common pitfalls (MFU arithmetic sketched after this list).

  • SLURM distributed setup — single-node → multi-node, InfiniBand, NCCL env, preemption, auto-resume (process-group bootstrap sketched after this list).
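
As a reference for the scaling guide's MFU goals, a back-of-envelope MFU calculation using the usual ≈ 6 FLOPs per parameter per token approximation (forward + backward, attention FLOPs ignored). The throughput and peak-FLOPs numbers below are assumptions; substitute your own measurements and hardware spec:

```python
def mfu(tokens_per_sec: float, n_params: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved training FLOPs over peak hardware FLOPs."""
    achieved = 6 * n_params * tokens_per_sec        # ~6 FLOPs / param / token
    return achieved / (n_gpus * peak_flops_per_gpu)

# Example: 1.4B params, 1.5M tokens/s on 32 GPUs, assuming ~989 TFLOP/s dense BF16 peak (H100).
print(f"MFU ≈ {mfu(1.5e6, 1.4e9, 32, 989e12):.1%}")   # ≈ 40%
```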
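And for the SLURM guide, a minimal sketch of how each srun task can derive its rank from SLURM's environment and join an NCCL process group. It assumes one task per GPU and that MASTER_ADDR / MASTER_PORT are exported in the sbatch script; this is a generic bootstrap, not the guide's exact recipe:

```python
import os
import torch
import torch.distributed as dist

# One task per GPU under `srun`; SLURM provides global and node-local ranks.
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

# MASTER_ADDR / MASTER_PORT are assumed to be exported by the sbatch script.
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

print(f"rank {rank}/{world_size} on GPU {local_rank} ready")
```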

Operations

  • Run evaluation — run_eval() during training, standalone scripts/eval.py, scripts/eval_harness.py for lm-eval-harness.

  • Generate from checkpoint — load a DCP checkpoint, call generate() with temperature / top-k / top-p, interact with KVCache (sampling step sketched after this list).

  • Debug training regressions — NaN detector, profiler, memory monitor, health checks, five failure shapes and how to read them (NaN-hook sketch after this list).
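
For the generation guide, a minimal temperature / top-k / top-p sampling step in plain PyTorch, to show what the knobs do on one set of logits; the real generate() also threads a KVCache through the forward pass, which is omitted here:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    """One sampling step over [batch, vocab] logits for the last position."""
    logits = logits / max(temperature, 1e-5)
    if top_k > 0:
        # Drop everything below the k-th largest logit.
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        # Nucleus sampling: keep the smallest prefix of tokens whose mass exceeds top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        sorted_probs = torch.softmax(sorted_logits, dim=-1)
        cum_before = sorted_probs.cumsum(dim=-1) - sorted_probs
        sorted_logits = sorted_logits.masked_fill(cum_before > top_p, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # [batch, 1] token ids
```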
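And for the debugging guide, a hedged sketch of what a gradient NaN detector can look like using plain parameter hooks (call register_nan_hooks(model) once before training); the library's built-in detector may work differently:

```python
import torch
from torch import nn

def register_nan_hooks(model: nn.Module) -> None:
    """Raise on the first parameter whose gradient goes non-finite, naming the culprit."""
    def make_hook(name: str):
        def hook(grad: torch.Tensor) -> torch.Tensor:
            if not torch.isfinite(grad).all():
                raise RuntimeError(f"non-finite gradient in {name}")
            return grad
        return hook

    for name, param in model.named_parameters():
        if param.requires_grad:
            param.register_hook(make_hook(name))
```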

Research knobs