Quickstart¶
A 5-minute walkthrough that trains a tiny model so you can verify your install and see the training loop end-to-end.
Tip
When you change model.vocab_size, model.dim, or any other shape-affecting
field between runs, use a fresh --checkpoint.dir or delete the old one
first. train.py auto-resumes from the latest checkpoint in the directory,
which will fail with a shape mismatch if the architecture changed. Examples
below use /tmp/ paths so runs don’t collide.
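For example, to wipe the checkpoint directory used in step 2 below before re-running with different model dimensions (a small Python equivalent of rm -rf on that path; the path is just this guide's example path):
import shutil

# Remove the old checkpoint directory so train.py starts fresh instead of
# trying to resume from checkpoints saved with the previous architecture.
shutil.rmtree("/tmp/kf_quickstart/step2", ignore_errors=True)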
1. Install¶
git clone git@github.com:KempnerInstitute/KempnerForge.git
cd KempnerForge
uv sync
If you want more detail on this step, see Install.
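Before training, you can optionally confirm that the environment resolved and that PyTorch (which KempnerForge builds on) can see your GPU. This is a generic PyTorch check, not a KempnerForge command; save it to any filename and run it with uv run python <file>:
import torch

# Generic sanity check: PyTorch version and visible CUDA devices.
print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())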
2. Run a 20M-parameter debug model on a single GPU¶
uv run python scripts/train.py configs/train/debug.toml \
--checkpoint.dir=/tmp/kf_quickstart/step2
You should see per-step loss / MFU / step_time logs. The run takes under a minute. It uses synthetic data (no dataset download) — useful for sanity-checking the install before pointing at real data.
For a slower walkthrough of what this run does, what the log line means, and what ends up in the checkpoint directory, see Your First Training Run.
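If you want a quick look at what the run wrote before reading that page, you can list the checkpoint directory; the exact files are described in Your First Training Run, so this snippet makes no assumptions about their names:
from pathlib import Path

# List everything train.py wrote under the step-2 checkpoint directory.
root = Path("/tmp/kf_quickstart/step2")
for path in sorted(root.rglob("*")):
    print(path.relative_to(root))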
3. Multi-GPU on a single node (FSDP2)¶
uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/debug.toml \
--distributed.dp_shard=4 \
--checkpoint.dir=/tmp/kf_quickstart/step3
4. Point at your own tokenized data¶
Pre-tokenized .bin or .npy shards work directly:
uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/debug.toml \
--data.dataset_path=/path/to/your/shards \
--data.file_pattern='tokenized_*.bin' \
--model.vocab_size=128256 \
--checkpoint.dir=/tmp/kf_quickstart/step4
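If your corpus isn't tokenized yet, the sketch below shows one plausible way to write a shard matching the flags above. It is an assumption, not KempnerForge's documented format: the flat stream-of-token-ids layout, the uint32 dtype, and the tokenized_000.bin name are guesses chosen to line up with --data.file_pattern and --model.vocab_size, so check the data-loading docs before relying on it:
import numpy as np

# Hypothetical shard writer: a flat array of token ids dumped as raw bytes.
# Replace the random ids with real tokenizer output; uint32 is used because
# ids up to vocab_size=128256 do not fit in uint16.
token_ids = np.random.randint(0, 128256, size=1_000_000, dtype=np.uint32)
token_ids.tofile("/path/to/your/shards/tokenized_000.bin")
# np.save("/path/to/your/shards/tokenized_000.npy", token_ids)  # .npy shards also work,
# per the note above, if --data.file_pattern is changed to match.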
Or stream from HuggingFace:
uv run python scripts/train.py configs/train/hf_wikitext.toml \
--checkpoint.dir=/tmp/kf_quickstart/step4_hf
5. Try a different optimizer¶
Swap AdamW for Muon without touching code:
uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/debug.toml \
--optimizer.name=muon \
--checkpoint.dir=/tmp/kf_quickstart/step5
Available: adamw, muon, lion, schedule_free_adamw.
6. Enable MoE¶
uv run python scripts/train.py configs/train/debug_moe.toml \
--checkpoint.dir=/tmp/kf_quickstart/step6
Or turn on MoE via CLI on the dense debug config:
uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/debug.toml \
--model.num_experts=8 --model.moe_top_k=2 --model.moe_router=sigmoid_topk \
--checkpoint.dir=/tmp/kf_quickstart/step6_cli
7. Extend the training loop without forking train.py¶
See examples/custom_hook.py for four example hooks:
GradNormHistogramHook — per-layer gradient norms to WandB
LearningDynamicsHook — weight norms and gradient SNR
EarlyStoppingHook — stop if eval loss plateaus
ExpertLoadBalanceHook — MoE expert utilization metrics
Register them in your own script by subclassing TrainingHook.
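As a rough sketch of that pattern (illustrative only: the stand-in base class and the on_step_end callback name below are assumptions made so the snippet runs on its own; the real TrainingHook interface is defined by the framework, and examples/custom_hook.py is the authoritative reference):
class TrainingHook:  # stand-in for the framework's real base class
    def on_step_end(self, step: int, metrics: dict) -> None:
        pass

class LossSpikeHook(TrainingHook):
    """Warn when the training loss jumps sharply between consecutive steps."""

    def __init__(self, factor: float = 2.0):
        self.factor = factor
        self.prev_loss = None

    def on_step_end(self, step: int, metrics: dict) -> None:
        loss = metrics.get("loss")
        if loss is not None and self.prev_loss is not None and loss > self.factor * self.prev_loss:
            print(f"[LossSpikeHook] step {step}: loss jumped {self.prev_loss:.3f} -> {loss:.3f}")
        self.prev_loss = loss

# Toy driver standing in for the real training loop / hook registration:
if __name__ == "__main__":
    hook = LossSpikeHook()
    for step, loss in enumerate([2.0, 1.9, 5.1, 1.8]):
        hook.on_step_end(step, {"loss": loss})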
Next steps¶
Understand the run: Your First Training Run explains the log line, the checkpoint directory layout, and auto-resume.
Interactive exploration: Notebooks lists six Jupyter notebooks (model inspection, attention visualization, activation extraction, checkpoint analysis, optimizer comparison, MoE routing).
Scale up: see README § Training Configurations for 7B / 13B / 70B configs.
Run on SLURM: see README § Quick Start for single- and multi-node launch scripts.
Measured performance: see benchmarks/mfu_scaling/mfu_scaling.md for MFU / throughput numbers across 1–32 GPUs.
Contribute: CONTRIBUTING.md walks through the issue → branch → PR flow.