End-to-end training run¶
This is the flagship how-to — the one that takes you from a clean checkout to a trained checkpoint and a sampled completion, using only what’s in the repo. If you can finish this page, every other how-to is an expansion.
Runnable plan:

1. Install the environment.
2. Cache the tokenizer.
3. Launch a 1-GPU run to confirm the loop works.
4. Scale to 4 GPUs on one node via torchrun.
5. Kill the job; auto-resume from the last checkpoint.
6. Generate text from the checkpoint.
The reference config we use throughout is `configs/train/hf_wikitext.toml` — a small (~40M-param) model that streams Wikitext-103 from HuggingFace. No dataset setup required.
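Before launching anything, it can help to eyeball what the config contains. A quick sketch using Python's stdlib TOML parser (3.11+); it assumes nothing about the schema beyond valid TOML:

```python
# Dump the reference config's top-level tables to see what's tunable.
import tomllib

with open("configs/train/hf_wikitext.toml", "rb") as f:
    cfg = tomllib.load(f)

for section, values in cfg.items():
    print(f"[{section}] {values}")
```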
1. Install¶
git clone https://github.com/KempnerInstitute/KempnerForge.git
cd KempnerForge
uv sync # creates .venv and installs all deps
`uv sync` installs PyTorch, transformers, datasets, and the rest. If uv isn’t on the machine, install it first:

curl -LsSf https://astral.sh/uv/install.sh | sh
2. Cache the tokenizer¶
The reference config uses the GPT-2 tokenizer. Compute nodes typically have restricted or much slower internet access (~1 Gbps vs. ~100 Gbps on login nodes), so it’s best to pre-cache it on the login node:
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('gpt2')"
The tokenizer is cached under `~/.cache/huggingface/`. See Prepare tokenized data § Cache the tokenizer first.
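To confirm the cache actually works offline before submitting a job, force a cache-only load (`local_files_only` is standard transformers API; the prompt string is just an example):

```python
# Pre-cache on the login node, then prove the compute-node path
# works with no network by loading from ~/.cache/huggingface alone.
from transformers import AutoTokenizer

AutoTokenizer.from_pretrained("gpt2")                       # downloads + caches
tok = AutoTokenizer.from_pretrained("gpt2", local_files_only=True)
print(tok("The Kempner Institute")["input_ids"])
```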
3. Single-GPU sanity check¶
uv run python scripts/train.py configs/train/hf_wikitext.toml
What this does:
- `load_config` reads the TOML into a `JobConfig`.
- `init_distributed` initializes the process group (a single-rank group on 1 GPU — it still uses the distributed path).
- `build_parallel_model` constructs the `Transformer` and applies FSDP2 (`dp_shard = -1` auto-resolves to 1 on one rank).
- The training loop streams Wikitext-103, computes cross-entropy, steps AdamW with a cosine schedule, and checkpoints to `checkpoints/hf_wikitext/step_N` every 100 steps.
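The single-rank-group trick is worth seeing once. A minimal sketch of the standard torch.distributed idiom (KempnerForge's actual `init_distributed` may differ in details):

```python
# Initialize a process group that works both under torchrun and as a
# plain `python` launch: torchrun sets RANK/WORLD_SIZE/MASTER_*; when
# they're absent, default to a single-rank group on this machine.
import os
import torch.distributed as dist

def init_distributed() -> int:
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl")  # same code path on 1 or N GPUs
    return dist.get_rank()
```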
Expected output (first few lines):
[rank 0] step 10 loss 10.42 lr 6.0e-05 tok/s 8,420 mfu 2.1%
[rank 0] step 20 loss 9.84 lr 1.2e-04 tok/s 8,510 mfu 2.1%
...
Let it run ~50 steps (loss should drop below 10), then Ctrl+C.
Note
MFU is low (~2%) because this is a 40M-param model on a single H100 — most of the runtime is framework + data loading, not matmul. MFU becomes meaningful at 7B+ scale. See Scaling guide.
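If you want to sanity-check MFU numbers yourself, the usual first-order estimate is ~6 FLOPs per parameter per token for forward + backward. A sketch; it ignores attention FLOPs and whatever exact accounting the trainer uses, so it won't reproduce the logged value to the decimal:

```python
def estimate_mfu(n_params: float, tokens_per_sec: float,
                 peak_flops_per_sec: float = 989e12) -> float:
    """First-order MFU estimate: achieved training FLOP/s over peak.
    6 * N * D approximates forward + backward FLOPs per token; the
    default peak is H100 dense bf16 (~989 TFLOP/s)."""
    return 6 * n_params * tokens_per_sec / peak_flops_per_sec
```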
4. Four-GPU run on one node¶
uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/hf_wikitext.toml
torchrun spawns 4 processes, each binding one GPU. The config’s `dp_shard = -1` resolves to 4 — FSDP2 shards parameters, gradients, and optimizer state across the 4 GPUs. Expect the same loss curve at roughly 3.5× the tokens/sec (assuming 4× H100, bf16, and Wikitext streaming fast enough not to bottleneck).
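Under the hood, "apply FSDP2" amounts to calls to `fully_shard` over a 1-D device mesh. A sketch assuming the torch ≥ 2.6 import path and an illustrative `model.layers` attribute (the real `build_parallel_model` structure may differ):

```python
# Shard each transformer block, then the root module, across every
# rank in the job — the dp_shard = -1 "use all ranks" behavior.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2

def shard(model):
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))
    for block in model.layers:         # per-block sharding (illustrative attr)
        fully_shard(block, mesh=mesh)
    fully_shard(model, mesh=mesh)      # root: embeddings, norms, lm head
    return model
```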
Watch the log: every rank reports per-step metrics, but only rank 0 writes checkpoints. By default, checkpoints land in `checkpoints/hf_wikitext/` under the current working directory.
5. Kill and auto-resume¶
KempnerForge catches SIGTERM and SIGUSR1 (the signals SLURM sends on preemption / timeout) and writes an emergency checkpoint before exiting. Ctrl+C sends SIGINT, which is not intercepted — the process dies immediately, and the last durable state is whatever the most recent periodic (every-100-steps) checkpoint saved.
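The pattern behind this is a flag-setting signal handler checked between steps. A sketch (not KempnerForge's actual handler):

```python
# Intercept the SLURM signals, leave SIGINT alone. The handler only
# flips a flag; the training loop checks it once per step so the
# checkpoint is written from a consistent point, not mid-backward.
import signal

shutdown_requested = False

def _request_shutdown(signum, frame):
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, _request_shutdown)  # SLURM preemption
signal.signal(signal.SIGUSR1, _request_shutdown)  # SLURM timeout warning
# SIGINT (Ctrl+C) keeps its default behavior: die immediately.

# Inside the training loop, once per step:
#   if shutdown_requested:
#       save_checkpoint(step)   # the "emergency checkpoint"
#       break
```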
To exercise the emergency path manually:
# In another shell, find the rank-0 pid:
pgrep -af 'scripts/train.py'   # lists launcher + worker PIDs
kill -TERM <pid>
Expected rank-0 log:
Shutdown requested at step 247 — saving emergency checkpoint
Emergency checkpoint written to checkpoints/hf_wikitext/step_247
If you just hit Ctrl+C, you’ll resume from the last periodic save
instead (e.g., step_200), which is usually fine for dev work.
Relaunch the same command:
uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/hf_wikitext.toml
`CheckpointManager` follows the `checkpoints/hf_wikitext/latest` symlink (updated on every save) and falls back to the highest `step_N` directory if the symlink is missing. Training picks up at the next step with model, optimizer, scheduler, dataloader, and RNG state restored.
The dataloader resumes from the exact sample via `StatefulDataLoader.load_state_dict` + `DistributedSampler.set_skip` (pre-tokenized path) or `_skip_rank_docs` (HF-streaming path), so no sample is replayed and none is skipped. See Checkpointing § Auto-resume and Resilience § SLURM preemption.
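The resolution order is simple enough to sketch (illustrative, not `CheckpointManager`'s actual code): prefer the `latest` symlink, otherwise scan for the highest `step_N`:

```python
import re
from pathlib import Path

def resolve_resume_dir(ckpt_root: Path) -> Path | None:
    latest = ckpt_root / "latest"
    if latest.is_symlink() and latest.resolve().is_dir():
        return latest.resolve()            # normal case: symlink is fresh
    # Fallback: highest-numbered step_N directory, if any exist.
    steps = [(int(m.group(1)), d)
             for d in ckpt_root.glob("step_*")
             if (m := re.fullmatch(r"step_(\d+)", d.name)) and d.is_dir()]
    return max(steps)[1] if steps else None
```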
6. Generate from the checkpoint¶
Once the loss is reasonable (< 7 on Wikitext), try generation:
uv run python scripts/generate.py configs/train/hf_wikitext.toml \
--checkpoint.load_path=checkpoints/hf_wikitext/latest \
--data.tokenizer_path=gpt2 \
--prompt "The Kempner Institute" \
--max_tokens 64 \
--temperature 0.8 \
--top_p 0.9
`scripts/generate.py` is single-GPU: it loads the DCP checkpoint into an un-sharded model, tokenizes the prompt, calls `generate()` with a KV cache, and prints the decoded output.
Arguments:

| Flag | Default | Purpose |
|---|---|---|
| config (positional) | — | TOML path |
| `--checkpoint.load_path` | — | Path to a DCP checkpoint |
| `--data.tokenizer_path` | from config | HF tokenizer ID or local path |
| `--prompt` | | Input text |
| `--max_tokens` | | Max new tokens |
| `--temperature` | | Sampling temperature (0 = greedy) |
| `--top_k` | | Top-k filtering (0 = off) |
| `--top_p` | | Nucleus threshold |
| `--interactive` | | REPL mode |
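To make the sampling flags concrete, here's how temperature, top-k, and top-p typically compose over the logits — a generic sketch, not necessarily `scripts/generate.py`'s exact sampler:

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8,
                top_k: int = 0, top_p: float = 0.9) -> torch.Tensor:
    """logits: (batch, vocab) for the last position."""
    if temperature == 0:                               # greedy
        return logits.argmax(dim=-1)
    logits = logits / temperature
    if top_k > 0:                                      # keep the k highest logits
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:                                    # nucleus filtering
        sorted_logits, idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        drop = probs.cumsum(dim=-1) - probs > top_p    # mass *before* token exceeds p
        sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(-1)
```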
For interactive exploration:
uv run python scripts/generate.py configs/train/hf_wikitext.toml \
--checkpoint.load_path=checkpoints/hf_wikitext/latest \
--data.tokenizer_path=gpt2 \
--interactive
What you learned¶
You ran the full pipeline: config-driven build, FSDP2 sharding, stateful resumption, and KV-cache generation — all on data streamed from the hub without a pre-tokenization step.
Extensions from here:
- Swap Wikitext for pre-tokenized shards → Prepare tokenized data.
- Move to a bigger model and add TP / EP / PP → Scaling guide.
- Launch from SLURM (single- or multi-node) → SLURM distributed setup.
- Handle SLURM preemption → Resilience § SLURM preemption.
- Go deeper on generation (KV cache, batching, samplers) → Generate from a checkpoint.
- Debug NaN / OOM / hangs / slowdowns → Debug training regressions.
- Run downstream benchmarks → Run evaluation.
- Try FP8 → Distributed § FP8.
See also¶
- Getting started — shorter install + quickstart for someone who just wants uv sync && uv run ….
- Configuration overview — what the TOML schema looks like and how CLI overrides compose.
- Checkpointing § Auto-resume — the resumption mechanics in detail.
- Generation — `generate()` internals and the KV-cache API.
- Training loop — what `scripts/train.py` actually does at each step.