Getting Started¶
Four pages that take you from a fresh clone to a running model. Read in order if you’re new; skip around if you know what you’re looking for.
What each page covers¶
Install — prerequisites, uv sync, environment verification, and
SLURM-specific module setup for Kempner clusters.
Quickstart — a 5-minute walkthrough: debug run on one GPU, then multi-GPU, then point at your own data, then swap optimizer, then enable MoE, then extend via hooks. Every step is a single command.
Your First Training Run — slows down and explains what the debug run actually does: what the log line means, what’s in the checkpoint directory, how auto-resume finds the latest step, and what to change next.
Notebooks — summaries of the six interactive notebooks under
examples/notebooks/ and when to open each.
Prerequisites before you start¶
A Linux host with Python 3.12+ and at least one NVIDIA GPU (H100/H200/A100). CPU-only also works for the inspection notebooks; training steps will be slow.
uv installed. All commands in these pages use
uv run, which activates the project venv automatically — you do not need tosource .venv/bin/activateyourself.For multi-node runs on Kempner clusters: SLURM account and partition names from your PI or cluster admin.
If any of those are missing, start at Install.