# SLURM distributed setup
KempnerForge ships three SLURM launch scripts under `scripts/slurm/`:

| Script | Launcher | Scope |
|---|---|---|
| `singlenode.sh` | `torchrun --standalone` | 1 node, N GPUs |
| `multinode.sh` | `srun` | 2+ nodes, N×M GPUs |
| `interactive.sh` | `srun --jobid` | attach to an existing allocation |
This page explains when to use each, what environment the scripts assemble, and how preemption + auto-resume compose with them.
## Single node: `singlenode.sh`
```bash
sbatch scripts/slurm/singlenode.sh configs/train/7b.toml
sbatch scripts/slurm/singlenode.sh configs/train/7b.toml --train.max_steps=1000
```
The script:
- Reads `SLURM_GPUS_PER_NODE` (default 4) to build `torchrun --nproc_per_node=$NGPUS`
- Detects the first UP InfiniBand interface (`ip -br addr`) and exports `NCCL_SOCKET_IFNAME` / `GLOO_SOCKET_IFNAME` to it
- Launches `torchrun --standalone`, so there's no rendezvous coordination needed (single-host)
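
Put together, the core of the launch looks roughly like the sketch below. The variable names, the interface-detection one-liner, and the entry point are illustrative, not the script verbatim:

```bash
# Illustrative sketch of singlenode.sh's core; the real script may differ.
NGPUS=${SLURM_GPUS_PER_NODE:-4}

# First interface that is UP and named like InfiniBand ("ib*").
IB_IFACE=$(ip -br addr | awk '$2 == "UP" && $1 ~ /^ib/ {print $1; exit}')
export NCCL_SOCKET_IFNAME=$IB_IFACE
export GLOO_SOCKET_IFNAME=$IB_IFACE

# Single host: --standalone skips any external rendezvous.
torchrun --standalone --nproc_per_node="$NGPUS" scripts/train.py "$@"
```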
Edit the header for your cluster:
```bash
#SBATCH --partition=<partition-name>
#SBATCH --account=<account-name>
#SBATCH --gpus-per-node=4
#SBATCH --time=24:00:00
```
## Multi-node: `multinode.sh`
```bash
sbatch --nodes=4 scripts/slurm/multinode.sh configs/train/7b.toml
sbatch --nodes=8 scripts/slurm/multinode.sh \
    configs/train/7b.toml --train.max_steps=50000
```
Two things make multi-node different from single-node on this codebase:
1. **Launch method is `srun` directly, not `torchrun`.** Each `srun` task is one process bound to one GPU. SLURM already sets `SLURM_PROCID` / `SLURM_LOCALID` / `SLURM_NTASKS`, and `get_world_info` maps them to `RANK` / `LOCAL_RANK` / `WORLD_SIZE` (sketched just below). No torchrun in the loop.
2. **`--ntasks-per-node` must equal `--gpus-per-node`.** This is the load-bearing invariant. Violate it and local ranks map incorrectly, processes land on the wrong GPU, and NCCL fails silently or crashes at the first collective.
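
The mapping in the first point amounts to the following, expressed in shell terms (a sketch of what `get_world_info` effectively does, not its actual implementation):

```bash
# Inside each srun task, SLURM's variables translate directly to the values
# torch.distributed expects (illustrative; get_world_info does this in Python).
export RANK=$SLURM_PROCID         # global rank across all nodes
export LOCAL_RANK=$SLURM_LOCALID  # rank within this node, i.e. the GPU index
export WORLD_SIZE=$SLURM_NTASKS   # nodes × ntasks-per-node
```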
The header enforcing the invariant:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4    # ← must match gpus-per-node
#SBATCH --gpus-per-node=4
#SBATCH --signal=B:SIGTERM@120 # SLURM sends SIGTERM 120s before the time limit
#SBATCH --requeue              # auto-resubmit on preemption
```
The script then:
- Extracts `MASTER_ADDR` via `scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1`
- Picks a random free port in `[15000, 20000]` for `MASTER_PORT` (so two jobs on the same node don't collide)
- Detects the first UP IB interface and binds both NCCL and Gloo to it
- Launches `srun uv run python scripts/train.py $CONFIG`
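
A condensed sketch of the rendezvous setup, assuming the same names as above (port selection shown without a free-port check, which the real script may add):

```bash
# Illustrative sketch of multinode.sh's rendezvous setup.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=$(( (RANDOM % 5000) + 15000 ))  # somewhere in [15000, 20000)
export MASTER_ADDR MASTER_PORT

srun uv run python scripts/train.py "$CONFIG"
```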
### Why Gloo needs the same IB interface
Async DCP checkpointing uses a CPU-side Gloo process group for coordination. Without `GLOO_SOCKET_IFNAME` pointed at the IB interface, Gloo falls back to the management Ethernet (`em4` on Kempner nodes), which is on a different subnet; peer connections time out and the checkpoint never completes.
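
If you suspect Gloo ended up on the wrong interface, a quick sanity check on a compute node (interface names here are the Kempner ones mentioned above; substitute your own):

```bash
# The IB interface should be UP with an address on the IB subnet, and
# GLOO_SOCKET_IFNAME should point at it, not at the management Ethernet.
ip -br addr | grep -E '^(ib0|em4)'
echo "GLOO_SOCKET_IFNAME=$GLOO_SOCKET_IFNAME"
```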
### NCCL environment the script sets
```bash
NCCL_SOCKET_IFNAME=ib0   # OOB bootstrap + socket transport
NCCL_IB_DISABLE=0        # enable IB verbs transport
NCCL_NET_GDR_LEVEL=2     # GPU-direct RDMA
NCCL_IB_GID_INDEX=3      # H100/H200 fabric config
NCCL_TIMEOUT=1800        # 30 minutes; raise for slow collectives
GLOO_SOCKET_IFNAME=ib0
```
`NCCL_IB_GID_INDEX=3` is specific to the RoCE/IB fabric on Kempner H100/H200 nodes. On other clusters this may need to be 0 or 1; check with your admins or `ibv_devinfo`.
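
To inspect the GID table yourself (requires the ibverbs utilities; output format varies by driver):

```bash
# Verbose device info includes the per-port GID table; pick the index whose
# entry matches your fabric (RoCE v2 vs. IB, IPv4-mapped vs. link-local).
ibv_devinfo -v | grep -i gid
```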
## Interactive / existing allocation: `interactive.sh`
When you already have an allocation (`salloc` or a queued job) and want to attach training to it without a new `sbatch`:
```bash
# Pass the JOBID of the existing allocation, then the config:
bash scripts/slurm/interactive.sh 3565401 configs/train/debug.toml
bash scripts/slurm/interactive.sh 3565401 configs/train/7b.toml --train.max_steps=50
```
The script resolves `NODELIST` / `MASTER_ADDR` via `squeue -j $JOBID` and `scontrol show hostnames`, auto-detects the IB interface, then dispatches `srun --jobid=$JOBID` with `--ntasks-per-node=$GPUS_PER_NODE`. Useful for debugging on a reserved node without going through the queue again.
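
Roughly, the resolution step looks like this (a sketch; `GPUS_PER_NODE` and the config path are placeholders):

```bash
# Illustrative: attach an srun step to an existing allocation by job ID.
JOBID=3565401
NODELIST=$(squeue -j "$JOBID" --noheader --format=%N)
MASTER_ADDR=$(scontrol show hostnames "$NODELIST" | head -n 1)
export MASTER_ADDR

srun --jobid="$JOBID" --ntasks-per-node="$GPUS_PER_NODE" \
    uv run python scripts/train.py configs/train/debug.toml
```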
## Preemption, SIGTERM, auto-resume
The full mechanics live in Resilience § SLURM preemption; the short version:
- `#SBATCH --signal=B:SIGTERM@120` tells SLURM: "send SIGTERM 120 seconds before the wallclock time limit, to rank 0 (`B:` = batch script process)."
- The training loop's `ShutdownHandler` catches SIGTERM, flips `should_shutdown()` to `True`, and the step loop writes an emergency DCP checkpoint before exiting cleanly.
- `#SBATCH --requeue` puts the job back in the queue. When it starts again, `CheckpointManager` follows the `latest` symlink in `checkpoint.dir` and resumes at the exact step / sample.
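
How the SIGTERM travels from the batch script to the training processes is up to the launch script; a common pattern (shown as a sketch, not necessarily what `multinode.sh` does) is to run the step in the background and forward the signal:

```bash
# Sketch: forward SLURM's SIGTERM to the srun step so ShutdownHandler
# inside the training processes can trigger the emergency checkpoint.
srun uv run python scripts/train.py "$CONFIG" &
TRAIN_PID=$!
trap 'kill -TERM "$TRAIN_PID"' SIGTERM
wait "$TRAIN_PID"   # returns early when the trap fires...
wait "$TRAIN_PID"   # ...so wait again for the step to finish checkpointing
```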
The 120-second window is a design parameter: emergency checkpoints must finish within it or the job is SIGKILL'd. Async DCP save times depend on model size, FSDP degree, and filesystem; measure on your cluster with `checkpoint.async_save = true` and a test run before trusting the default for 70B+. If your save consistently overruns, raise `#SBATCH --signal=B:SIGTERM@<seconds>` accordingly.
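
For example, to allow five minutes instead of two:

```bash
#SBATCH --signal=B:SIGTERM@300   # SIGTERM five minutes before the time limit
```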
### Auto-resume in practice
With `--requeue` set, you can monitor a preempted-and-restarted job:

```bash
squeue -j <jobid>       # see current state
sacct -j <jobid> --format=JobID,State,ExitCode,Start,End
scontrol show job <jobid> | grep RestartCount
```
On first launch, `SLURM_RESTART_COUNT` is 0; after a requeue it increments. The script echoes this so the log makes it obvious:

```
Restart cnt: 2
```
If you see rising restart counts but training never gets past step `eval_interval`, something is preempting the job faster than it can checkpoint; raise the `#SBATCH --signal` window or lower `checkpoint.interval`.
## Checking the launch worked
The first time you run on a new cluster, verify:

- **All ranks start.** The script's banner line should print once per rank, and the rank 0 log should show `world_size = N × M`.
- **NCCL talks over IB.** Set `NCCL_DEBUG=INFO` in the script once to see the negotiated transport. Each rank should log `NET/IB` (not `NET/Socket`).
- **The first all-reduce succeeds.** Rank 0 reports the model build and step-1 loss within a minute or two. A multi-minute hang at step 1 usually means Gloo is on the wrong interface (DCP init fails) or the IB GID is wrong.
If step 1 hangs, kill the job and re-run with:
```bash
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
```
## What the SLURM scripts don't do
- **Tokenizer caching.** You still need to pre-cache HuggingFace tokenizers on a login node (compute nodes are usually air-gapped). See Prepare tokenized data § Cache the tokenizer first.
- **Data staging.** `configs/train/*.toml` expects your dataset at `data.dataset_path`. Pre-copy or symlink before `sbatch`.
- **Env mounts.** The scripts assume `uv run python` resolves the repo's `.venv`. If your cluster uses different mount points, edit the `uv run` lines.
## See also
- Resilience § SLURM preemption: the full `ShutdownHandler` timeline and SLURM requeue dance.
- Distributed § DeviceMesh: how `init_distributed` turns rank info into a mesh.
- End-to-end training run: runs through the single-node path before this page scales it.
- Scaling guide: which parallelism combo to pick at each GPU count; this page tells you how to launch it.
- `scripts/slurm/multinode.sh`: the reference script this page documents.