Resilience
Long training runs die — preempted, NaN’d, a GPU falls off the bus. KempnerForge’s resilience module handles the recoverable failures so the run picks up from the last checkpoint instead of burning a re-submit.
Three failure modes the module addresses:
Preemption / manual stop — SLURM sends a termination signal, KempnerForge catches it, writes an emergency checkpoint, and exits.
Numerical blow-up — NaN / Inf loss detected, optimizer step skipped, gradient zeroed. If it persists, stop so a human can roll back.
Silent distributed hangs — periodic NCCL liveness ping detects a dead peer before the training loop deadlocks.
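The third failure mode comes down to running a cheap collective under a deadline, so a dead peer shows up as a timeout rather than a frozen loop. A minimal sketch of that idea follows. It assumes a hypothetical helper `ping_with_timeout` and a caller-supplied `ping` callable (in a real job this might wrap a tiny `torch.distributed.all_reduce`). None of these names are KempnerForge's actual API.

```python
import threading
import time


def ping_with_timeout(ping, timeout_sec):
    """Run `ping` in a daemon thread; return True if it finishes in time.

    `ping` is a hypothetical zero-argument callable standing in for a
    small collective (for example an all-reduce on a one-element tensor).
    A ping that raises is treated like a ping that never returns.
    """
    done = threading.Event()

    def _worker():
        ping()
        done.set()

    # Daemon thread: a ping that never returns cannot block interpreter exit.
    threading.Thread(target=_worker, daemon=True).start()
    return done.wait(timeout=timeout_sec)


def should_ping(step, interval):
    """Ping every `interval` steps; an interval of 0 disables the check."""
    return interval > 0 and step % interval == 0


if __name__ == "__main__":
    healthy = ping_with_timeout(lambda: None, timeout_sec=1.0)
    hung = ping_with_timeout(lambda: time.sleep(0.5), timeout_sec=0.05)
    print(healthy, hung)
```

Running the check only every N steps keeps the extra collective off the hot path. The timeout is what turns a silent hang into an explicit failure the job can react to.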
At a glance
| Component | Source | Wired into |
|---|---|---|
| | | always-on |
| | | hardcoded |
| | | opt-in via |
| | | no (manual utility) |
| | | |
| | | at checkpoint load time |
Everything in the first column is importable from kempnerforge.resilience.
Config
[train]
shutdown_timeout_sec = 600.0 # ShutdownHandler hard deadline (0 = disabled)
nccl_health_check_interval = 0 # NCCL ping every N steps (0 = disabled)
Two knobs. There is deliberately no nan_detection section — the
action and max-consecutive count are hardcoded in scripts/train.py
(action="warn", max_consecutive=10). Edit the script if you need
different behavior; see NaN detection.
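To make the hardcoded settings concrete, here is a minimal sketch of a consecutive-NaN counter. `SketchNaNDetector` is a hypothetical stand-in, not the class in kempnerforge.resilience. How the real detector interprets `action` and `max_consecutive`, the `"raise"` action, and the `should_stop` property are all assumptions made for illustration.

```python
import math
import warnings


class SketchNaNDetector:
    """Hypothetical consecutive-NaN counter (not the KempnerForge class)."""

    def __init__(self, action="warn", max_consecutive=10):
        self.action = action
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    def check_loss(self, loss):
        """Return True if `loss` is finite, False if the step should be skipped."""
        if math.isfinite(loss):
            self.consecutive = 0  # a healthy step resets the streak
            return True
        self.consecutive += 1
        if self.action == "warn":
            warnings.warn(f"non-finite loss ({self.consecutive} in a row)")
        elif self.action == "raise":  # assumed alternative action
            raise FloatingPointError("non-finite loss")
        return False

    @property
    def should_stop(self):
        """True once the streak reaches the limit, so a human can roll back."""
        return self.consecutive >= self.max_consecutive
```

In a training loop, the caller would skip the optimizer step whenever `check_loss` returns False and break out of the loop once `should_stop` becomes True. Counting consecutive failures rather than total failures lets a single bad batch pass without ending the run.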
SLURM launch
The reference preemption-resilient launch script is
scripts/slurm/7b_requeue.sh:
#SBATCH --signal=B:SIGTERM@120 # SIGTERM 120s before hard kill
#SBATCH --requeue # auto-resubmit on preempt
srun --kill-on-bad-exit=1 uv run python scripts/train.py "${CONFIG}"
Pair this with a checkpoint interval of a few hundred steps (~1.5 hours for a 7B run on 16 H100s), so that a run whose emergency checkpoint does not complete loses no more than that.
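On the Python side, a common pattern is for the signal handler to do nothing but set a flag, which the training loop polls between steps so the checkpoint is written from a consistent state. A minimal sketch of that pattern follows. `SketchShutdownHandler`, `train`, and `deadline_exceeded` are hypothetical names, and the real ShutdownHandler may enforce its deadline differently.

```python
import signal
import time


class SketchShutdownHandler:
    """Hypothetical flag-based handler; not the kempnerforge.resilience class."""

    def __init__(self, timeout_sec=600.0):
        self.timeout_sec = timeout_sec  # 0 disables the deadline
        self._requested_at = None

    def install(self):
        # Must be called from the main thread (a CPython requirement).
        signal.signal(signal.SIGTERM, self._on_signal)

    def _on_signal(self, signum, frame):
        # Do no I/O here: just record that a shutdown was requested.
        if self._requested_at is None:
            self._requested_at = time.monotonic()

    def should_shutdown(self):
        return self._requested_at is not None

    def deadline_exceeded(self):
        """Assumed semantics: True once timeout_sec has passed since the signal."""
        if self._requested_at is None or self.timeout_sec <= 0:
            return False
        return time.monotonic() - self._requested_at > self.timeout_sec


def train(handler, num_steps, save_checkpoint):
    """Toy loop: poll the flag once per step and checkpoint before exiting."""
    for step in range(num_steps):
        if handler.should_shutdown():
            save_checkpoint(step)  # emergency checkpoint
            return step
        # ... forward / backward / optimizer step would go here ...
    return num_steps
```

Keeping the handler itself trivial avoids writing a checkpoint from inside a signal context, where the model and optimizer may be mid-update.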
Pages
See also
Checkpointing § Auto-resume: what a requeued job does on startup.
Training § Training loop: where shutdown_handler.should_shutdown() and nan_detector.check_loss() are polled each step.
Configuration § [train]: the two resilience-related config fields.