Resilience

Long training runs die — preempted, NaN’d, a GPU falls off the bus. KempnerForge’s resilience module handles the recoverable failures so the run picks up from the last checkpoint instead of burning a re-submit.

Three failure modes the module addresses:

- Preemption / manual stop: SLURM sends a termination signal, KempnerForge catches it, writes an emergency checkpoint, and exits.
- Numerical blow-up: a NaN / Inf loss is detected, the optimizer step is skipped, and the gradients are zeroed. If it persists, the run stops so a human can roll back.
- Silent distributed hangs: a periodic NCCL liveness ping detects a dead peer before the training loop deadlocks.
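
The preemption path above follows a common pattern: the signal handler only sets a flag, and the training loop checks that flag at a step boundary before checkpointing and exiting. A minimal sketch of that pattern, using only the standard library. `StopFlag` and `run_loop` are hypothetical names for illustration and are not KempnerForge's `ShutdownHandler`; handling `SIGINT` alongside `SIGTERM` is also an assumption.

```python
import signal


class StopFlag:
    """Hypothetical helper: remembers that a termination signal arrived."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGINT)):
        self.requested = False
        for sig in signals:
            # Must be called from the main thread of the process.
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame):
        # Keep the handler tiny: set a flag and let the loop do the work.
        self.requested = True


def run_loop(num_steps, stop, step_fn, save_checkpoint):
    """Run up to num_steps steps; return the last step that completed."""
    completed = 0
    for step in range(1, num_steps + 1):
        step_fn(step)  # stand-in for forward / backward / optimizer step
        completed = step
        if stop.requested:
            # Checkpoint at a step boundary so the saved state is consistent.
            save_checkpoint(step)
            break
    return completed
```

Checking the flag between steps, rather than saving from inside the handler, avoids writing a checkpoint while the model or optimizer state is mid-update.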

At a glance

| Component | Source | Wired into `train.py`? |
| --- | --- | --- |
| `ShutdownHandler` | `resilience/signal_handler.py` | always-on |
| `NaNDetector` | `resilience/health.py` | hardcoded `action="warn"` |
| `check_nccl_health` | `resilience/health.py` | opt-in via `train.nccl_health_check_interval` |
| `check_gpu_health` | `resilience/health.py` | no (manual utility) |
| `SLURMInfo` / `get_slurm_info` / `log_job_info` | `resilience/elastic.py` | `log_job_info()` at startup |
| `resolve_resume_path` | `resilience/elastic.py` | at checkpoint load time |

Everything in the first column is importable from `kempnerforge.resilience`.

Config

```toml
[train]
shutdown_timeout_sec       = 600.0   # ShutdownHandler hard deadline (0 = disabled)
nccl_health_check_interval = 0       # NCCL ping every N steps (0 = disabled)
```

Two knobs. There is deliberately no `nan_detection` section: the action and max-consecutive count are hardcoded in `scripts/train.py` (`action="warn"`, `max_consecutive=10`). Edit the script if you need different behavior; see NaN detection.
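
To make the max-consecutive idea concrete, here is a rough sketch of a counter that tolerates isolated non-finite losses and gives up once they persist. `ConsecutiveNaNCounter` is a hypothetical stand-in, not the actual `NaNDetector`; the return values and the exact threshold comparison (`>=`) are assumptions.

```python
import math


class ConsecutiveNaNCounter:
    """Hypothetical stand-in for illustration; not KempnerForge's NaNDetector."""

    def __init__(self, max_consecutive=10):
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    def check(self, loss):
        """Return 'ok', 'skip' (drop this step), or 'stop' (give up)."""
        if math.isfinite(loss):
            # A healthy step resets the streak.
            self.consecutive = 0
            return "ok"
        # NaN and +/-Inf both count as a bad step.
        self.consecutive += 1
        if self.consecutive >= self.max_consecutive:
            return "stop"
        return "skip"
```

A training loop would call `check(loss)` once per step, skip the optimizer update on `"skip"`, and exit on `"stop"` so someone can roll back to an earlier checkpoint.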

SLURM launch

The reference preemption-resilient launch script is `scripts/slurm/7b_requeue.sh`:

```bash
#SBATCH --signal=B:SIGTERM@120   # SIGTERM 120s before hard kill
#SBATCH --requeue                # auto-resubmit on preempt

srun --kill-on-bad-exit=1 uv run python scripts/train.py "${CONFIG}"
```

Pair with a checkpoint interval of a few hundred steps (~1.5 hours for a 7B run on 16 H100s), so that even if the emergency checkpoint fails or times out, the run never loses more than one interval of work.
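
The worst-case loss is simply the checkpoint interval multiplied by the time per step. A small sketch of that arithmetic follows; `worst_case_loss_hours` is a hypothetical helper, and the numbers in the example are illustrative values chosen to land on 1.5 hours, not measurements from a real run.

```python
def worst_case_loss_hours(checkpoint_interval_steps, seconds_per_step):
    """Hours of training lost if the run dies just before the next checkpoint."""
    return checkpoint_interval_steps * seconds_per_step / 3600.0


# Illustrative only: 300 steps between checkpoints at an assumed 18 s/step.
print(worst_case_loss_hours(300, 18.0))  # 1.5
```

Plugging in your own measured step time shows how short the interval must be to meet a given loss budget.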
