kempnerforge.resilience.signal_handler
Graceful shutdown via signal handling.
Handles SIGTERM (SLURM preemption / graceful shutdown) and SIGUSR1 (SLURM requeue) by setting a flag that the training loop checks after each step. On signal, the loop saves an emergency checkpoint and exits.
Timeout protection ensures the process exits even if graceful shutdown stalls (e.g., when it is stuck in an NCCL collective).
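The flag-setting pattern described above can be sketched with the standard library alone. This is a minimal illustration, not kempnerforge's implementation: the class name ShutdownFlag and its methods are hypothetical, and the real ShutdownHandler may differ.

```python
import signal


class ShutdownFlag:
    """Hypothetical stand-in for the flag-setting pattern (not kempnerforge code)."""

    def __init__(self):
        self._requested = False
        self._signum = None

    def install(self, signals=(signal.SIGTERM, signal.SIGUSR1)):
        # signal.signal may only be called from the main thread.
        for sig in signals:
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame):
        # Keep the handler trivial: record the request and return.
        # Checkpointing happens later, in the training loop.
        self._requested = True
        self._signum = signum

    def requested(self):
        return self._requested

    def received_signal(self):
        # Signal number that triggered the request, or None if none arrived.
        return self._signum
```

Doing only a flag write inside the handler keeps the signal path safe: the expensive work (saving a checkpoint) runs at a step boundary, where model and optimizer state are consistent.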
Classes
ShutdownHandler | Cooperative shutdown handler for long-running training jobs.
- class kempnerforge.resilience.signal_handler.ShutdownHandler
Bases: object
Cooperative shutdown handler for long-running training jobs.
Register this handler before the training loop. The training loop checks should_shutdown() after each step and takes appropriate action (save checkpoint, clean up, exit).

If the graceful shutdown exceeds timeout_sec, a forced exit is triggered via os._exit to avoid hanging on stuck collectives.

Usage:
    handler = ShutdownHandler(timeout_sec=120)
    handler.register()
    for step in range(max_steps):
        train_step()
        if handler.should_shutdown():
            save_checkpoint()
            handler.finish()
            break
- Parameters:
timeout_sec – Maximum seconds allowed for graceful shutdown before forced exit. Set to 0 to disable the timeout.
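One common way to implement this kind of timeout is a daemon timer that calls os._exit if it is not cancelled in time. The sketch below assumes that approach. ShutdownWatchdog, exit_fn, and exit_code are hypothetical names introduced for illustration, and the source does not say how ShutdownHandler implements its timeout internally.

```python
import os
import threading


class ShutdownWatchdog:
    """Hypothetical helper: force an exit if graceful shutdown overruns."""

    def __init__(self, timeout_sec, exit_fn=os._exit, exit_code=1):
        self.timeout_sec = timeout_sec
        self._exit_fn = exit_fn  # injectable so the sketch can be tested
        self._exit_code = exit_code
        self._timer = None

    def start(self):
        # A timeout of 0 disables the watchdog, mirroring the documented parameter.
        if self.timeout_sec <= 0:
            return
        self._timer = threading.Timer(
            self.timeout_sec, self._exit_fn, args=(self._exit_code,)
        )
        self._timer.daemon = True  # never keep the process alive on its own
        self._timer.start()

    def cancel(self):
        # Call once graceful shutdown has completed in time.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```

os._exit terminates the process immediately, without running cleanup handlers or flushing buffers, so it still works when the main thread is blocked in a collective. That also means it should remain a last resort after the graceful path has had its chance.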
- should_shutdown()
Check if the training loop should exit.
Call this after each training step.
- Return type: