kempnerforge.resilience.signal_handler

Graceful shutdown via signal handling.

Handles SIGTERM (SLURM preemption / graceful shutdown) and SIGUSR1 (SLURM requeue) by setting a flag that the training loop checks after each step. On signal, the loop saves an emergency checkpoint and exits.

Timeout protection ensures the process exits even if graceful shutdown stalls (e.g., stuck in an NCCL collective).
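The flag pattern described above can be sketched in a few lines. This is a minimal illustration only, not the kempnerforge implementation: MiniShutdownFlag, install, and run are hypothetical names, and the toy loop merely counts steps where a real loop would train and save a checkpoint.

```python
import signal


class MiniShutdownFlag:
    """Hypothetical stand-in for the flag pattern; not the real ShutdownHandler."""

    def __init__(self):
        self.requested = False
        self.signum = None

    def _on_signal(self, signum, frame):
        # Keep the handler tiny: record which signal arrived and return.
        self.requested = True
        self.signum = signal.Signals(signum)

    def install(self):
        # signal.signal may only be called from the main thread.
        for sig in (signal.SIGTERM, signal.SIGUSR1):
            signal.signal(sig, self._on_signal)


def run(flag, max_steps):
    """Toy loop: returns the number of steps completed before stopping."""
    steps_done = 0
    for _ in range(max_steps):
        steps_done += 1  # stands in for one training step
        if flag.requested:
            break  # a real loop would save an emergency checkpoint here
    return steps_done
```

Because the handler only sets a flag, the in-flight step always finishes before the loop reacts, which is why the check belongs after each step.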

Classes

ShutdownHandler

Cooperative shutdown handler for long-running training jobs.

class kempnerforge.resilience.signal_handler.ShutdownHandler[source]

Bases: object

Cooperative shutdown handler for long-running training jobs.

Register this handler before the training loop. The training loop checks should_shutdown() after each step and takes appropriate action (save checkpoint, clean up, exit).

If the graceful shutdown exceeds timeout_sec, a forced exit is triggered via os._exit to avoid hanging on stuck collectives.

Usage:

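from kempnerforge.resilience.signal_handler import ShutdownHandler

handler = ShutdownHandler(timeout_sec=120)
handler.register()  # call from the main thread, before the loop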

for step in range(max_steps):
    train_step()
    if handler.should_shutdown():
        save_checkpoint()
        handler.finish()
        break

Parameters:

timeout_sec – Maximum seconds allowed for graceful shutdown before a forced exit. Defaults to 600.0; set to 0 to disable the timeout.
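One plausible way to build such a forced-exit timeout is a daemon threading.Timer that calls os._exit. The sketch below is an assumption about the general technique, not a copy of the library's code: start_watchdog, the on_timeout parameter, and the exit status 1 are all hypothetical.

```python
import os
import threading


def start_watchdog(timeout_sec, on_timeout=None):
    """Hypothetical helper: return a started timer, or None when disabled."""
    if timeout_sec <= 0:
        return None  # 0 disables the timeout, as documented above

    if on_timeout is None:
        def on_timeout():
            # os._exit skips atexit hooks and finalizers, so it cannot
            # block on a hung collective. Exit status 1 is an assumption.
            os._exit(1)

    timer = threading.Timer(timeout_sec, on_timeout)
    timer.daemon = True  # never keep the interpreter alive for the timer alone
    timer.start()
    return timer
```

Calling cancel() on the returned timer before it fires is what makes a completed graceful shutdown safe from the forced exit.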

__init__(timeout_sec=600.0)[source]

Parameters:

timeout_sec (float)

Return type:

None

property shutdown_requested: bool

Whether a shutdown signal has been received.

property signal_received: Signals | None

The signal that triggered shutdown, or None.
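A caller might branch on signal_received to log why the job is stopping. The helper below is a hypothetical sketch; shutdown_reason and its return strings are not part of the library, and the mapping simply mirrors the SIGTERM and SIGUSR1 roles described in the module summary.

```python
import signal


def shutdown_reason(sig):
    """Hypothetical helper: map a Signals value (or None) to a short label."""
    if sig is None:
        return "none"  # no shutdown signal has been received
    if sig == signal.SIGUSR1:
        return "requeue"  # SLURM requeue
    if sig == signal.SIGTERM:
        return "preempt-or-terminate"  # SLURM preemption or graceful shutdown
    return f"other:{sig.name}"
```

With a real handler this would be called as shutdown_reason(handler.signal_received).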

should_shutdown()[source]

Check if the training loop should exit.

Call this after each training step.

Return type:

bool

register()[source]

Register signal handlers for SIGTERM and SIGUSR1.

Must be called from the main thread.

Return type:

None
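The main-thread requirement comes from Python itself: signal.signal raises ValueError when called from any other thread. The short demonstration below uses only the standard library; try_register_in_thread is a hypothetical name.

```python
import signal
import threading


def try_register_in_thread():
    """Return the exception raised by signal.signal in a worker thread, if any."""
    result = {}

    def worker():
        try:
            signal.signal(signal.SIGTERM, signal.SIG_DFL)
        except ValueError as exc:
            # CPython only allows handler installation from the main thread.
            result["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result.get("error")
```

In practice this means register() should be called during startup, before any data-loading or checkpointing threads take over the setup path.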

unregister()[source]

Restore original signal handlers.

Return type:

None
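Restoring handlers generally relies on the fact that signal.signal returns the previously installed handler. The following is a sketch of that save-and-restore idea under assumed names (install and restore are hypothetical), not the library's actual code; it must run in the main thread.

```python
import signal


def install(handler, signals=(signal.SIGTERM, signal.SIGUSR1)):
    """Install handler for each signal; return the previous handlers by signal."""
    return {sig: signal.signal(sig, handler) for sig in signals}


def restore(previous):
    """Put back the handlers saved by install()."""
    for sig, old in previous.items():
        # None means the old handler was not installed from Python,
        # so it cannot be passed back to signal.signal.
        if old is not None:
            signal.signal(sig, old)
```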

finish()[source]

Call after graceful shutdown is complete.

Cancels the forced-exit timer and restores signal handlers.

Return type:

None