kempnerforge.resilience¶
Fault tolerance and resilience for KempnerForge.
- class kempnerforge.resilience.NaNDetector[source]¶
Bases: object
Detects and tracks NaN/Inf values in loss and gradients.
Supports three responses to NaN:
- "warn": Log a warning and continue.
- "skip": Skip the optimizer step (zero gradients).
- "raise": Raise a RuntimeError.
If the consecutive NaN count exceeds max_consecutive, the detector signals that a checkpoint rollback is recommended.
- Parameters:
action – What to do when NaN is detected.
max_consecutive – Consecutive NaN steps before recommending rollback.
max_history – Number of NaN step indices to retain.
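For illustration, a detector configured to skip bad steps might be constructed like this (the argument values here are arbitrary):

from kempnerforge.resilience import NaNDetector

# Skip the optimizer step on NaN; recommend a checkpoint rollback after
# 5 consecutive bad steps; retain the last 100 NaN step indices.
detector = NaNDetector(action="skip", max_consecutive=5, max_history=100)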
- check_loss(loss, step)[source]¶
Check a loss value for NaN/Inf.
When running distributed, all-reduces a NaN flag so ALL ranks agree on whether to skip. This prevents rank desync, where one rank sees NaN and skips its optimizer step while others proceed normally.
- Parameters:
loss – The loss value to check.
step (int) – Current training step.
- Returns:
True if the loss is valid (finite) on ALL ranks, False if any rank has NaN/Inf.
- Raises:
RuntimeError – If action is "raise" and NaN is detected.
- Return type:
bool
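A minimal sketch of the "skip" pattern described above; model, optimizer, loss_fn, and loader are placeholders, and zeroing gradients explicitly mirrors the documented "skip" behavior:

for step, batch in enumerate(loader):
    loss = loss_fn(model(batch))
    loss.backward()
    if not detector.check_loss(loss, step):
        # Every rank sees the same all-reduced flag, so all ranks
        # drop this update together and stay in sync.
        optimizer.zero_grad()
        continue
    optimizer.step()
    optimizer.zero_grad()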
- check_gradients(model, step)[source]¶
Check model gradients for NaN/Inf before optimizer step.
- Parameters:
model (torch.nn.Module) – The model to check.
step (int) – Current training step.
- Returns:
True if all gradients are finite.
- Return type:
bool
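Continuing the sketch above, the gradient check guards the optimizer step (all names other than check_gradients are placeholders):

loss.backward()
if detector.check_gradients(model, step):
    optimizer.step()   # Only update when every gradient is finite.
optimizer.zero_grad()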
- class kempnerforge.resilience.NaNState[source]¶
Bases: object
Tracks NaN/Inf occurrences across training steps.
- class kempnerforge.resilience.SLURMInfo[source]¶
Bases: object
Information about the current SLURM job.
- class kempnerforge.resilience.ShutdownHandler[source]¶
Bases: object
Cooperative shutdown handler for long-running training jobs.
Register this handler before the training loop. The training loop checks should_shutdown() after each step and takes appropriate action (save checkpoint, clean up, exit). If the graceful shutdown exceeds timeout_sec, a forced exit is triggered via os._exit to avoid hanging on stuck collectives.
Usage:

handler = ShutdownHandler(timeout_sec=120)
handler.register()
for step in range(max_steps):
    train_step()
    if handler.should_shutdown():
        save_checkpoint()
        handler.finish()
        break
- Parameters:
timeout_sec – Maximum seconds allowed for graceful shutdown before forced exit. Set to 0 to disable the timeout.
- should_shutdown()[source]¶
Check if the training loop should exit.
Call this after each training step.
- Return type:
bool
- kempnerforge.resilience.check_gpu_health(device=0)[source]¶
Run basic GPU health checks.
- Performs:
  - CUDA availability check
  - Small test computation on the device
  - Memory allocation test
- kempnerforge.resilience.check_nccl_health(timeout_sec=10.0)[source]¶
Check NCCL communication health via a lightweight all-reduce.
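A startup routine might run both health checks before entering the training loop. The return values are not documented above, so this sketch assumes each helper returns a truthy value on success:

from kempnerforge.resilience import check_gpu_health, check_nccl_health

# Fail fast at job start rather than partway through a long run.
if not check_gpu_health(device=0):
    raise SystemExit("GPU health check failed")
if not check_nccl_health(timeout_sec=10.0):
    raise SystemExit("NCCL health check failed")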
- kempnerforge.resilience.get_slurm_info()[source]¶
Read SLURM job information from environment variables.
- Returns:
SLURMInfo if running under SLURM, None otherwise.
- Return type:
SLURMInfo | None
- kempnerforge.resilience.is_slurm_requeue()[source]¶
Check if this is a requeued SLURM job.
Uses SLURM_RESTART_COUNT (set by SLURM on requeue).
- Return type:
bool
- kempnerforge.resilience.log_job_info()[source]¶
Log SLURM job information (if running under SLURM).
- Return type:
None
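Taken together, the SLURM helpers support an auto-resume decision at startup. A sketch, with the resume logic itself left as a placeholder:

from kempnerforge.resilience import get_slurm_info, is_slurm_requeue, log_job_info

log_job_info()             # Logs job details; does nothing outside SLURM.
info = get_slurm_info()    # SLURMInfo, or None outside SLURM.
if info is not None and is_slurm_requeue():
    # SLURM_RESTART_COUNT is set: the job was requeued, so resume
    # from the latest checkpoint instead of starting fresh.
    resume = True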
- kempnerforge.resilience.resolve_resume_path(checkpoint_dir)[source]¶
Find the latest checkpoint for auto-resume.
- Checks:
  - The {checkpoint_dir}/latest symlink
  - The most recent step_N directory by step number
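For example, combined with the requeue check above (the checkpoint directory and load_checkpoint are hypothetical, and this assumes None is returned when no checkpoint exists):

from kempnerforge.resilience import resolve_resume_path

ckpt = resolve_resume_path("/scratch/runs/exp1/checkpoints")
if ckpt is not None:
    load_checkpoint(ckpt)  # load_checkpoint is a placeholder.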
Modules
- Elastic training and SLURM integration helpers.
- GPU health monitoring and NaN detection.
- Graceful shutdown via signal handling.