kempnerforge.resilience.health¶
GPU health monitoring and NaN detection.
Provides utilities for detecting common failures during training:

- NaN/Inf in loss or gradients
- GPU availability and basic health
- NCCL liveness via lightweight collectives
Functions

- Run basic GPU health checks.
- Check NCCL communication health via a lightweight all-reduce.
Classes

- NaNDetector – Detects and tracks NaN/Inf values in loss and gradients.
- NaNState – Tracks NaN/Inf occurrences across training steps.
- class kempnerforge.resilience.health.NaNState[source]¶
  Bases: object

  Tracks NaN/Inf occurrences across training steps.
- class kempnerforge.resilience.health.NaNDetector[source]¶
  Bases: object

  Detects and tracks NaN/Inf values in loss and gradients.
- Supports three responses to NaN:
  - "warn": Log a warning and continue.
  - "skip": Skip the optimizer step (zero gradients).
  - "raise": Raise a RuntimeError.
  If the consecutive NaN count exceeds max_consecutive, the detector signals that a checkpoint rollback is recommended.

- Parameters:
action – What to do when NaN is detected.
max_consecutive – Consecutive NaN steps before recommending rollback.
max_history – Number of NaN step indices to retain.
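The policy described above can be sketched in a simplified, standalone form. This is an illustrative assumption of how the action modes and consecutive-NaN counting fit together, not the actual kempnerforge implementation (which operates on torch tensors); the class and method names below are hypothetical.

```python
import math


class SimpleNaNDetector:
    """Illustrative sketch of the NaN-response policy: warn, skip, or raise."""

    def __init__(self, action="warn", max_consecutive=3, max_history=100):
        assert action in ("warn", "skip", "raise")
        self.action = action
        self.max_consecutive = max_consecutive
        self.max_history = max_history
        self.consecutive = 0   # consecutive NaN/Inf steps seen so far
        self.nan_steps = []    # recent step indices where NaN/Inf occurred

    def check_loss(self, loss, step):
        """Return True if the loss is finite; otherwise apply the action."""
        if math.isfinite(loss):
            self.consecutive = 0  # a healthy step resets the streak
            return True
        # Record the offending step, keeping at most max_history entries.
        self.nan_steps.append(step)
        self.nan_steps = self.nan_steps[-self.max_history:]
        self.consecutive += 1
        if self.action == "raise":
            raise RuntimeError(f"NaN/Inf loss at step {step}")
        # "warn" and "skip" both report the loss as invalid; the caller
        # decides whether to continue or to zero gradients and skip.
        return False

    def rollback_recommended(self):
        """Signal rollback once the NaN streak exceeds max_consecutive."""
        return self.consecutive > self.max_consecutive
```

A caller using the "skip" action would zero gradients and skip the optimizer step whenever `check_loss` returns False, and restore from a checkpoint once `rollback_recommended()` becomes True.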
- check_loss(loss, step)[source]¶
Check a loss value for NaN/Inf.
When running distributed, all-reduces a NaN flag so ALL ranks agree on whether to skip. Prevents rank desync where one rank sees NaN and skips its optimizer step while others proceed normally.
- Parameters:
  loss – The loss value to check.
  step (int) – Current training step.
- Returns:
True if the loss is valid (finite) on ALL ranks, False if any rank has NaN/Inf.
- Raises:
RuntimeError – If action is “raise” and NaN is detected.
- Return type:
  bool
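The cross-rank agreement described above can be sketched as follows. The all-reduce is simulated here as a max over per-rank flags (in a real distributed run this would be a torch.distributed all-reduce with MAX); the function name and list-based setup are illustrative assumptions.

```python
import math


def all_ranks_loss_valid(per_rank_losses):
    """Sketch: every rank ends up with the same verdict on whether to skip."""
    # Each rank computes a local flag: 1 if its loss is NaN/Inf, else 0.
    local_flags = [0 if math.isfinite(loss) else 1 for loss in per_rank_losses]
    # all-reduce(MAX): after the collective, every rank holds the same
    # global flag, so no rank skips its optimizer step while others proceed.
    global_flag = max(local_flags)
    return global_flag == 0
```

If even one rank sees NaN/Inf, the reduced flag is 1 on all ranks, so every rank skips together and the ranks stay in sync.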
- check_gradients(model, step)[source]¶
Check model gradients for NaN/Inf before optimizer step.
- Parameters:
model (torch.nn.Module) – The model to check.
step (int) – Current training step.
- Returns:
True if all gradients are finite.
- Return type:
  bool
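The gradient check can be sketched in simplified form. The real method walks `model.parameters()` and inspects torch tensors; here each parameter's gradient is a plain list of floats (or None for parameters with no gradient), and the function name is a hypothetical stand-in.

```python
import math


def gradients_finite(grads_per_param):
    """Sketch: return True only if every gradient value is finite."""
    for grad in grads_per_param:
        if grad is None:
            # Parameters without gradients (e.g. frozen layers) are skipped.
            continue
        if not all(math.isfinite(g) for g in grad):
            return False
    return True
```

Under the "skip" action, a False result here would lead the caller to zero all gradients and skip the optimizer step rather than apply a corrupted update.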