kempnerforge.resilience.health

GPU health monitoring and NaN detection.

Provides utilities for detecting common failures during training:

NaN/Inf in loss or gradients
GPU availability and basic health
NCCL liveness via lightweight collectives

Functions

`check_gpu_health`([device])	Run basic GPU health checks.
`check_nccl_health`([timeout_sec])	Check NCCL communication health via a lightweight all-reduce.

Classes

`NaNDetector`	Detects and tracks NaN/Inf values in loss and gradients.
`NaNState`	Tracks NaN/Inf occurrences across training steps.

class kempnerforge.resilience.health.NaNState

Bases: object

Tracks NaN/Inf occurrences across training steps.

consecutive_nans: int = 0

total_nans: int = 0

last_good_loss: float = inf

last_good_step: int = 0

nan_steps: list[int]

__init__(consecutive_nans=0, total_nans=0, last_good_loss=inf, last_good_step=0, nan_steps=<factory>)

Parameters:

consecutive_nans (int)
total_nans (int)
last_good_loss (float)
last_good_step (int)
nan_steps (list[int])

Return type:

None
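The `nan_steps=<factory>` default in the signature suggests a dataclass field built with a `default_factory`. A minimal sketch of an equivalent container follows, assuming only the fields and defaults listed above. The name `NaNStateSketch` is hypothetical and is not the library's class.

```python
from dataclasses import dataclass, field


@dataclass
class NaNStateSketch:
    """Hypothetical stand-in mirroring the documented NaNState fields."""

    consecutive_nans: int = 0
    total_nans: int = 0
    last_good_loss: float = float("inf")
    last_good_step: int = 0
    # default_factory gives every instance its own list
    nan_steps: list[int] = field(default_factory=list)


a = NaNStateSketch()
b = NaNStateSketch()
a.nan_steps.append(12)  # recorded on a only; b keeps its own empty list
```

Using a factory rather than a shared `[]` default keeps the NaN history of separate instances independent.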

class kempnerforge.resilience.health.NaNDetector

Bases: object

Detects and tracks NaN/Inf values in loss and gradients.

Supports three responses to NaN:

"warn": Log a warning and continue.
"skip": Skip the optimizer step (zero gradients).
"raise": Raise a RuntimeError.

If the consecutive NaN count exceeds max_consecutive, the detector signals through should_rollback that a checkpoint rollback is recommended.

Parameters:

action – What to do when NaN is detected.
max_consecutive – Consecutive NaN steps before recommending rollback.
max_history – Number of NaN step indices to retain.

__init__(action='warn', max_consecutive=5, max_history=100)

Parameters:

action (str)
max_consecutive (int)
max_history (int)

Return type:

None

check_loss(loss, step)

Check a loss value for NaN/Inf.

When running distributed, the method all-reduces a NaN flag so that all ranks agree on whether to skip. This prevents rank desync, where one rank sees NaN and skips its optimizer step while the others proceed normally.

Parameters:

loss (float) – The scalar loss value to check.
step (int) – Current training step.

Returns:

True if the loss is valid (finite) on ALL ranks, False if any rank has NaN/Inf.

Raises:

RuntimeError – If action is "raise" and NaN is detected.

Return type:

bool
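The cross-rank agreement described for check_loss can be illustrated without a process group. The sketch below assumes each rank computes a local 0/1 flag and that the flags are combined with a max-style reduction, which is one common way to express "any rank saw NaN/Inf". This reference does not state which reduction kempnerforge uses. All helper names are hypothetical, and a plain list stands in for the ranks.

```python
import math


def local_nan_flag(loss: float) -> int:
    """Return 1 if this rank's loss is NaN/Inf, else 0."""
    return 0 if math.isfinite(loss) else 1


def simulated_all_reduce_max(flags: list[int]) -> int:
    """Stand-in for a MAX all-reduce: every rank would receive this value."""
    return max(flags)


def loss_is_valid_everywhere(per_rank_losses: list[float]) -> bool:
    """True only if no rank reported a non-finite loss."""
    flags = [local_nan_flag(loss) for loss in per_rank_losses]
    return simulated_all_reduce_max(flags) == 0


healthy = loss_is_valid_everywhere([2.31, 2.29, 2.35, 2.30])
one_bad_rank = loss_is_valid_everywhere([2.31, float("nan"), 2.35, 2.30])
```

Because every rank sees the same reduced flag, all ranks make the same skip-or-step decision, which is the property the method description relies on.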

check_gradients(model, step)

Check model gradients for NaN/Inf before optimizer step.

Parameters:

model (torch.nn.Module) – The model to check.
step (int) – Current training step.

Returns:

True if all gradients are finite.

Return type:

bool
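As a rough illustration of a gradient finiteness scan, the sketch below uses plain lists of floats in place of tensors so that it runs without PyTorch. The function name and the dict-of-lists layout are assumptions made for illustration. The documented method takes a torch.nn.Module instead, and its internals are not shown in this reference.

```python
import math
from typing import Optional


def non_finite_gradients(named_grads: dict[str, Optional[list[float]]]) -> list[str]:
    """Return names of parameters whose gradient contains NaN or Inf."""
    bad = []
    for name, grad in named_grads.items():
        if grad is None:  # parameter received no gradient this step
            continue
        if not all(math.isfinite(g) for g in grad):
            bad.append(name)
    return bad


# Hypothetical parameter names and gradient values
grads = {
    "embed.weight": [0.01, -0.02],
    "head.weight": [0.5, float("nan")],
    "frozen.bias": None,
}
bad_params = non_finite_gradients(grads)
all_finite = not bad_params
```

Returning the offending names, rather than only a boolean, can make log messages more useful when tracking down where a blow-up starts.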

property should_rollback: bool

Whether the consecutive NaN count suggests a checkpoint rollback.

reset()

Reset NaN tracking state (e.g., after a rollback).

Return type:

None
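To show how check_loss, should_rollback, and reset might fit together in a training loop, here is a single-process sketch. `SketchNaNDetector` is a hypothetical stand-in written for this example, not the kempnerforge class. It assumes "exceeds" means strictly greater than max_consecutive and omits the distributed flag exchange.

```python
import math


class SketchNaNDetector:
    """Hypothetical single-process stand-in; not the kempnerforge class."""

    def __init__(self, action="warn", max_consecutive=5, max_history=100):
        self.action = action
        self.max_consecutive = max_consecutive
        self.max_history = max_history
        self.reset()

    def reset(self):
        self.consecutive_nans = 0
        self.nan_steps = []

    def check_loss(self, loss, step):
        if math.isfinite(loss):
            self.consecutive_nans = 0
            return True
        self.consecutive_nans += 1
        self.nan_steps.append(step)
        if len(self.nan_steps) > self.max_history:
            self.nan_steps.pop(0)  # drop the oldest recorded step
        if self.action == "raise":
            raise RuntimeError(f"non-finite loss at step {step}")
        return False

    @property
    def should_rollback(self):
        # Assumption: "exceeds" is read as strictly greater than
        return self.consecutive_nans > self.max_consecutive


detector = SketchNaNDetector(action="skip", max_consecutive=2)
losses = [2.0, float("nan"), float("nan"), float("nan"), 1.9]
applied_steps, rollbacks = [], 0

for step, loss in enumerate(losses):
    if detector.check_loss(loss, step):
        applied_steps.append(step)  # the optimizer step would run here
    elif detector.should_rollback:
        rollbacks += 1              # a checkpoint restore would go here
        detector.reset()
```

In this run the third consecutive NaN pushes the count past max_consecutive=2, so one rollback is recorded and tracking restarts from a clean state.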

kempnerforge.resilience.health.check_gpu_health(device=0)

Run basic GPU health checks.

Performs:

CUDA availability check
Small test computation on the device
Memory allocation test

Parameters:

device (int)

Returns:

Dict with health check results.

Return type:

dict[str, bool | str]
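The return type dict[str, bool | str] mixes pass/fail flags with descriptive strings. A caller might summarize such a result along these lines. The key names in the sample dict are invented for illustration, since this reference does not list the actual keys, and `failed_checks` is a hypothetical helper.

```python
def failed_checks(results):
    """Return the names of entries whose value is exactly False."""
    return [name for name, value in results.items() if value is False]


# Hypothetical result; the real key names are not given in this reference.
sample = {
    "cuda_available": True,
    "compute_ok": True,
    "memory_ok": False,
    "device_name": "example-gpu",
}
problems = failed_checks(sample)
```

Testing with `is False` rather than truthiness keeps string-valued entries, including empty strings, from being counted as failures.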

kempnerforge.resilience.health.check_nccl_health(timeout_sec=10.0)

Check NCCL communication health via a lightweight all-reduce.

The all-reduce runs with async_op=True so that work.wait(timeout=...) enforces the caller’s bound rather than falling back to the process-group default timeout (nccl_timeout_sec, 1800s). Without that, this function could block for up to 30 minutes on a single stuck peer, regardless of the timeout_sec argument.

Parameters:

timeout_sec (float) – Per-operation timeout for the collective. Returns False if the all-reduce does not complete within this budget.

Returns:

True on success, False on timeout, error, or world-size mismatch.

Return type:

bool