kempnerforge.resilience.health

GPU health monitoring and NaN detection.

Provides utilities for detecting common failures during training:
  • NaN/Inf in loss or gradients

  • GPU availability and basic health

  • NCCL liveness via lightweight collectives

Functions

check_gpu_health([device])

Run basic GPU health checks.

check_nccl_health([timeout_sec])

Check NCCL communication health via a lightweight all-reduce.

Classes

NaNDetector

Detects and tracks NaN/Inf values in loss and gradients.

NaNState

Tracks NaN/Inf occurrences across training steps.

class kempnerforge.resilience.health.NaNState[source]

Bases: object

Tracks NaN/Inf occurrences across training steps.

consecutive_nans: int = 0
total_nans: int = 0
last_good_loss: float = inf
last_good_step: int = 0
nan_steps: list[int]
__init__(consecutive_nans=0, total_nans=0, last_good_loss=inf, last_good_step=0, nan_steps=<factory>)
Parameters:
  • consecutive_nans (int)

  • total_nans (int)

  • last_good_loss (float)

  • last_good_step (int)

  • nan_steps (list[int])

Return type:

None
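The `nan_steps=<factory>` default in the signature above is how a dataclass field built with `default_factory` is usually rendered. The following is a minimal sketch under that assumption. `NaNStateSketch` is a hypothetical stand-in, not the kempnerforge class; it only shows why a factory is used so that each instance gets its own list.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class NaNStateSketch:
    """Hypothetical stand-in mirroring the documented NaNState fields."""

    consecutive_nans: int = 0
    total_nans: int = 0
    last_good_loss: float = math.inf
    last_good_step: int = 0
    # A factory gives every instance a fresh list; a shared mutable default
    # such as `= []` is rejected by dataclasses for exactly this reason.
    nan_steps: List[int] = field(default_factory=list)


first = NaNStateSketch()
second = NaNStateSketch()
first.nan_steps.append(17)  # only `first` records the NaN step
```

With a shared list, the append on `first` would also show up on `second`; with the factory, the two histories stay independent.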

class kempnerforge.resilience.health.NaNDetector[source]

Bases: object

Detects and tracks NaN/Inf values in loss and gradients.

Supports three responses to NaN:
  • "warn": Log a warning and continue.

  • "skip": Skip the optimizer step (zero gradients).

  • "raise": Raise a RuntimeError.

If the consecutive NaN count exceeds max_consecutive, the detector signals (via should_rollback) that a checkpoint rollback is recommended.

Parameters:
  • action – What to do when NaN is detected.

  • max_consecutive – Consecutive NaN steps before recommending rollback.

  • max_history – Number of NaN step indices to retain.

__init__(action='warn', max_consecutive=5, max_history=100)[source]
Parameters:
  • action (str)

  • max_consecutive (int)

  • max_history (int)

Return type:

None
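To illustrate how the three constructor arguments interact in a training loop, here is a single-process sketch. `MiniNaNDetector` is a hypothetical stand-in written for this example and is not the kempnerforge implementation: it omits logging and the distributed agreement step, and the exact comparison against `max_consecutive` (here `>=`) is an assumption.

```python
import math
from typing import List


class MiniNaNDetector:
    """Hypothetical single-process stand-in for the documented detector."""

    def __init__(self, action: str = "warn", max_consecutive: int = 5,
                 max_history: int = 100) -> None:
        if action not in ("warn", "skip", "raise"):
            raise ValueError(f"unknown action: {action!r}")
        self.action = action
        self.max_consecutive = max_consecutive
        self.max_history = max_history
        self.consecutive_nans = 0
        self.nan_steps: List[int] = []

    def check_loss(self, loss: float, step: int) -> bool:
        """Return True for a finite loss, False for NaN/Inf."""
        if math.isfinite(loss):
            self.consecutive_nans = 0
            return True
        self.consecutive_nans += 1
        self.nan_steps.append(step)
        # Retain only the most recent max_history step indices.
        excess = len(self.nan_steps) - self.max_history
        if excess > 0:
            del self.nan_steps[:excess]
        if self.action == "raise":
            raise RuntimeError(f"non-finite loss at step {step}")
        return False

    @property
    def should_rollback(self) -> bool:
        # Assumption: recommend rollback once the streak reaches the limit.
        return self.consecutive_nans >= self.max_consecutive


detector = MiniNaNDetector(action="skip", max_consecutive=2)
losses = [0.9, float("nan"), float("inf"), 0.7]
applied_steps = []   # steps where the optimizer update would run
rollback_flags = []  # should_rollback after each step
for step, loss in enumerate(losses):
    if detector.check_loss(loss, step):
        applied_steps.append(step)  # stands in for optimizer.step()
    rollback_flags.append(detector.should_rollback)
```

In this run the update is applied only at steps 0 and 3, and the rollback flag is raised after the second consecutive bad step, then cleared by the next finite loss.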

check_loss(loss, step)[source]

Check a loss value for NaN/Inf.

When running distributed, check_loss all-reduces a NaN flag so that all ranks agree on whether to skip. This prevents rank desync, where one rank sees NaN and skips its optimizer step while the others proceed normally.

Parameters:
  • loss (float) – The scalar loss value to check.

  • step (int) – Current training step.

Returns:

True if the loss is valid (finite) on ALL ranks, False if any rank has NaN/Inf.

Raises:

RuntimeError – If action is "raise" and NaN/Inf is detected.

Return type:

bool
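The cross-rank agreement described for check_loss can be sketched without a process group by simulating the reduction. In the example below, `loss_is_valid_everywhere` and `simulate_ranks` are hypothetical helpers, and the reduction is a plain `max` over per-rank flags. In a real job this role would be played by a collective such as `torch.distributed.all_reduce` with a MAX reduction; which reduction kempnerforge actually uses is not stated in this reference.

```python
import math
from typing import Callable, List


def loss_is_valid_everywhere(local_loss: float,
                             all_reduce_max: Callable[[int], int]) -> bool:
    """Return True only if no rank reported a non-finite loss."""
    local_flag = 0 if math.isfinite(local_loss) else 1
    # Every rank contributes its flag and receives the same reduced value.
    global_flag = all_reduce_max(local_flag)
    return global_flag == 0


def simulate_ranks(losses: List[float]) -> List[bool]:
    """Simulate one check per rank, with max() standing in for the collective."""
    flags = [0 if math.isfinite(x) else 1 for x in losses]
    reduced = max(flags)  # what a MAX all-reduce would deliver to every rank
    return [loss_is_valid_everywhere(x, lambda _flag: reduced) for x in losses]


healthy = simulate_ranks([0.52, 0.49, 0.51])
one_bad_rank = simulate_ranks([0.52, float("nan"), 0.51])
```

Because every rank receives the same reduced flag, all ranks return the same verdict: a single bad rank makes the whole step invalid, so no rank applies an update the others skip.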

check_gradients(model, step)[source]

Check model gradients for NaN/Inf before optimizer step.

Parameters:
  • model – Model whose parameter gradients are checked.

  • step – Current training step.

Returns:

True if all gradients are finite.

Return type:

bool
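The finiteness test behind a gradient check can be illustrated in plain Python. This sketch assumes gradients are given as a mapping from parameter name to a flat sequence of floats (a stand-in for tensors), with `None` for parameters that have no gradient. `first_bad_gradient` and `gradients_are_finite` are hypothetical helpers, not part of kempnerforge.

```python
import math
from typing import Dict, Optional, Sequence


def first_bad_gradient(
    grads: Dict[str, Optional[Sequence[float]]]
) -> Optional[str]:
    """Return the name of the first parameter with a NaN/Inf gradient."""
    for name, values in grads.items():
        if values is None:
            continue  # parameter received no gradient this step
        if not all(math.isfinite(v) for v in values):
            return name
    return None


def gradients_are_finite(grads: Dict[str, Optional[Sequence[float]]]) -> bool:
    """True if every present gradient contains only finite values."""
    return first_bad_gradient(grads) is None


clean = {"linear.weight": [0.1, -0.2], "linear.bias": None}
broken = {"linear.weight": [0.1, float("nan")], "linear.bias": [0.0]}
```

Reporting the offending parameter name, rather than only a boolean, tends to make the resulting log message more useful when tracking down where the overflow started.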

property should_rollback: bool

Whether consecutive NaN count suggests a checkpoint rollback.

reset()[source]

Reset NaN tracking state (e.g., after a rollback).

Return type:

None

kempnerforge.resilience.health.check_gpu_health(device=0)[source]

Run basic GPU health checks.

Performs:
  1. CUDA availability check

  2. Small test computation on the device

  3. Memory allocation test

Parameters:

device (int) – Index of the CUDA device to check.

Returns:

Dict with health check results.

Return type:

dict[str, bool | str]
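The three checks listed above could look roughly like the following sketch. `gpu_health_sketch` and its result keys (`cuda_available`, `compute_ok`, `memory_ok`, `error`) are hypothetical; the keys actually returned by check_gpu_health are not documented here. The torch import is guarded so that the function still returns a report on machines without PyTorch or without a GPU.

```python
from typing import Dict, Union


def gpu_health_sketch(device: int = 0) -> Dict[str, Union[bool, str]]:
    """Hypothetical health report; key names are illustrative only."""
    report: Dict[str, Union[bool, str]] = {
        "cuda_available": False,
        "compute_ok": False,
        "memory_ok": False,
        "error": "",
    }
    try:
        import torch
    except (ImportError, OSError):
        report["error"] = "torch could not be imported"
        return report

    # 1. CUDA availability check
    if not torch.cuda.is_available():
        report["error"] = "CUDA is not available"
        return report
    report["cuda_available"] = True

    try:
        dev = torch.device("cuda", device)
        # 2. Small test computation on the device
        x = torch.ones(64, 64, device=dev)
        y = x @ x
        torch.cuda.synchronize(dev)
        report["compute_ok"] = bool(torch.isfinite(y).all().item())
        # 3. Memory allocation test (1 MiB buffer)
        buf = torch.empty(1024 * 1024, dtype=torch.uint8, device=dev)
        report["memory_ok"] = buf.numel() == 1024 * 1024
        del buf
    except Exception as exc:  # report the failure instead of crashing the job
        report["error"] = str(exc)
    return report


report = gpu_health_sketch(0)
```

Returning a dictionary instead of raising lets a launcher log the full picture for a node before deciding whether to exclude it.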

kempnerforge.resilience.health.check_nccl_health(timeout_sec=10.0)[source]

Check NCCL communication health via a lightweight all-reduce.

Parameters:

timeout_sec (float) – Timeout for the collective operation.

Returns:

True if the all-reduce succeeded, False on timeout or error.

Return type:

bool
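The "False on timeout or error" contract can be illustrated with a generic timeout wrapper. This is a sketch of one possible approach, not necessarily how check_nccl_health is implemented: `run_with_timeout` is a hypothetical helper that runs a blocking callable (standing in for the all-reduce) in a daemon thread and gives up after `timeout_sec`.

```python
import threading
import time
from typing import Callable


def run_with_timeout(fn: Callable[[], None], timeout_sec: float = 10.0) -> bool:
    """Return True if fn() finishes without raising within timeout_sec."""
    outcome = {"ok": False}

    def worker() -> None:
        try:
            fn()
            outcome["ok"] = True
        except Exception:
            outcome["ok"] = False  # any error counts as unhealthy

    # A daemon thread does not keep the process alive if the call hangs.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_sec)
    if thread.is_alive():
        return False  # the call is still blocked: treat as a timeout
    return outcome["ok"]


def fast_collective() -> None:
    """Stand-in for an all-reduce that completes immediately."""


def hung_collective() -> None:
    """Stand-in for an all-reduce that is stuck on a dead peer."""
    time.sleep(0.5)


def failing_collective() -> None:
    """Stand-in for an all-reduce that raises a communication error."""
    raise RuntimeError("simulated communication failure")


healthy = run_with_timeout(fast_collective, timeout_sec=1.0)
timed_out = run_with_timeout(hung_collective, timeout_sec=0.05)
errored = run_with_timeout(failing_collective, timeout_sec=1.0)
```

One limitation of this pattern is that a timed-out call keeps running in the background; the wrapper only stops waiting for it, so a caller that sees `False` should generally treat the communicator as unusable.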