GPU health¶
check_gpu_health runs a short smoke test against a CUDA device. The training loop does not call it automatically; it is an opt-in diagnostic, most useful at job startup or after a suspected hardware fault.
What it checks¶
# kempnerforge/resilience/health.py — check_gpu_health
result = {
    "cuda_available": torch.cuda.is_available(),
    "device_accessible": False,
    "compute_ok": False,
    "memory_ok": False,
    "error": "",
}
Four booleans + an error string. Each test must pass before the next runs:
cuda_available: torch.cuda.is_available(). If this is False, the rest short-circuits.
device_accessible: torch.cuda.set_device(device). Catches stale CUDA contexts or permission errors that let cuda_available pass but block actual device use.
compute_ok: x = torch.ones(16); y = x + x; assert y.sum() == 32. A tiny elementwise op plus a reduction. Catches fused-kernel or launcher failures that look fine to set_device but crash on the first op.
memory_ok: allocate a 1 MB buffer (torch.empty(256 * 1024, dtype=torch.float32) on the device) and free it. Catches the case where the GPU is reachable and can launch kernels but OOMs on any new allocation (usually stale allocator state).
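The sequence, as a minimal sketch. This is illustrative, not the library's source: only the result keys and the four checks come from the description above, everything else (the function name, exact tensor handling) is an assumption.

import torch

def check_gpu_health_sketch(device: int = 0) -> dict:
    """Illustrative version of the four short-circuiting checks."""
    result = {
        "cuda_available": torch.cuda.is_available(),
        "device_accessible": False,
        "compute_ok": False,
        "memory_ok": False,
        "error": "",
    }
    if not result["cuda_available"]:
        result["error"] = "CUDA not available"
        return result
    try:
        dev = torch.device("cuda", device)
        # 1. Can we bind to the device at all?
        torch.cuda.set_device(dev)
        result["device_accessible"] = True
        # 2. Can we launch a kernel and read the result back?
        x = torch.ones(16, device=dev)
        y = x + x
        assert y.sum().item() == 32
        result["compute_ok"] = True
        # 3. Can the allocator hand out new memory? (1 MB of float32)
        buf = torch.empty(256 * 1024, dtype=torch.float32, device=dev)
        del buf
        result["memory_ok"] = True
    except (RuntimeError, AssertionError) as exc:
        result["error"] = str(exc)
    return result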
Usage¶
from kempnerforge.resilience import check_gpu_health

health = check_gpu_health(device=0)
if not (health["cuda_available"] and health["compute_ok"] and health["memory_ok"]):
    raise RuntimeError(f"GPU unhealthy: {health['error']}")
Or as a pre-flight check before long runs:
# At job start, before init_distributed
import logging
import torch

logger = logging.getLogger(__name__)

for device in range(torch.cuda.device_count()):
    h = check_gpu_health(device)
    if h["error"]:
        logger.error(f"Device {device}: {h['error']}")
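If the job should hard-fail rather than just log, a thin wrapper can collect every failure and raise once before init_distributed. This helper is illustrative, not part of the library:

def preflight_or_die() -> None:
    # Check every visible device and raise with all failures at once.
    failures = []
    for device in range(torch.cuda.device_count()):
        h = check_gpu_health(device)
        if h["error"]:
            failures.append(f"device {device}: {h['error']}")
    if failures:
        raise RuntimeError("GPU pre-flight failed: " + "; ".join(failures))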
When to use it¶
After a hardware-suspected crash. If a run died with a CUDA error, re-check the device before resuming (see the sketch after this list).
On cluster nodes you don’t own. Mixed-tenant clusters sometimes leave GPUs in a partially wedged state; this surfaces the problem before your training run burns compute on it.
Before expensive data loading. Cheaper to fail at step 0 than 30 minutes into HF dataset streaming.
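For the first case, a resume guard might look like the following. trainer.run, local_rank, and the retry structure are placeholders, not kempnerforge API; the only real call is check_gpu_health:

try:
    trainer.run()                                # placeholder training entry point
except RuntimeError as exc:
    if "CUDA" in str(exc):
        h = check_gpu_health(device=local_rank)  # local_rank: this process's GPU index (placeholder)
        if h["error"]:
            raise RuntimeError(
                f"GPU unhealthy after CUDA crash, not resuming: {h['error']}"
            ) from exc
    raise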
When it won’t help¶
Transient NCCL failures. check_gpu_health runs local ops only, no collectives. For distributed liveness see NCCL liveness.
Slow / degraded GPUs. The test passes if the GPU works, not if it’s running at spec. For throughput regressions use the profiler or watch step_time_sec in the metrics.
Memory fragmentation under real load. A 1 MB allocation doesn’t trigger fragmentation; full training allocations might. Use memory snapshots for that.
Return shape¶
{
    "cuda_available": True,
    "device_accessible": True,
    "compute_ok": True,
    "memory_ok": True,
    "error": "",
}
error holds the string form of the RuntimeError / AssertionError raised by the failing test (later tests never run). cuda_available is a special case: if it is False, error is set to "CUDA not available" before returning. Otherwise error is non-empty only if a test actually raised.
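Because the checks short-circuit, the first False flag tells you which stage failed. A small helper (illustrative, not library API) can turn that into a readable message:

def failing_stage(health: dict) -> str | None:
    # Return the name of the first check that failed, or None if all passed.
    for key in ("cuda_available", "device_accessible", "compute_ok", "memory_ok"):
        if not health[key]:
            return key
    return None

stage = failing_stage(check_gpu_health(device=0))
if stage is not None:
    print(f"GPU health check failed at {stage}")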
See also¶
NCCL liveness — the distributed-side counterpart; covers “GPUs are alive but not talking to each other”.
NaN detection — model-level failures rather than device-level.
Memory monitor — runtime memory tracking; complementary to the 1 MB allocation test here.