kempnerforge.metrics.memory

GPU memory tracking and reporting.

Provides utilities for monitoring GPU memory usage during training:
  • Current / peak / reserved memory

  • Memory utilization as a percentage of total

  • Human-readable formatting

Functions

format_memory_stats([device]) – Format memory stats as a human-readable string.

get_memory_stats([device]) – Get current GPU memory statistics in GB.

get_memory_utilization([device]) – Get peak memory utilization as a fraction of total GPU memory.

reset_peak_memory([device]) – Reset peak memory tracking counter.

Classes

DeviceMemoryMonitor – Tracks GPU memory usage across training steps.

kempnerforge.metrics.memory.get_memory_stats(device=0)[source]

Get current GPU memory statistics in GB.

Parameters:

device (int) – CUDA device index.

Returns:

Dict with allocated, peak, reserved, and total memory in GB.

Return type:

dict[str, float]

kempnerforge.metrics.memory.get_memory_utilization(device=0)[source]

Get peak memory utilization as a fraction of total GPU memory.

Parameters:

device (int) – CUDA device index.

Returns:

Utilization between 0.0 and 1.0.

Return type:

float
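
In terms of the stats dict above, the utilization is presumably peak GB divided by total GB; the clamp into [0.0, 1.0] shown here is an assumption suggested by the documented return range:

```python
# Hedged sketch: peak / total, clamped to the documented [0.0, 1.0] range.

def utilization_from_stats(stats: dict[str, float]) -> float:
    """Compute peak/total from a stats dict shaped like get_memory_stats's output."""
    frac = stats["peak"] / stats["total"]
    return min(max(frac, 0.0), 1.0)

print(utilization_from_stats({"peak": 30.0, "total": 40.0}))  # 0.75
```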

kempnerforge.metrics.memory.format_memory_stats(device=0)[source]

Format memory stats as a human-readable string.

Parameters:

device (int) – CUDA device index.

Return type:

str
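
The exact string layout produced by format_memory_stats is not specified here, so the formatter below is only an illustrative assumption over the documented stats fields:

```python
# Hypothetical formatter over the documented stats dict; the actual layout
# of format_memory_stats's output may differ.

def format_stats_sketch(stats: dict[str, float]) -> str:
    return (
        f"allocated={stats['allocated']:.2f}GB "
        f"peak={stats['peak']:.2f}GB "
        f"reserved={stats['reserved']:.2f}GB "
        f"total={stats['total']:.2f}GB"
    )
```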

kempnerforge.metrics.memory.reset_peak_memory(device=0)[source]

Reset peak memory tracking counter.

Parameters:

device (int) – CUDA device index.

Return type:

None

class kempnerforge.metrics.memory.DeviceMemoryMonitor[source]

Bases: object

Tracks GPU memory usage across training steps.

Resets peak memory stats at each reporting interval so that the peak reflects per-interval usage rather than the all-time peak.

Supports capturing a memory snapshot at a configurable step, for debugging OOMs and memory fragmentation with https://pytorch.org/memory_viz.

Parameters:
  • device – CUDA device index.

  • snapshot_step – Step at which to capture a memory snapshot. None to disable.

  • snapshot_dir – Directory to save snapshots.

__init__(device=0, snapshot_step=None, snapshot_dir='memory_snapshots')[source]
Parameters:
  • device (int)

  • snapshot_step (int | None)

  • snapshot_dir (str)

Return type:

None

report(step)[source]

Report memory stats for the current interval and reset peak.

Parameters:

step (int) – Current training step.

Returns:

Dict with memory stats.

Return type:

dict[str, float]
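
The per-interval peak semantics described above can be sketched in pure Python; here a plain counter stands in for the CUDA peak-allocated query so the behavior is runnable anywhere (the class name and method shapes are illustrative, not the real API):

```python
# Pure-Python sketch of report()'s documented behavior: return the
# interval's peak, then reset so the next interval starts fresh.

class IntervalPeakTracker:
    def __init__(self) -> None:
        self._peak = 0.0

    def observe(self, allocated_gb: float) -> None:
        self._peak = max(self._peak, allocated_gb)

    def report(self, step: int) -> dict[str, float]:
        stats = {"step": float(step), "peak": self._peak}
        self._peak = 0.0  # reset so the peak is per-interval, not all-time
        return stats

tracker = IntervalPeakTracker()
tracker.observe(3.0)
tracker.observe(5.0)
first = tracker.report(step=100)   # peak for this interval: 5.0
tracker.observe(2.0)
second = tracker.report(step=200)  # peak resets: 2.0, not 5.0
```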

capture_snapshot(step)[source]

Capture a CUDA memory snapshot and save as pickle.

The snapshot can be visualized at https://pytorch.org/memory_viz

Parameters:

step (int) – Current step (used in filename).

Returns:

Path to the saved snapshot, or None if capture failed.

Return type:

str | None