kempnerforge.metrics.memory

GPU memory tracking and reporting.

Provides utilities for monitoring GPU memory usage during training:
  • Current / peak / reserved memory

  • Memory utilization as a percentage of total

  • Human-readable formatting

Functions

format_memory_stats([device]) – Format memory stats as a human-readable string.

get_memory_stats([device]) – Get current GPU memory statistics in GB.

get_memory_utilization([device]) – Get peak memory utilization as a fraction of total GPU memory.

reset_peak_memory([device]) – Reset peak memory tracking counter.

Classes

DeviceMemoryMonitor – Tracks GPU memory usage across training steps.

kempnerforge.metrics.memory.get_memory_stats(device=0)[source]

Get current GPU memory statistics in GB.

Parameters:

device (int) – CUDA device index.

Returns:

Dict with allocated, peak, reserved, and total memory in GB.

Return type:

dict[str, float]

kempnerforge.metrics.memory.get_memory_utilization(device=0)[source]

Get peak memory utilization as a fraction of total GPU memory.

Parameters:

device (int) – CUDA device index.

Returns:

Utilization between 0.0 and 1.0.

Return type:

float
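
In terms of the stats dict above, the utilization is presumably peak GB divided by total GB; the clamp into [0.0, 1.0] shown here is an assumption suggested by the documented return range:

```python
# Hedged sketch: peak / total, clamped to the documented [0.0, 1.0] range.

def utilization_from_stats(stats: dict[str, float]) -> float:
    """Compute peak/total from a stats dict shaped like get_memory_stats's output."""
    frac = stats["peak"] / stats["total"]
    return min(max(frac, 0.0), 1.0)

print(utilization_from_stats({"peak": 30.0, "total": 40.0}))  # 0.75
```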

kempnerforge.metrics.memory.format_memory_stats(device=0)[source]

Format memory stats as a human-readable string.

Parameters:

device (int) – CUDA device index.

Return type:

str
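
The exact string layout produced by format_memory_stats is not specified here, so the formatter below is only an illustrative assumption over the documented stats fields:

```python
# Hypothetical formatter over the documented stats dict; the actual layout
# of format_memory_stats's output may differ.

def format_stats_sketch(stats: dict[str, float]) -> str:
    return (
        f"allocated={stats['allocated']:.2f}GB "
        f"peak={stats['peak']:.2f}GB "
        f"reserved={stats['reserved']:.2f}GB "
        f"total={stats['total']:.2f}GB"
    )
```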

kempnerforge.metrics.memory.reset_peak_memory(device=0)[source]

Reset peak memory tracking counter.

Parameters:

device (int) – CUDA device index.

Return type:

None

class kempnerforge.metrics.memory.DeviceMemoryMonitor[source]

Bases: object

Tracks GPU memory usage across training steps.

Resets peak memory stats at each reporting interval so that the peak reflects per-interval usage rather than the all-time peak.

Supports capturing a memory snapshot at a configurable step, for debugging OOMs and memory fragmentation with https://pytorch.org/memory_viz.

Parameters:
  • device – CUDA device index.

  • snapshot_step – Step at which to capture a memory snapshot. None to disable.

  • snapshot_dir – Directory to save snapshots.

__init__(device=0, snapshot_step=None, snapshot_dir='memory_snapshots')[source]
Parameters:
  • device (int)

  • snapshot_step (int | None)

  • snapshot_dir (str)

Return type:

None

report(step)[source]

Report memory stats for the current interval and reset peak.

Parameters:

step (int) – Current training step.

Returns:

Dict with memory stats.

Return type:

dict[str, float]
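
The per-interval peak semantics described above can be sketched in pure Python; here a plain counter stands in for the CUDA peak-allocated query so the behavior is runnable anywhere (the class name and method shapes are illustrative, not the real API):

```python
# Pure-Python sketch of report()'s documented behavior: return the
# interval's peak, then reset so the next interval starts fresh.

class IntervalPeakTracker:
    def __init__(self) -> None:
        self._peak = 0.0

    def observe(self, allocated_gb: float) -> None:
        self._peak = max(self._peak, allocated_gb)

    def report(self, step: int) -> dict[str, float]:
        stats = {"step": float(step), "peak": self._peak}
        self._peak = 0.0  # reset so the peak is per-interval, not all-time
        return stats

tracker = IntervalPeakTracker()
tracker.observe(3.0)
tracker.observe(5.0)
first = tracker.report(step=100)   # peak for this interval: 5.0
tracker.observe(2.0)
second = tracker.report(step=200)  # peak resets: 2.0, not 5.0
```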

capture_snapshot(step)[source]

Capture a CUDA memory snapshot and save as pickle.

The snapshot can be visualized at https://pytorch.org/memory_viz

Parameters:

step (int) – Current step (used in filename).

Returns:

Path to the saved snapshot, or None if capture failed.

Return type:

str | None