kempnerforge.metrics¶
Metrics, MFU computation, memory tracking, and logging for KempnerForge.
- class kempnerforge.metrics.DeviceMemoryMonitor[source]¶
Bases: object
Tracks GPU memory usage across training steps.
Resets peak memory stats at each reporting interval so that the peak reflects per-interval usage rather than all-time peak.
Supports memory snapshot capture at a configurable step for debugging OOM and memory fragmentation with pytorch.org/memory_viz.
- Parameters:
device – CUDA device index.
snapshot_step – Step at which to capture a memory snapshot. None to disable.
snapshot_dir – Directory to save snapshots.
- capture_snapshot(step)[source]¶
Capture a CUDA memory snapshot and save as pickle.
The snapshot can be visualized at https://pytorch.org/memory_viz
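A minimal usage sketch, assuming the constructor keyword names match the parameters listed above and that capture_snapshot is called explicitly from the training loop (the step value 500 and the loop body are placeholders):

```python
from kempnerforge.metrics import DeviceMemoryMonitor

monitor = DeviceMemoryMonitor(device=0, snapshot_step=500, snapshot_dir="mem_snapshots")

for step in range(1000):
    ...  # one training step (placeholder)
    if step == 500:
        # Writes a pickle under snapshot_dir, viewable at https://pytorch.org/memory_viz
        monitor.capture_snapshot(step)
```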
- class kempnerforge.metrics.MetricsTracker[source]¶
Bases: object
Collects, smooths, and reports training metrics.
Timing is handled internally: call start_step() before and end_step() after each training step. Metrics are logged to all configured backends at the configured interval.
- Parameters:
config – Full job config (used for MFU calculation and backend selection).
num_gpus – Number of GPUs for MFU denominator.
gpu_peak_tflops – Per-GPU peak TFLOPS. If None, auto-detected.
- end_step(step, loss, grad_norm, lr, tokens_in_step)[source]¶
Mark the end of a training step and optionally log metrics.
- Parameters:
step – Current training step number.
loss – Training loss for this step.
grad_norm – Gradient norm for this step.
lr – Current learning rate.
tokens_in_step – Number of tokens processed in this step.
- Returns:
StepMetrics if this step was a logging step, None otherwise.
- Return type:
StepMetrics | None
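A usage sketch assuming the constructor accepts the documented config and num_gpus arguments; the loop values are placeholders for quantities produced by an actual training step:

```python
from kempnerforge.metrics import MetricsTracker

tracker = MetricsTracker(config, num_gpus=8)  # gpu_peak_tflops auto-detected when omitted

for step in range(1000):
    tracker.start_step()
    # Placeholder values; in practice these come from the training step.
    loss, grad_norm, lr, n_tokens = 2.3, 1.0, 3e-4, 524_288
    metrics = tracker.end_step(step, loss, grad_norm, lr, tokens_in_step=n_tokens)
    if metrics is not None:  # non-None only on logging steps
        print(metrics.loss, metrics.tokens_per_sec, metrics.mfu)
```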
- class kempnerforge.metrics.StepMetrics[source]¶
Bases: object
Metrics for a single training step.
- __init__(loss=0.0, grad_norm=0.0, lr=0.0, tokens_per_sec=0.0, mfu=0.0, step_time_sec=0.0, allocated_gb=0.0, peak_gb=0.0, reserved_gb=0.0, total_gb=0.0, mem_utilization=0.0)¶
- kempnerforge.metrics.compute_mfu(config, tokens_per_sec, num_gpus=1, gpu_peak_tflops=None, seq_len=None)[source]¶
Compute Model FLOPs Utilization.
- Parameters:
config (ModelConfig) – Model configuration.
tokens_per_sec (float) – Global throughput (tokens/sec across all GPUs).
num_gpus (int) – Number of GPUs.
gpu_peak_tflops (float | None) – Peak bf16 TFLOPS per GPU. Auto-detected if None.
seq_len (int | None) – Actual training sequence length for attention FLOPS. Falls back to config.max_seq_len if not provided.
- Returns:
MFU as a fraction (0.0 to 1.0).
- Return type:
float
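The returned fraction follows the usual MFU arithmetic: achieved model FLOPS per second divided by the hardware's theoretical peak. A sketch of that calculation (illustrative only, not necessarily the library's exact implementation):

```python
def mfu_sketch(tokens_per_sec, flops_per_token, num_gpus, gpu_peak_tflops):
    achieved = tokens_per_sec * flops_per_token   # model FLOPS/sec actually sustained
    peak = num_gpus * gpu_peak_tflops * 1e12      # hardware peak FLOPS/sec
    return achieved / peak                        # fraction in [0, 1]

# Illustrative numbers: 25,000 tok/s at ~4.84e10 FLOPS/token on 8 A100s (312 TFLOPS each)
# -> 1.21e15 / 2.496e15 ≈ 0.48
```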
- kempnerforge.metrics.estimate_model_flops_per_token(config, seq_len=None)[source]¶
Estimate FLOPS per token for forward + backward pass.
Uses the PaLM paper approximation:
6*P + 12*L*D*S
For MoE: uses active params (top_k experts per layer, not all experts). Excludes embedding (table lookup, not matmul). Includes output projection. The 12*L*D*S attention term does not discount GQA: FlashAttention expands GQA internally, so the hardware performs full attention compute. Router FLOPS (dim × num_experts) are intentionally omitted as negligible.
- Parameters:
config (ModelConfig) – Model configuration.
seq_len (int | None) – Actual training sequence length. Falls back to config.max_seq_len if not provided.
- Returns:
Estimated FLOPS per token.
- Return type:
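A worked instance of the 6*P + 12*L*D*S approximation with illustrative numbers for a hypothetical 7B-class dense model (not tied to any particular config):

```python
P = 7.0e9   # non-embedding parameters
L = 32      # transformer layers
D = 4096    # model dimension
S = 4096    # training sequence length

dense_term = 6 * P           # forward + backward matmul FLOPS per token
attn_term = 12 * L * D * S   # quadratic attention term per token
flops_per_token = dense_term + attn_term
# 4.20e10 + ~6.4e9 ≈ 4.84e10 FLOPS per token
```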
- kempnerforge.metrics.format_memory_stats(device=0)[source]¶
Format memory stats as a human-readable string.
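A sketch of how such a string can be assembled from public torch.cuda APIs; the library's exact fields and formatting may differ:

```python
import torch

def memory_stats_sketch(device=0):
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib
    peak = torch.cuda.max_memory_allocated(device) / gib
    reserved = torch.cuda.memory_reserved(device) / gib
    total = torch.cuda.get_device_properties(device).total_memory / gib
    return f"alloc={allocated:.1f} | peak={peak:.1f} | reserved={reserved:.1f} | total={total:.1f} GiB"
```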
- kempnerforge.metrics.format_metrics(step, metrics)[source]¶
Format a metrics dict into a compact, color-coded log line.
- Example output:
[step 1000] loss=2.34 | lr=3.00e-04 | grad_norm=1.2 | tok/s=125k | mfu=52.3% | mem=71.2/80GB
- kempnerforge.metrics.get_gpu_peak_tflops(device=0)[source]¶
Auto-detect GPU peak bf16 TFLOPS.
Tries to match the GPU name against known models. Falls back to a conservative estimate based on compute capability.
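A sketch of the name-matching approach with illustrative bf16 dense peak values (A100 ≈ 312 TFLOPS, H100 SXM ≈ 989 TFLOPS); the library's actual table and compute-capability fallback may differ:

```python
import torch

_PEAK_BF16_TFLOPS = {"H100": 989.0, "A100": 312.0}  # illustrative values

def peak_tflops_sketch(device=0):
    name = torch.cuda.get_device_name(device)
    for key, tflops in _PEAK_BF16_TFLOPS.items():
        if key in name:
            return tflops
    return 100.0  # arbitrary conservative fallback for unrecognized GPUs
```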
- kempnerforge.metrics.get_logger(name, rank_zero_only=True)[source]¶
Get a logger for the given module name.
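A sketch of rank-zero-only logging built on the standard logging module, assuming the process rank is available via the RANK environment variable (the library may determine rank differently):

```python
import logging
import os

def get_logger_sketch(name, rank_zero_only=True):
    logger = logging.getLogger(name)
    if rank_zero_only and int(os.environ.get("RANK", "0")) != 0:
        # Silence non-zero ranks so each message appears once per job.
        logger.addHandler(logging.NullHandler())
        logger.propagate = False
    return logger
```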
- kempnerforge.metrics.get_memory_utilization(device=0)[source]¶
Get peak memory utilization as a fraction of total GPU memory.
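A sketch of the ratio involved, using public torch.cuda APIs (it is an assumption that peak allocated, rather than reserved, memory is used in the numerator):

```python
import torch

def memory_utilization_sketch(device=0):
    total = torch.cuda.get_device_properties(device).total_memory
    return torch.cuda.max_memory_allocated(device) / total  # fraction of device capacity
```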
- kempnerforge.metrics.reset_peak_memory(device=0)[source]¶
Reset peak memory tracking counter.
- Parameters:
device (int)
- Return type:
None
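The per-interval peak behavior described for DeviceMemoryMonitor depends on clearing PyTorch's peak counters; a minimal sketch of the underlying call:

```python
import torch

def reset_peak_memory_sketch(device=0):
    # Clears max_memory_allocated / max_memory_reserved so the next
    # reported peak reflects only the upcoming interval.
    torch.cuda.reset_peak_memory_stats(device)
```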