kempnerforge.metrics

Metrics, MFU computation, memory tracking, and logging for KempnerForge.

class kempnerforge.metrics.DeviceMemoryMonitor[source]

Bases: object

Tracks GPU memory usage across training steps.

Resets peak memory stats at each reporting interval so that the peak reflects per-interval usage rather than all-time peak.

Supports memory snapshot capture at a configurable step, for debugging out-of-memory errors and memory fragmentation; snapshots can be visualized at https://pytorch.org/memory_viz.

Parameters:
  • device – CUDA device index.

  • snapshot_step – Step at which to capture a memory snapshot. None to disable.

  • snapshot_dir – Directory to save snapshots.

__init__(device=0, snapshot_step=None, snapshot_dir='memory_snapshots')[source]
Parameters:
  • device (int)

  • snapshot_step (int | None)

  • snapshot_dir (str)

Return type:

None

report(step)[source]

Report memory stats for the current interval and reset peak.

Parameters:

step (int) – Current training step.

Returns:

Dict with memory stats.

Return type:

dict[str, float]

capture_snapshot(step)[source]

Capture a CUDA memory snapshot and save as pickle.

The snapshot can be visualized at https://pytorch.org/memory_viz

Parameters:

step (int) – Current step (used in filename).

Returns:

Path to the saved snapshot, or None if capture failed.

Return type:

str | None
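The per-interval peak semantics can be illustrated with a toy allocator model. This is a self-contained sketch, not the library's implementation (which reads and resets torch.cuda peak statistics):

```python
class IntervalPeak:
    """Toy model of per-interval peak tracking: report() returns the peak
    observed since the previous report, then resets it, mirroring how
    DeviceMemoryMonitor resets CUDA peak stats each reporting interval."""

    def __init__(self):
        self.current = 0.0  # currently "allocated" GB
        self.peak = 0.0     # peak within the current interval

    def allocate(self, gb):
        self.current += gb
        self.peak = max(self.peak, self.current)

    def free(self, gb):
        self.current -= gb

    def report(self):
        p = self.peak
        self.peak = self.current  # reset: the next interval starts fresh
        return p

m = IntervalPeak()
m.allocate(4.0); m.allocate(6.0); m.free(8.0)
first = m.report()    # 10.0: peak of this interval
m.allocate(1.0)
second = m.report()   # 3.0, not 10.0: the reset made the peak per-interval
```

Without the reset in report(), the second value would still be the all-time peak (10.0), hiding any later change in memory behavior.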

class kempnerforge.metrics.MetricsTracker[source]

Bases: object

Collects, smooths, and reports training metrics.

Timing is handled internally — call start_step() before and end_step() after each training step. Metrics are logged to all configured backends at the configured interval.

Parameters:
  • config – Full job config (used for MFU calculation and backend selection).

  • num_gpus – Number of GPUs for MFU denominator.

  • gpu_peak_tflops – Per-GPU peak TFLOPS. If None, auto-detected.

__init__(config, num_gpus=1, gpu_peak_tflops=None)[source]
Parameters:
  • config (JobConfig)

  • num_gpus (int)

  • gpu_peak_tflops (float | None)

Return type:

None

start_step()[source]

Mark the beginning of a training step.

Return type:

None

end_step(step, loss, grad_norm, lr, tokens_in_step)[source]

Mark the end of a training step and optionally log metrics.

Parameters:
  • step (int) – Current training step number.

  • loss (float) – Loss value for this step.

  • grad_norm (float) – Gradient norm (after clipping).

  • lr (float) – Current learning rate.

  • tokens_in_step (int) – Total tokens processed in this step (across all GPUs).

Returns:

StepMetrics if this step was a logging step, None otherwise.

Return type:

StepMetrics | None

log_eval(metrics, step)[source]

Log eval metrics to all backends and stdout.

Parameters:
Return type:

None

init_backends(config)[source]

Initialize logging backends (call after distributed setup).

Parameters:

config (JobConfig)

Return type:

None

close()[source]

Flush and close all logging backends.

Return type:

None
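The start_step()/end_step() contract can be shown with a minimal stand-in class (a hypothetical sketch; the real tracker also computes MFU, memory stats, and logs to backends):

```python
import time

class MiniTracker:
    """Stand-in mimicking MetricsTracker's timing contract: start_step()
    stamps a timer, end_step() measures the step and returns metrics only
    on logging steps (every `log_interval` steps), else None."""

    def __init__(self, log_interval=10):
        self.log_interval = log_interval
        self._t0 = None

    def start_step(self):
        self._t0 = time.perf_counter()

    def end_step(self, step, loss, tokens_in_step):
        dt = time.perf_counter() - self._t0
        if step % self.log_interval != 0:
            return None  # not a logging step
        return {"loss": loss, "step_time_sec": dt,
                "tokens_per_sec": tokens_in_step / dt}

tracker = MiniTracker(log_interval=2)
logged = []
for step in range(1, 5):
    tracker.start_step()
    # ... forward / backward / optimizer step would run here ...
    logged.append(tracker.end_step(step, loss=2.5, tokens_in_step=4096))
```

Here `logged` holds None for steps 1 and 3 and a metrics dict for steps 2 and 4, matching the StepMetrics | None return described above.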

class kempnerforge.metrics.StepMetrics[source]

Bases: object

Metrics for a single training step.

loss: float = 0.0
grad_norm: float = 0.0
lr: float = 0.0
tokens_per_sec: float = 0.0
mfu: float = 0.0
step_time_sec: float = 0.0
allocated_gb: float = 0.0
peak_gb: float = 0.0
reserved_gb: float = 0.0
total_gb: float = 0.0
mem_utilization: float = 0.0
__init__(loss=0.0, grad_norm=0.0, lr=0.0, tokens_per_sec=0.0, mfu=0.0, step_time_sec=0.0, allocated_gb=0.0, peak_gb=0.0, reserved_gb=0.0, total_gb=0.0, mem_utilization=0.0)
Parameters:
  • loss (float)

  • grad_norm (float)

  • lr (float)

  • tokens_per_sec (float)

  • mfu (float)

  • step_time_sec (float)

  • allocated_gb (float)

  • peak_gb (float)

  • reserved_gb (float)

  • total_gb (float)

  • mem_utilization (float)

Return type:

None

kempnerforge.metrics.compute_mfu(config, tokens_per_sec, num_gpus=1, gpu_peak_tflops=None, seq_len=None)[source]

Compute Model FLOPs Utilization.

Parameters:
  • config (ModelConfig) – Model configuration.

  • tokens_per_sec (float) – Global throughput (tokens/sec across all GPUs).

  • num_gpus (int) – Number of GPUs.

  • gpu_peak_tflops (float | None) – Peak bf16 TFLOPS per GPU. Auto-detected if None.

  • seq_len (int | None) – Actual training sequence length for attention FLOPs. Falls back to config.max_seq_len if not provided.

Returns:

MFU as a fraction (0.0 to 1.0).

Return type:

float
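The computation reduces to achieved model FLOP/s divided by aggregate hardware peak FLOP/s. A minimal sketch (the function and argument names below are illustrative, not the library API):

```python
def mfu_sketch(flops_per_token, tokens_per_sec, num_gpus, gpu_peak_tflops):
    # MFU = (model FLOPs actually executed per second) / (hardware peak FLOP/s)
    achieved = flops_per_token * tokens_per_sec
    peak = num_gpus * gpu_peak_tflops * 1e12  # TFLOPS -> FLOP/s
    return achieved / peak

# e.g. ~6e9 FLOPs/token at 125k tok/s on 8 GPUs with 312 TFLOPS bf16 peak each
mfu = mfu_sketch(6e9, 125_000, 8, 312.0)  # ≈ 0.30
```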

kempnerforge.metrics.estimate_model_flops_per_token(config, seq_len=None)[source]

Estimate FLOPs per token for the forward + backward pass.

Uses the PaLM paper approximation: 6*P + 12*L*D*S, where P is the (active, non-embedding) parameter count, L the number of layers, D the model dimension, and S the sequence length.

For MoE: uses active params (top_k experts per layer, not all experts). Excludes embedding (table lookup, not matmul). Includes output projection. The 12*L*D*S attention term does not discount GQA: FlashAttention expands GQA internally, so the hardware performs full attention compute. Router FLOPs (dim × num_experts) are negligible and intentionally omitted.

Parameters:
  • config (ModelConfig) – Model configuration.

  • seq_len (int | None) – Actual training sequence length. Falls back to config.max_seq_len if not provided.

Returns:

Estimated FLOPs per token.

Return type:

int
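A direct transcription of the approximation (for MoE models, num_params would be the active parameter count, as noted above; the names are illustrative):

```python
def palm_flops_per_token(num_params, num_layers, dim, seq_len):
    # 6*P: dense matmul FLOPs per token (2*P forward + 4*P backward)
    # 12*L*D*S: attention score/output FLOPs, which grow with sequence length
    return 6 * num_params + 12 * num_layers * dim * seq_len

# a small model: 1M params, 4 layers, dim 128, seq_len 1024
flops = palm_flops_per_token(1_000_000, 4, 128, 1024)  # 12_291_456
```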

kempnerforge.metrics.format_memory_stats(device=0)[source]

Format memory stats as a human-readable string.

Parameters:

device (int)

Return type:

str

kempnerforge.metrics.format_metrics(step, metrics)[source]

Format a metrics dict into a compact, color-coded log line.

Example output:

[step 1000] loss=2.34 | lr=3.00e-04 | grad_norm=1.2 | tok/s=125k | mfu=52.3% | mem=71.2/80GB

Parameters:
Return type:

str
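A plain (uncolored) reconstruction of that log line, assuming metric keys matching the StepMetrics field names; the library's version additionally applies the color codes:

```python
def format_line(step, m):
    """Build the compact log line shown above, without ANSI colors."""
    return (f"[step {step}] loss={m['loss']:.2f} | lr={m['lr']:.2e}"
            f" | grad_norm={m['grad_norm']:.1f}"
            f" | tok/s={m['tokens_per_sec'] / 1e3:.0f}k"
            f" | mfu={m['mfu'] * 100:.1f}%"
            f" | mem={m['peak_gb']:.1f}/{m['total_gb']:.0f}GB")

line = format_line(1000, {"loss": 2.34, "lr": 3e-4, "grad_norm": 1.2,
                          "tokens_per_sec": 125_000, "mfu": 0.523,
                          "peak_gb": 71.2, "total_gb": 80.0})
# "[step 1000] loss=2.34 | lr=3.00e-04 | grad_norm=1.2 | tok/s=125k | mfu=52.3% | mem=71.2/80GB"
```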

kempnerforge.metrics.get_gpu_peak_tflops(device=0)[source]

Auto-detect GPU peak bf16 TFLOPS.

Tries to match the GPU name against known models. Falls back to a conservative estimate based on compute capability.

Parameters:

device (int) – CUDA device index.

Returns:

Peak bf16 TFLOPS for this GPU.

Return type:

float
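The name-matching strategy can be sketched as a substring lookup over a small table. The peak values below are approximate dense bf16 tensor-core figures from NVIDIA datasheets; the library's actual table, values, and compute-capability fallback may differ:

```python
# Approximate dense bf16 tensor-core peaks (TFLOPS); values vary by SKU/clocks.
KNOWN_BF16_TFLOPS = {"H100": 989.0, "A100": 312.0}

def peak_tflops_from_name(gpu_name, fallback=100.0):
    # torch.cuda.get_device_name() returns strings like "NVIDIA A100-SXM4-80GB",
    # so a substring match against known model names is sufficient.
    for model, tflops in KNOWN_BF16_TFLOPS.items():
        if model in gpu_name:
            return tflops
    return fallback  # conservative estimate for unrecognized GPUs
```

An over-optimistic fallback would deflate reported MFU on unknown hardware, which is why a conservative default is preferable.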

kempnerforge.metrics.get_logger(name, rank_zero_only=True)[source]

Get a logger for the given module name.

Parameters:
  • name (str) – Logger name (typically __name__).

  • rank_zero_only (bool) – If True, only rank 0 emits logs.

Returns:

A configured logging.Logger instance.

Return type:

Logger
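One common way to implement rank-zero-only logging is a logging.Filter keyed on the RANK environment variable set by torchrun and other torch.distributed launchers. A sketch under that assumption (the library's mechanism may differ):

```python
import logging
import os

class RankZeroFilter(logging.Filter):
    """Suppress log records on non-zero ranks, as reported by the RANK
    environment variable (absent RANK is treated as rank 0)."""

    def filter(self, record):
        return int(os.environ.get("RANK", "0")) == 0

def make_rank_logger(name):
    # attach the filter so only rank 0 emits; other ranks stay silent
    logger = logging.getLogger(name)
    logger.addFilter(RankZeroFilter())
    return logger
```

Filtering at the logger level keeps call sites identical on every rank, so training code never needs explicit `if rank == 0:` guards around log statements.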

kempnerforge.metrics.get_memory_stats(device=0)[source]

Get current GPU memory statistics in GB.

Parameters:

device (int) – CUDA device index.

Returns:

Dict with allocated, peak, reserved, and total memory in GB.

Return type:

dict[str, float]

kempnerforge.metrics.get_memory_utilization(device=0)[source]

Get peak memory utilization as a fraction of total GPU memory.

Parameters:

device (int)

Returns:

Utilization between 0.0 and 1.0.

Return type:

float

kempnerforge.metrics.reset_peak_memory(device=0)[source]

Reset peak memory tracking counter.

Parameters:

device (int)

Return type:

None

Modules

logger

Rank-aware logging utilities for KempnerForge.

memory

GPU memory tracking and reporting.

mfu

Model FLOPs Utilization (MFU) computation.

tracker

Metrics collection, accumulation, and reporting.