kempnerforge.metrics¶
Metrics, MFU computation, memory tracking, and logging for KempnerForge.
- class kempnerforge.metrics.DeviceMemoryMonitor[source]¶
Bases: object
Tracks GPU memory usage across training steps.
Resets peak memory stats at each reporting interval so that the peak reflects per-interval usage rather than all-time peak.
Supports memory snapshot capture at a configurable step for debugging OOM and memory fragmentation with pytorch.org/memory_viz.
- Parameters:
device – CUDA device index.
snapshot_step – Step at which to capture a memory snapshot. None to disable.
snapshot_dir – Directory to save snapshots.
- capture_snapshot(step)[source]¶
Capture a CUDA memory snapshot and save as pickle.
The snapshot can be visualized at https://pytorch.org/memory_viz
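A minimal usage sketch, assuming the constructor keyword names match the parameters listed above and that capture_snapshot is called explicitly from the training loop (the step value 500 and the loop body are placeholders):

```python
from kempnerforge.metrics import DeviceMemoryMonitor

monitor = DeviceMemoryMonitor(device=0, snapshot_step=500, snapshot_dir="mem_snapshots")

for step in range(1000):
    ...  # one training step (placeholder)
    if step == 500:
        # Writes a pickle under snapshot_dir, viewable at https://pytorch.org/memory_viz
        monitor.capture_snapshot(step)
```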
- class kempnerforge.metrics.MetricsTracker[source]¶
Bases: object
Collects, smooths, and reports training metrics.
Timing is handled internally: call start_step() before and end_step() after each training step. Metrics are logged to all configured backends at the configured interval.
- Parameters:
config – Full job config (used for MFU calculation and backend selection).
num_gpus – Number of GPUs for MFU denominator.
gpu_peak_tflops – Per-GPU peak TFLOPS. If None, auto-detected.
- end_step(step, loss, grad_norm, lr, tokens_in_step)[source]¶
Mark the end of a training step and optionally log metrics.
- Parameters:
step – Current training step number.
loss – Training loss for this step.
grad_norm – Gradient norm for this step.
lr – Current learning rate.
tokens_in_step – Number of tokens processed in this step.
- Returns:
StepMetrics if this step was a logging step, None otherwise.
- Return type:
StepMetrics | None
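A usage sketch assuming the constructor accepts the documented config and num_gpus arguments; the loop values are placeholders for quantities produced by an actual training step:

```python
from kempnerforge.metrics import MetricsTracker

tracker = MetricsTracker(config, num_gpus=8)  # gpu_peak_tflops auto-detected when omitted

for step in range(1000):
    tracker.start_step()
    # Placeholder values; in practice these come from the training step.
    loss, grad_norm, lr, n_tokens = 2.3, 1.0, 3e-4, 524_288
    metrics = tracker.end_step(step, loss, grad_norm, lr, tokens_in_step=n_tokens)
    if metrics is not None:  # non-None only on logging steps
        print(metrics.loss, metrics.tokens_per_sec, metrics.mfu)
```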
- class kempnerforge.metrics.StepMetrics[source]¶
Bases: object
Metrics for a single training step.
- __init__(loss=0.0, grad_norm=0.0, lr=0.0, tokens_per_sec=0.0, mfu=0.0, step_time_sec=0.0, allocated_gb=0.0, peak_gb=0.0, reserved_gb=0.0, total_gb=0.0, mem_utilization=0.0)¶
- kempnerforge.metrics.compute_mfu(config, tokens_per_sec, num_gpus=1, gpu_peak_tflops=None, seq_len=None)[source]¶
Compute Model FLOPs Utilization.
- Parameters:
config (ModelConfig) – Model configuration.
tokens_per_sec (float) – Global throughput (tokens/sec across all GPUs).
num_gpus (int) – Number of GPUs.
gpu_peak_tflops (float | None) – Peak bf16 TFLOPS per GPU. Auto-detected if None.
seq_len (int | None) – Actual training sequence length for attention FLOPS. Falls back to config.max_seq_len if not provided.
- Returns:
MFU as a fraction (0.0 to 1.0).
- Return type:
float
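The returned fraction follows the usual MFU arithmetic: achieved model FLOPS per second divided by the hardware's theoretical peak. A sketch of that calculation (illustrative only, not necessarily the library's exact implementation):

```python
def mfu_sketch(tokens_per_sec, flops_per_token, num_gpus, gpu_peak_tflops):
    achieved = tokens_per_sec * flops_per_token   # model FLOPS/sec actually sustained
    peak = num_gpus * gpu_peak_tflops * 1e12      # hardware peak FLOPS/sec
    return achieved / peak                        # fraction in [0, 1]

# Illustrative numbers: 25,000 tok/s at ~4.84e10 FLOPS/token on 8 A100s (312 TFLOPS each)
# -> 1.21e15 / 2.496e15 ≈ 0.48
```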
- kempnerforge.metrics.estimate_model_flops_per_token(config, seq_len=None)[source]¶
Estimate FLOPS per token for forward + backward pass.
Uses the PaLM paper approximation:
6*P + 12*L*D*S
For MoE: uses active params (top_k experts per layer, not all experts). Excludes embedding (table lookup, not matmul). Includes output projection. The 12*L*D*S attention term does not discount GQA: FlashAttention expands GQA internally, so the hardware performs full attention compute. Router FLOPS (dim × num_experts) are intentionally omitted as negligible.
- Parameters:
config (ModelConfig) – Model configuration.
seq_len (int | None) – Actual training sequence length. Falls back to config.max_seq_len if not provided.
- Returns:
Estimated FLOPS per token.
- Return type:
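A worked instance of the 6*P + 12*L*D*S approximation with illustrative numbers for a hypothetical 7B-class dense model (not tied to any particular config):

```python
P = 7.0e9   # non-embedding parameters
L = 32      # transformer layers
D = 4096    # model dimension
S = 4096    # training sequence length

dense_term = 6 * P           # forward + backward matmul FLOPS per token
attn_term = 12 * L * D * S   # quadratic attention term per token
flops_per_token = dense_term + attn_term
# 4.20e10 + ~6.4e9 ≈ 4.84e10 FLOPS per token
```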
- kempnerforge.metrics.format_memory_stats(device=0)[source]¶
Format memory stats as a human-readable string.
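A sketch of how such a string can be assembled from public torch.cuda APIs; the library's exact fields and formatting may differ:

```python
import torch

def memory_stats_sketch(device=0):
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib
    peak = torch.cuda.max_memory_allocated(device) / gib
    reserved = torch.cuda.memory_reserved(device) / gib
    total = torch.cuda.get_device_properties(device).total_memory / gib
    return f"alloc={allocated:.1f} | peak={peak:.1f} | reserved={reserved:.1f} | total={total:.1f} GiB"
```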
- kempnerforge.metrics.format_metrics(step, metrics)[source]¶
Format a metrics dict into a compact, color-coded log line.
- Example output:
[step 1000] loss=2.34 | lr=3.00e-04 | grad_norm=1.2 | tok/s=125k | mfu=52.3% | mem=71.2/80GB
- kempnerforge.metrics.get_gpu_peak_tflops(device=0)[source]¶
Auto-detect GPU peak bf16 TFLOPS.
Tries to match the GPU name against known models. Falls back to a conservative estimate based on compute capability.
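A sketch of the name-matching approach with illustrative bf16 dense peak values (A100 ≈ 312 TFLOPS, H100 SXM ≈ 989 TFLOPS); the library's actual table and compute-capability fallback may differ:

```python
import torch

_PEAK_BF16_TFLOPS = {"H100": 989.0, "A100": 312.0}  # illustrative values

def peak_tflops_sketch(device=0):
    name = torch.cuda.get_device_name(device)
    for key, tflops in _PEAK_BF16_TFLOPS.items():
        if key in name:
            return tflops
    return 100.0  # arbitrary conservative fallback for unrecognized GPUs
```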
- kempnerforge.metrics.get_logger(name, rank_zero_only=True)[source]¶
Get a logger for the given module name.
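A sketch of rank-zero-only logging built on the standard logging module, assuming the process rank is available via the RANK environment variable (the library may determine rank differently):

```python
import logging
import os

def get_logger_sketch(name, rank_zero_only=True):
    logger = logging.getLogger(name)
    if rank_zero_only and int(os.environ.get("RANK", "0")) != 0:
        # Silence non-zero ranks so each message appears once per job.
        logger.addHandler(logging.NullHandler())
        logger.propagate = False
    return logger
```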
- kempnerforge.metrics.get_memory_utilization(device=0)[source]¶
Get peak memory utilization as a fraction of total GPU memory.
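A sketch of the ratio involved, using public torch.cuda APIs (it is an assumption that peak allocated, rather than reserved, memory is used in the numerator):

```python
import torch

def memory_utilization_sketch(device=0):
    total = torch.cuda.get_device_properties(device).total_memory
    return torch.cuda.max_memory_allocated(device) / total  # fraction of device capacity
```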
- kempnerforge.metrics.reset_peak_memory(device=0)[source]¶
Reset peak memory tracking counter.
- Parameters:
device (int)
- Return type:
None
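The per-interval peak behavior described for DeviceMemoryMonitor depends on clearing PyTorch's peak counters; a minimal sketch of the underlying call:

```python
import torch

def reset_peak_memory_sketch(device=0):
    # Clears max_memory_allocated / max_memory_reserved so the next
    # reported peak reflects only the upcoming interval.
    torch.cuda.reset_peak_memory_stats(device)
```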