kempnerforge.metrics.tracker

Metrics collection, accumulation, and reporting.

MetricsTracker aggregates per-step metrics (loss, grad norm, throughput, MFU, memory) and dispatches them to configured logging backends (stdout, WandB, TensorBoard) at a configurable interval.

Classes

MetricsTracker

Collects, smooths, and reports training metrics.

StepMetrics

Metrics for a single training step.

TensorBoardBackend

TensorBoard logging backend.

WandBBackend

Weights & Biases logging backend.

class kempnerforge.metrics.tracker.StepMetrics[source]

Bases: object

Metrics for a single training step.

loss: float = 0.0
grad_norm: float = 0.0
lr: float = 0.0
tokens_per_sec: float = 0.0
mfu: float = 0.0
step_time_sec: float = 0.0
allocated_gb: float = 0.0
peak_gb: float = 0.0
reserved_gb: float = 0.0
total_gb: float = 0.0
mem_utilization: float = 0.0
__init__(loss=0.0, grad_norm=0.0, lr=0.0, tokens_per_sec=0.0, mfu=0.0, step_time_sec=0.0, allocated_gb=0.0, peak_gb=0.0, reserved_gb=0.0, total_gb=0.0, mem_utilization=0.0)
Parameters:
Return type:

None

class kempnerforge.metrics.tracker.MetricsTracker[source]

Bases: object

Collects, smooths, and reports training metrics.

Timing is handled internally — call start_step() before and end_step() after each training step. Metrics are logged to all configured backends at the configured interval.

Parameters:
  • config – Full job config (used for MFU calculation and backend selection).

  • num_gpus – Number of GPUs for MFU denominator.

  • gpu_peak_tflops – Per-GPU peak TFLOPS. If None, auto-detected.

__init__(config, num_gpus=1, gpu_peak_tflops=None)[source]
Parameters:
Return type:

None

start_step()[source]

Mark the beginning of a training step.

Return type:

None

end_step(step, loss, grad_norm, lr, tokens_in_step)[source]

Mark the end of a training step and optionally log metrics.

Parameters:
  • step (int) – Current training step number.

  • loss (float) – Loss value for this step.

  • grad_norm (float) – Gradient norm (after clipping).

  • lr (float) – Current learning rate.

  • tokens_in_step (int) – Total tokens processed in this step (across all GPUs).

Returns:

StepMetrics if this step was a logging step, None otherwise.

Return type:

StepMetrics | None

log_eval(metrics, step)[source]

Log eval metrics to all backends and stdout.

Parameters:
Return type:

None

init_backends(config)[source]

Initialize logging backends (call after distributed setup).

Parameters:

config (JobConfig)

Return type:

None

close()[source]

Flush and close all logging backends.

Return type:

None

class kempnerforge.metrics.tracker.WandBBackend[source]

Bases: _LoggingBackend

Weights & Biases logging backend.

Initializes a WandB run on first log call.

__init__(config)[source]
Parameters:

config (MetricsConfig)

Return type:

None

log(metrics, step)[source]
Parameters:
Return type:

None

close()[source]
Return type:

None

class kempnerforge.metrics.tracker.TensorBoardBackend[source]

Bases: _LoggingBackend

TensorBoard logging backend.

__init__(config)[source]
Parameters:

config (MetricsConfig)

Return type:

None

log(metrics, step)[source]
Parameters:
Return type:

None

close()[source]
Return type:

None