kempnerforge.metrics.tracker

Metrics collection, accumulation, and reporting.

MetricsTracker aggregates per-step metrics (loss, grad norm, throughput, MFU, memory) and dispatches them to configured logging backends (stdout, WandB, TensorBoard) at a configurable interval.

Classes

`MetricsTracker`	Collects, smooths, and reports training metrics.
`StepMetrics`	Metrics for a single training step.
`TensorBoardBackend`	TensorBoard logging backend.
`WandBBackend`	Weights & Biases logging backend.

class kempnerforge.metrics.tracker.StepMetrics

Bases: object

Metrics for a single training step.

loss: float = 0.0

grad_norm: float = 0.0

lr: float = 0.0

tokens_per_sec: float = 0.0

mfu: float = 0.0

step_time_sec: float = 0.0

allocated_gb: float = 0.0

peak_gb: float = 0.0

reserved_gb: float = 0.0

total_gb: float = 0.0

mem_utilization: float = 0.0

__init__(loss=0.0, grad_norm=0.0, lr=0.0, tokens_per_sec=0.0, mfu=0.0, step_time_sec=0.0, allocated_gb=0.0, peak_gb=0.0, reserved_gb=0.0, total_gb=0.0, mem_utilization=0.0)

Parameters:

loss (float)
grad_norm (float)
lr (float)
tokens_per_sec (float)
mfu (float)
step_time_sec (float)
allocated_gb (float)
peak_gb (float)
reserved_gb (float)
total_gb (float)
mem_utilization (float)

Return type:

None
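The field list above has the shape of a Python dataclass whose fields all default to 0.0. The following is a minimal stand-in, assuming that is how the class is defined (the page does not confirm it). It shows how such a record can be flattened into the dict[str, float] shape that the backends' log() methods accept. StepMetricsSketch is a hypothetical name and is not part of kempnerforge.

```python
from dataclasses import asdict, dataclass


@dataclass
class StepMetricsSketch:
    """Hypothetical stand-in mirroring the documented StepMetrics fields."""

    loss: float = 0.0
    grad_norm: float = 0.0
    lr: float = 0.0
    tokens_per_sec: float = 0.0
    mfu: float = 0.0
    step_time_sec: float = 0.0
    allocated_gb: float = 0.0
    peak_gb: float = 0.0
    reserved_gb: float = 0.0
    total_gb: float = 0.0
    mem_utilization: float = 0.0


# Fields that are not passed fall back to 0.0, as in the documented signature.
m = StepMetricsSketch(loss=2.31, lr=3e-4, tokens_per_sec=125_000.0)

# Flatten to a plain dict keyed by field name.
flat = asdict(m)
print(flat["loss"], flat["grad_norm"])
```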

class kempnerforge.metrics.tracker.MetricsTracker

Bases: object

Collects, smooths, and reports training metrics.

Timing is handled internally: call start_step() immediately before and end_step() immediately after each training step. Metrics are logged to all configured backends at the configured logging interval.

Parameters:

config – Full job config (used for MFU calculation and backend selection).
num_gpus – Number of GPUs for MFU denominator.
gpu_peak_tflops – Per-GPU peak TFLOPS. If None, auto-detected.
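This page does not give the MFU formula. A common definition divides the achieved model FLOPs per second by the aggregate hardware peak, which is where num_gpus and gpu_peak_tflops would enter. The sketch below assumes the widely used approximation of 6 * parameters FLOPs per token for a dense transformer's forward and backward pass. estimate_mfu and num_params are hypothetical names, and the tracker may count FLOPs differently.

```python
def estimate_mfu(tokens_per_sec: float, num_params: int,
                 num_gpus: int, gpu_peak_tflops: float) -> float:
    """Approximate model FLOPs utilization as a fraction of aggregate peak."""
    # Assumption: forward + backward costs about 6 FLOPs per parameter per
    # token; attention FLOPs are ignored in this approximation.
    flops_per_token = 6 * num_params
    achieved_flops_per_sec = tokens_per_sec * flops_per_token
    # Denominator: per-GPU peak (in TFLOPS) scaled by the number of GPUs.
    peak_flops_per_sec = num_gpus * gpu_peak_tflops * 1e12
    return achieved_flops_per_sec / peak_flops_per_sec


mfu = estimate_mfu(
    tokens_per_sec=100_000,
    num_params=1_000_000_000,
    num_gpus=8,
    gpu_peak_tflops=100.0,
)
print(f"MFU = {mfu:.2%}")
```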

__init__(config, num_gpus=1, gpu_peak_tflops=None)

Parameters:

config (JobConfig)
num_gpus (int)
gpu_peak_tflops (float | None)

Return type:

None

start_step()

Mark the beginning of a training step.

Return type:

None

end_step(step, loss, grad_norm, lr, tokens_in_step)

Mark the end of a training step and optionally log metrics.

Parameters:

step (int) – Current training step number.
loss (float) – Loss value for this step.
grad_norm (float) – Gradient norm (after clipping).
lr (float) – Current learning rate.
tokens_in_step (int) – Total tokens processed in this step (across all GPUs).

Returns:

StepMetrics if this step was a logging step, None otherwise.

Return type:

StepMetrics | None
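The start_step()/end_step() protocol can be illustrated with a small stand-in that times each step and returns a result only on logging steps. This is a sketch under stated assumptions, not the kempnerforge implementation: TrackerSketch and log_interval are hypothetical names, the result is a plain dict rather than a StepMetrics object, and no smoothing, MFU, or memory statistics are computed.

```python
import time
from typing import Optional


class TrackerSketch:
    """Hypothetical stand-in for the documented start_step()/end_step() protocol."""

    def __init__(self, log_interval: int = 10) -> None:
        self.log_interval = log_interval  # assumed name for the logging interval
        self._t0 = 0.0

    def start_step(self) -> None:
        # Record the wall-clock start of the step.
        self._t0 = time.perf_counter()

    def end_step(self, step: int, loss: float, grad_norm: float,
                 lr: float, tokens_in_step: int) -> Optional[dict]:
        elapsed = time.perf_counter() - self._t0
        if step % self.log_interval != 0:
            return None  # not a logging step
        return {
            "loss": loss,
            "grad_norm": grad_norm,
            "lr": lr,
            "step_time_sec": elapsed,
            "tokens_per_sec": tokens_in_step / elapsed if elapsed > 0 else 0.0,
        }


tracker = TrackerSketch(log_interval=2)
logged_steps = []
result = None
for step in range(1, 5):
    tracker.start_step()
    time.sleep(0.001)  # placeholder for forward/backward/optimizer work
    result = tracker.end_step(step, loss=1.0 / step, grad_norm=1.0,
                              lr=1e-3, tokens_in_step=4096)
    if result is not None:
        logged_steps.append(step)

print(logged_steps)
```

Callers should handle the None return, since most steps are not logging steps.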

log_eval(metrics, step)

Log eval metrics to all backends and stdout.

Parameters:

metrics (dict[str, float])
step (int)

Return type:

None
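Logging to "all backends and stdout" amounts to a fan-out over objects that share the log(metrics, step) signature. The sketch below illustrates that pattern with an in-memory backend. ListBackend, log_eval_sketch, the metric names, and the stdout line format are all hypothetical; the actual output format is not specified on this page.

```python
class ListBackend:
    """Hypothetical in-memory backend with the documented log(metrics, step) shape."""

    def __init__(self) -> None:
        self.records = []

    def log(self, metrics: dict, step: int) -> None:
        # Copy the dict so later mutation by the caller does not alter history.
        self.records.append((step, dict(metrics)))


def log_eval_sketch(backends: list, metrics: dict, step: int) -> str:
    """Send metrics to every backend, print one summary line, and return it."""
    for backend in backends:
        backend.log(metrics, step)
    line = f"[eval] step={step} " + " ".join(
        f"{name}={value:.4f}" for name, value in metrics.items()
    )
    print(line)
    return line


a, b = ListBackend(), ListBackend()
line = log_eval_sketch([a, b], {"eval_loss": 2.5, "eval_ppl": 12.1825}, step=100)
```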

init_backends(config)

Initialize logging backends (call after distributed setup).

Parameters:

config (JobConfig)

Return type:

None

close()

Flush and close all logging backends.

Return type:

None

class kempnerforge.metrics.tracker.WandBBackend

Bases: _LoggingBackend

Weights & Biases logging backend.

Initializes a WandB run on first log call.
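Deferring run creation to the first log call is a lazy-initialization pattern: constructing the backend only stores its settings, and the run is created when it is first needed. The following sketch illustrates the pattern without importing wandb. LazyBackendSketch, project, and init_count are hypothetical names, and a plain dict stands in for the run object.

```python
class LazyBackendSketch:
    """Hypothetical backend that defers run creation to the first log() call."""

    def __init__(self, project: str) -> None:
        self.project = project  # stored only; nothing is opened yet
        self._run = None
        self.init_count = 0  # counts how many times a run was created

    def _ensure_run(self) -> dict:
        if self._run is None:
            # A real WandB backend would presumably start its run here.
            self._run = {"project": self.project, "history": []}
            self.init_count += 1
        return self._run

    def log(self, metrics: dict, step: int) -> None:
        self._ensure_run()["history"].append((step, metrics))

    def close(self) -> None:
        # A real backend would finish and flush its run here.
        self._run = None


backend = LazyBackendSketch(project="demo")
created_before = backend.init_count
backend.log({"loss": 1.0}, step=1)
backend.log({"loss": 0.9}, step=2)
created_after = backend.init_count
print(created_before, created_after)
```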

__init__(config)

Parameters:

config (MetricsConfig)

Return type:

None

log(metrics, step)

Parameters:

metrics (dict[str, float])
step (int)

Return type:

None

close()

Return type:

None

class kempnerforge.metrics.tracker.TensorBoardBackend

Bases: _LoggingBackend

TensorBoard logging backend.

__init__(config)

Parameters:

config (MetricsConfig)

Return type:

None

log(metrics, step)

Parameters:

metrics (dict[str, float])
step (int)

Return type:

None

close()

Return type:

None