kempnerforge.metrics.tracker
Metrics collection, accumulation, and reporting.
MetricsTracker aggregates per-step metrics (loss, grad norm, throughput, MFU, memory) and dispatches them to configured logging backends (stdout, WandB, TensorBoard) at a configurable interval.
Classes
MetricsTracker | Collects, smooths, and reports training metrics.
StepMetrics | Metrics for a single training step.
TensorBoardBackend | TensorBoard logging backend.
WandBBackend | Weights & Biases logging backend.
- class kempnerforge.metrics.tracker.StepMetrics
Bases: object
Metrics for a single training step.
- __init__(loss=0.0, grad_norm=0.0, lr=0.0, tokens_per_sec=0.0, mfu=0.0, step_time_sec=0.0, allocated_gb=0.0, peak_gb=0.0, reserved_gb=0.0, total_gb=0.0, mem_utilization=0.0)
- class kempnerforge.metrics.tracker.MetricsTracker
Bases: object
Collects, smooths, and reports training metrics.
Timing is handled internally: call start_step() before and end_step() after each training step. Metrics are logged to all configured backends at the configured interval.
- Parameters:
config – Full job config (used for MFU calculation and backend selection).
num_gpus – Number of GPUs for MFU denominator.
gpu_peak_tflops – Per-GPU peak TFLOPS. If None, auto-detected.
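The role of num_gpus and gpu_peak_tflops as the MFU denominator can be illustrated with a small sketch. This is not the library's code: the function name, the assumption that tokens_per_sec is a global (all-GPU) rate, and the 6 x parameter-count estimate of FLOPs per token for a dense transformer are all assumptions made for illustration.

```python
def model_flops_utilization(
    tokens_per_sec: float,
    flops_per_token: float,
    num_gpus: int,
    gpu_peak_tflops: float,
) -> float:
    """Return achieved FLOPS as a fraction of aggregate peak FLOPS.

    Hypothetical helper; tokens_per_sec is assumed to be summed over all GPUs.
    """
    achieved_flops_per_sec = tokens_per_sec * flops_per_token
    # Denominator: every GPU running at its peak rate (TFLOPS -> FLOPS).
    peak_flops_per_sec = num_gpus * gpu_peak_tflops * 1e12
    return achieved_flops_per_sec / peak_flops_per_sec


# Made-up numbers, for illustration only.
param_count = 1.0e9
flops_per_token = 6 * param_count  # assumed forward + backward estimate
mfu_value = model_flops_utilization(
    tokens_per_sec=100_000,
    flops_per_token=flops_per_token,
    num_gpus=8,
    gpu_peak_tflops=100.0,
)
print(f"MFU = {mfu_value:.2%}")
```

Because the denominator scales with num_gpus, passing a GPU count that does not match the throughput being measured would skew the reported value.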
- end_step(step, loss, grad_norm, lr, tokens_in_step)
Mark the end of a training step and optionally log metrics.
- Parameters:
- Returns:
StepMetrics if this step was a logging step, None otherwise.
- Return type:
StepMetrics | None
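The start_step() / end_step() calling pattern described above can be sketched with a stand-in class. SketchTracker and SketchStepMetrics are hypothetical names, not part of kempnerforge; the sketch assumes that a "logging step" is one where step is a multiple of the interval, and it omits smoothing, MFU, memory statistics, and backends.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchStepMetrics:
    """Hypothetical stand-in carrying a subset of the StepMetrics fields."""
    loss: float = 0.0
    grad_norm: float = 0.0
    lr: float = 0.0
    tokens_per_sec: float = 0.0
    step_time_sec: float = 0.0


class SketchTracker:
    """Hypothetical stand-in that times steps and reports every N steps."""

    def __init__(self, log_interval: int = 10) -> None:
        self.log_interval = log_interval
        self._t0: Optional[float] = None

    def start_step(self) -> None:
        self._t0 = time.perf_counter()

    def end_step(self, step, loss, grad_norm, lr, tokens_in_step):
        if self._t0 is None:
            raise RuntimeError("start_step() must be called before end_step()")
        elapsed = time.perf_counter() - self._t0
        self._t0 = None
        if step % self.log_interval != 0:
            return None  # not a logging step
        tokens_per_sec = tokens_in_step / elapsed if elapsed > 0 else 0.0
        return SketchStepMetrics(loss, grad_norm, lr, tokens_per_sec, elapsed)


tracker = SketchTracker(log_interval=2)
logged_steps = []
for step in range(1, 5):
    tracker.start_step()
    # ... forward pass, backward pass, and optimizer step would go here ...
    metrics = tracker.end_step(
        step, loss=1.0 / step, grad_norm=0.5, lr=1e-3, tokens_in_step=1024
    )
    if metrics is not None:  # returned only on logging steps
        logged_steps.append(step)
```

Checking the return value against None is the only way a caller can tell whether a given step was a logging step, so any code that consumes the metrics should guard on it as the loop does.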
- class kempnerforge.metrics.tracker.WandBBackend
Bases: _LoggingBackend
Weights & Biases logging backend.
Initializes a WandB run on first log call.
- __init__(config)
- Parameters:
config (MetricsConfig)
- Return type:
None
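Initializing the run on the first log call, rather than in the constructor, is a lazy-initialization pattern. The sketch below shows the general idea with a hypothetical LazyRunBackend class and a fake run factory; it does not import wandb and does not claim to match how WandBBackend is written.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class LazyRunBackend:
    """Hypothetical backend that defers run creation until the first log()."""

    def __init__(self, start_run: Callable[[], Any]) -> None:
        self._start_run = start_run  # factory is stored, not called
        self._run: Optional[Any] = None

    def log(self, metrics: Dict[str, float], step: int) -> None:
        if self._run is None:
            self._run = self._start_run()  # created once, on first use
        self._run.append((step, dict(metrics)))


# Fake run factory: records each call and returns a list that acts as the run.
start_calls: List[bool] = []


def fake_start_run() -> List[Tuple[int, Dict[str, float]]]:
    start_calls.append(True)
    return []


backend = LazyRunBackend(fake_start_run)
calls_before_logging = len(start_calls)  # constructing the backend starts nothing
backend.log({"loss": 2.0}, step=1)
backend.log({"loss": 1.5}, step=2)
calls_after_logging = len(start_calls)   # factory ran exactly once
```

One consequence of this pattern is that a process that never logs, for example a rank that is not responsible for reporting, never creates a run at all.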
- class kempnerforge.metrics.tracker.TensorBoardBackend
Bases: _LoggingBackend
TensorBoard logging backend.
- __init__(config)
- Parameters:
config (MetricsConfig)
- Return type:
None