kempnerforge.metrics.mfu

Model FLOPs Utilization (MFU) computation.

Implements the PaLM paper formula for estimating achieved FLOPS relative to hardware peak, with auto-detection of GPU capabilities.

MFU = achieved_tflops / peak_tflops

Where:

model_flops_per_token = 6*P + 12*L*D*S   (forward + backward)
achieved_tflops = model_flops_per_token * tokens_per_sec / 1e12
peak_tflops = num_gpus * per-GPU peak bf16 TFLOPS

with P the (active, non-embedding) parameter count, L the number of layers, D the model dimension, and S the sequence length.
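
As a worked example of the arithmetic (all numbers hypothetical, chosen only to make the units concrete):

    # Hypothetical dense 7B model: 32 layers, model dim 4096, seq len 4096.
    P, L, D, S = 7e9, 32, 4096, 4096
    flops_per_token = 6 * P + 12 * L * D * S                   # ~4.84e10 FLOPs/token

    tokens_per_sec = 50_000                                    # global, across all GPUs
    achieved_tflops = flops_per_token * tokens_per_sec / 1e12  # ~2422 TFLOPS

    num_gpus, per_gpu_peak = 8, 989.0                          # H100 SXM bf16 dense peak
    mfu = achieved_tflops / (num_gpus * per_gpu_peak)          # ~0.31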

Functions

compute_mfu(config, tokens_per_sec[, ...])
    Compute Model FLOPs Utilization.

estimate_model_flops_per_token(config[, seq_len])
    Estimate FLOPs per token for the forward + backward pass.

get_gpu_peak_tflops([device])
    Auto-detect GPU peak bf16 TFLOPS.

kempnerforge.metrics.mfu.get_gpu_peak_tflops(device=0)[source]

Auto-detect GPU peak bf16 TFLOPS.

Tries to match the GPU name against known models. Falls back to a conservative estimate based on compute capability.

Parameters:

device (int) – CUDA device index.

Returns:

Peak bf16 TFLOPS for this GPU.

Return type:

float
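
A minimal usage sketch; the printed value is illustrative and depends on the detected GPU:

    >>> from kempnerforge.metrics.mfu import get_gpu_peak_tflops
    >>> get_gpu_peak_tflops(device=0)   # hypothetical output on an H100
    989.0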

kempnerforge.metrics.mfu.estimate_model_flops_per_token(config, seq_len=None)[source]

Estimate FLOPs per token for the forward + backward pass.

Uses the PaLM paper approximation: 6*P + 12*L*D*S

• MoE: uses active parameters only (top_k experts per layer, not all experts).

• Excludes the embedding (a table lookup, not a matmul); includes the output projection.

• The 12*L*D*S attention term does not discount GQA: FlashAttention expands GQA internally, so the hardware performs full attention compute.

• Router FLOPs (dim × num_experts) are intentionally omitted as negligible.

Parameters:
  • config (ModelConfig) – Model configuration.

  • seq_len (int | None) – Actual training sequence length. Falls back to config.max_seq_len if not provided.

Returns:

Estimated FLOPs per token.

Return type:

int
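
To make the MoE active-parameter rule concrete, here is a hedged hand count under stated assumptions (all sizes are hypothetical; the library derives the real values from the ModelConfig):

    # Hypothetical MoE: 32 layers, dim 4096, 8 experts with top_k=2,
    # ~0.4e9 params per expert per layer, ~0.1e9 shared (attention etc.) per layer.
    L, D, S = 32, 4096, 4096
    active_params_per_layer = 0.1e9 + 2 * 0.4e9   # count top_k experts, not all 8
    P_active = L * active_params_per_layer        # ~2.88e10 active params
    flops_per_token = 6 * P_active + 12 * L * D * S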

kempnerforge.metrics.mfu.compute_mfu(config, tokens_per_sec, num_gpus=1, gpu_peak_tflops=None, seq_len=None)[source]

Compute Model FLOPs Utilization.

Parameters:
  • config (ModelConfig) – Model configuration.

  • tokens_per_sec (float) – Global throughput (tokens/sec across all GPUs).

  • num_gpus (int) – Number of GPUs.

  • gpu_peak_tflops (float | None) – Peak bf16 TFLOPS per GPU. Auto-detected if None.

  • seq_len (int | None) – Actual training sequence length, used for the attention FLOPs term. Falls back to config.max_seq_len if not provided.

Returns:

MFU as a fraction (0.0 to 1.0).

Return type:

float
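
Putting it together, a minimal end-to-end sketch. The ModelConfig construction is omitted (its fields are not documented on this page), and all numbers are hypothetical:

    from kempnerforge.metrics.mfu import compute_mfu

    # config: a ModelConfig describing the trained model (construction omitted).
    mfu = compute_mfu(
        config,
        tokens_per_sec=50_000,    # global throughput across all GPUs
        num_gpus=8,
        gpu_peak_tflops=989.0,    # or None to auto-detect the bf16 peak
        seq_len=4096,             # actual training sequence length
    )
    print(f"MFU: {mfu:.1%}")      # ~30.6% for the hypothetical 7B model above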