kempnerforge.training.eval

Evaluation utilities for KempnerForge.

Provides run_eval for computing eval loss and perplexity on a held-out dataset. Works with any parallel model (FSDP, TP, PP) — same model reference, no unwrapping needed.

Functions

should_build_eval_dataloader(eval_enabled, ...)

Decide whether to build an eval dataloader and whether to warn.

kempnerforge.training.eval.should_build_eval_dataloader(eval_enabled, is_vlm)

Decide whether to build an eval dataloader and whether to warn.

The training loop calls run_eval(model, eval_dataloader, ...), which invokes model(input_ids). This does not match VLMWrapper.forward(pixel_values, input_ids, labels), so VLM configs with eval.enabled=true would crash on the first eval interval. This helper gates the eval setup: for VLM configs it suppresses eval and flags that a warning should be logged, so users see that their eval setting was ignored. VLM eval support is a tracked follow-up.

Returns (should_build, should_warn_vlm_skip).

Parameters:
  • eval_enabled (bool) – Whether evaluation is enabled in the config (eval.enabled).

  • is_vlm (bool) – Whether the model is a VLM (wrapped in VLMWrapper).

Return type:

tuple[bool, bool]
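The gating rules above can be sketched as a small truth table. The following is a minimal reimplementation for illustration, based only on the behavior the docstring describes, not the library source:

```python
def should_build_eval_dataloader(eval_enabled: bool, is_vlm: bool) -> tuple[bool, bool]:
    """Sketch of the eval-gating logic described above."""
    if not eval_enabled:
        return (False, False)  # eval disabled: nothing to build, nothing to warn about
    if is_vlm:
        return (False, True)   # VLM: suppress eval, warn that the setting was ignored
    return (True, False)       # text-only model with eval enabled: build the dataloader

# Typical call site in training setup:
build, warn = should_build_eval_dataloader(eval_enabled=True, is_vlm=True)
# build is False and warn is True: skip the eval dataloader and log a warning
```

The two-flag return keeps logging at the call site, so this helper stays pure and easy to unit-test.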

kempnerforge.training.eval.run_eval(model, eval_dataloader, loss_fn, device, eval_steps, *, pp_schedule=None, pp_rank=None, pp_size=None, pp_group=None)

Run evaluation and return metrics.

Parameters:
  • model (torch.nn.Module) – The model (FSDP-wrapped, TP-sharded, or plain).

  • eval_dataloader (torch.utils.data.DataLoader) – DataLoader yielding {"input_ids", "labels"} batches.

  • loss_fn (callable) – Loss function (logits, labels) -> scalar tensor.

  • device (torch.device) – Device to move batches to.

  • eval_steps (int) – Number of eval batches to process.

  • pp_schedule – Pipeline parallel schedule (None for non-PP).

  • pp_rank (int | None) – This rank’s PP stage index.

  • pp_size (int | None) – Total number of PP stages.

  • pp_group – Process group for PP loss broadcast.

Returns:

Dict with "eval/loss" and "eval/perplexity".

Return type:

dict[str, float]
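Stripped of device placement and the pipeline-parallel branches, the core of run_eval amounts to averaging the loss over eval_steps batches and exponentiating the mean to get perplexity. A simplified, framework-free sketch (run_eval_sketch and the toy model/loss below are illustrative stand-ins, not the library API):

```python
import math

def run_eval_sketch(model, eval_dataloader, loss_fn, eval_steps):
    # Simplified stand-in for run_eval: no device moves, no PP schedule.
    total_loss, n_batches = 0.0, 0
    for step, batch in enumerate(eval_dataloader):
        if step >= eval_steps:
            break
        logits = model(batch["input_ids"])  # forward pass on input_ids only
        total_loss += float(loss_fn(logits, batch["labels"]))
        n_batches += 1
    avg_loss = total_loss / max(n_batches, 1)
    # Perplexity is the exponential of the mean cross-entropy loss.
    return {"eval/loss": avg_loss, "eval/perplexity": math.exp(avg_loss)}

# Toy usage: an identity "model" and a loss that is always 0.0.
batches = [{"input_ids": [1, 2], "labels": [1, 2]} for _ in range(4)]
metrics = run_eval_sketch(lambda x: x, batches, lambda lg, lb: 0.0, eval_steps=2)
# metrics == {"eval/loss": 0.0, "eval/perplexity": 1.0}
```

In the real function the loop would also wrap the forward pass in torch.no_grad(), move batches to device, and (for PP) broadcast the loss over pp_group, which is why the extra keyword arguments exist.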