Run evaluation¶
Three entry points, three use cases:
| Entry point | When to use |
|---|---|
| `[eval]` block in the training TOML | Periodic eval loss during a training run |
| `scripts/eval.py` | Standalone eval on a saved checkpoint — quick loss/perplexity on new data |
| `scripts/eval_harness.py` | Downstream benchmarks (HellaSwag, ARC, MMLU, …) via lm-eval-harness |
All three go through the same underlying `run_eval` function when they need loss; the harness path additionally converts the DCP checkpoint to HuggingFace format and hands off to `lm_eval`.
Path 1: eval inside the training loop¶
Flip on `[eval]` in your TOML:

```toml
[eval]
enabled = true
interval = 500              # steps between evals
steps = 50                  # batches per eval
dataset_path = "/data/eval_set"
file_pattern = "*.bin"

# OR — HuggingFace eval data:
# hf_dataset_name = "wikitext"
# hf_dataset_config = "wikitext-103-raw-v1"
# hf_dataset_split = "validation"
```
Every `interval` steps, the training loop pauses the optimizer, switches the model to eval mode, iterates `steps` batches of held-out data, and logs `eval/loss` and `eval/perplexity` through `MetricsTracker.log_eval`.
The stdout backend renders one compact line per call:
```
[step 500] eval/loss=3.8200 | eval/perplexity=45.6000
```
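A minimal sketch of what that hook amounts to; the helper names and the `MetricsTracker.log_eval` signature here are assumptions, not the exact code:

```python
# Sketch of the periodic eval step. Wrap-around handling for short
# datasets is shown in the next snippet.
import math
import torch

@torch.no_grad()
def periodic_eval(model, eval_loader, loss_fn, device, eval_steps, step, metrics):
    model.eval()
    total = 0.0
    batches = iter(eval_loader)
    for _ in range(eval_steps):
        inputs, targets = (t.to(device) for t in next(batches))  # assumes (input, target) batches
        total += loss_fn(model(inputs), targets).item()
    avg_loss = total / eval_steps
    metrics.log_eval(step, {"eval/loss": avg_loss,
                            "eval/perplexity": math.exp(min(avg_loss, 20.0))})
    model.train()  # hand control back to the optimizer loop
```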
The eval dataloader is a plain `DataLoader` with a `DistributedSampler`
(not a `StatefulDataLoader`). It’s deterministic because it doesn’t
shuffle, and it’s allowed to reset on epoch boundaries: if `steps`
exceeds the dataset, eval iteration catches `StopIteration` and
re-initializes the iterator to wrap around.
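A sketch of that setup and the wrap-around, assuming a map-style `eval_dataset`:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Plain DataLoader: no shuffling, so iteration order is deterministic.
sampler = DistributedSampler(eval_dataset, shuffle=False)
eval_loader = DataLoader(eval_dataset, batch_size=8, sampler=sampler)

batches = iter(eval_loader)
for _ in range(eval_steps):
    try:
        batch = next(batches)
    except StopIteration:        # eval_steps exceeds the dataset: wrap around
        batches = iter(eval_loader)
        batch = next(batches)
```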
Perplexity is capped¶
{"eval/loss": avg_loss, "eval/perplexity": math.exp(min(avg_loss, 20.0))}
`exp(20) ≈ 4.85e8`. When the model hasn’t converged yet and the loss is
huge, `exp` would overflow, so the clamp keeps the number finite.
Treat perplexity ≈ 5e8 as “loss is still blowing up,” not as a real
perplexity.
PP integration¶
`run_eval` auto-detects pipeline parallelism via its `pp_schedule`
argument. On PP stages, eval runs through the same
`schedule.step(input, target, losses)` machinery as training; the
last stage broadcasts the loss back to rank 0 so logging stays
consistent.
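In `torch.distributed.pipelining` terms, the shape of that path looks roughly like this; the stage/rank bookkeeping is simplified and the variable names are assumptions:

```python
import torch
import torch.distributed as dist

losses = []
if is_first_stage:
    schedule.step(inputs)                         # feed inputs at the front
elif is_last_stage:
    schedule.step(target=targets, losses=losses)  # per-microbatch losses land here
else:
    schedule.step()                               # middle stages just relay

# Only the last stage holds a real loss; broadcast it so every rank
# (including rank 0, which does the logging) sees the same number.
loss = (torch.stack(losses).mean() if is_last_stage
        else torch.zeros((), device=device))
dist.broadcast(loss, src=last_stage_rank)
```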
Path 2: standalone eval on a checkpoint¶
```bash
# Single GPU
uv run python scripts/eval.py configs/train/7b.toml \
    --checkpoint.load_path=checkpoints/7b/step_10000 \
    --eval.dataset_path=/data/eval_set \
    --eval.steps=100

# Multi-GPU (FSDP for large models)
uv run torchrun --nproc_per_node=4 scripts/eval.py configs/train/7b.toml \
    --checkpoint.load_path=checkpoints/7b/step_10000 \
    --eval.dataset_path=/data/eval_set
```
`scripts/eval.py` is the training script minus the optimizer:

- Same `load_config` + `init_distributed` + `build_parallel_model` path, so FSDP / TP configurations work unchanged
- Loads the checkpoint via `CheckpointManager(...)` with `exclude_keys=["optimizer"]`, i.e. model + RNG state only
- Runs `run_eval(model, dataloader, loss_fn, device, eval_steps)` and prints results to stdout plus a JSON line
Output:
```
==================================================
Evaluation Results (step 10000)
==================================================
eval/loss: 2.4321
eval/perplexity: 11.3856
==================================================
{"step": 10000, "tokens_seen": 4194304000, "eval/loss": 2.4321, ...}
```
The JSON dump at the end is useful for scripting — redirect to a file and diff across checkpoints.
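For example, a throwaway script along these lines can rank checkpoints by loss; the `results/` layout is hypothetical, and the JSON object is the last line of each redirected log:

```python
import json
import pathlib

rows = []
for path in sorted(pathlib.Path("results").glob("step_*.json")):
    rows.append(json.loads(path.read_text().splitlines()[-1]))

for row in rows:
    print(f'step {row["step"]:>8}  eval/loss {row["eval/loss"]:.4f}')
print("best:", min(rows, key=lambda r: r["eval/loss"])["step"])
```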
HuggingFace eval data path¶
```bash
uv run python scripts/eval.py configs/train/7b.toml \
    --checkpoint.load_path=checkpoints/7b/step_10000 \
    --eval.hf_dataset_name=wikitext \
    --eval.hf_dataset_config=wikitext-103-raw-v1
```
Rank 0 tokenizes the full eval split (via `HuggingFaceDataset`) and
broadcasts packed sequences to all other ranks. That’s fine for
benchmark-sized eval sets; avoid it for anything multi-GB.
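The pattern is the usual rank-0-does-the-work broadcast; a sketch, with `tokenize_and_pack` standing in for whatever `HuggingFaceDataset` actually does:

```python
import torch.distributed as dist

payload = [None]
if dist.get_rank() == 0:
    payload[0] = tokenize_and_pack("wikitext", "wikitext-103-raw-v1",
                                   split="validation")  # heavy work, rank 0 only
dist.broadcast_object_list(payload, src=0)  # pickles and ships to every rank
packed_sequences = payload[0]
```

Everything travels through a pickled object broadcast, which is why multi-GB eval sets are a bad fit for this path.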
Pipeline parallelism is not supported here¶
The standalone script’s model build goes through `build_parallel_model`,
not the PP stage builder. If you need PP eval, drive it from the
training loop (Path 1).
Path 3: lm-eval-harness for downstream tasks¶
The training-loss eval is useful for training signal; for downstream benchmarks you want lm-eval-harness.
```bash
# Install the extra (lm-eval is optional — not a default dependency)
uv add lm-eval

# Run the default task suite: hellaswag, arc_easy, arc_challenge,
# winogrande, piqa, boolq
uv run python scripts/eval_harness.py \
    --checkpoint checkpoints/7b/step_10000 \
    --config configs/train/7b.toml

# Specific tasks
uv run python scripts/eval_harness.py \
    --checkpoint checkpoints/7b/step_10000 \
    --config configs/train/7b.toml \
    --tasks hellaswag,mmlu,arc_easy

# Pre-converted HF model — skip DCP conversion
uv run python scripts/eval_harness.py \
    --hf-model ./exports/my_model \
    --tasks hellaswag
```
`scripts/eval_harness.py` does three things:

1. Converts the DCP checkpoint to HuggingFace format via `dcp_to_hf()` from `scripts/convert_checkpoint.py`, into a tempdir
2. Calls `lm_eval.simple_evaluate(model="hf", model_args=...)` with the task list
3. Prints results and optionally writes JSON via `--output`
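Step 2 boils down to something like the following; the exact `model_args` string and output path are assumptions, not what the script passes:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/tmp/converted_ckpt,dtype=bfloat16",
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
print(results["results"]["hellaswag"])  # per-task metrics, e.g. acc / acc_norm
```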
Flags¶
| Flag | Default | Purpose |
|---|---|---|
| `--checkpoint` | — | DCP checkpoint dir (will be converted) |
| `--config` | — | TOML config (required with `--checkpoint`) |
| `--hf-model` | — | Pre-converted HF dir, skip conversion |
| `--tasks` | default suite above | Comma-separated task list |
| `--batch-size` | — | Eval batch size |
| `--num-fewshot` | — | Override few-shot count |
| `--output` | — | Save full JSON results |
Conversion caveat¶
DCP → HF conversion is a one-time cost per checkpoint: the full
model is materialized on one device, keys are remapped, and weights
are written as safetensors. Time scales with model size and filesystem
throughput, so larger checkpoints on networked storage take longer. If
you’re scanning many checkpoints with the harness, convert once
manually and re-use the output with `--hf-model`:
```bash
uv run python scripts/convert_checkpoint.py dcp-to-hf \
    --dcp-dir checkpoints/7b/step_10000 \
    --hf-dir exports/7b_step_10000 \
    --config configs/model/llama_7b.toml

uv run python scripts/eval_harness.py \
    --hf-model exports/7b_step_10000 \
    --tasks hellaswag,mmlu
```
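If you’re scanning a whole run, a driver along these lines keeps the convert-once invariant; the directory layout is an assumption:

```python
import pathlib
import subprocess

for ckpt in sorted(pathlib.Path("checkpoints/7b").glob("step_*")):
    hf_dir = pathlib.Path("exports") / f"7b_{ckpt.name}"
    if not hf_dir.exists():  # convert each DCP checkpoint exactly once
        subprocess.run(["uv", "run", "python", "scripts/convert_checkpoint.py",
                        "dcp-to-hf", "--dcp-dir", str(ckpt),
                        "--hf-dir", str(hf_dir),
                        "--config", "configs/model/llama_7b.toml"], check=True)
    subprocess.run(["uv", "run", "python", "scripts/eval_harness.py",
                    "--hf-model", str(hf_dir),
                    "--tasks", "mmlu"], check=True)
```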
Picking a path¶
| Goal | Path |
|---|---|
| Track training signal as a loss curve | Path 1 (in-training) |
| Loss on a specific held-out set for a specific checkpoint | Path 2 (`scripts/eval.py`) |
| Downstream accuracy — HellaSwag / MMLU / ARC | Path 3 (`scripts/eval_harness.py`) |
| Quick sanity check on small datasets | Path 2 |
| Scanning 20 checkpoints for the best MMLU | Path 3 with pre-conversion |
The three paths don’t overlap in outputs: Paths 1 and 2 report `eval/loss`
and `eval/perplexity`; Path 3 reports task-specific metrics (accuracy, `acc_norm`, pass@k). You’ll often run both: loss for the monitoring signal, the harness for final reporting.
See also¶
- Training § Evaluation — `run_eval` implementation notes.
- Configuration § `[eval]` — every `EvalConfig` field with its default.
- Checkpointing § DCP model format — what the DCP → HF converter reads.
- End-to-end training run — the source of the checkpoints the three paths consume.
- `scripts/eval.py` and `scripts/eval_harness.py` — the scripts this page documents.