Run evaluation

Three entry points, three use cases:

| Entry point | When to use |
|---|---|
| [eval] section in training config | Periodic eval loss during a training run |
| scripts/eval.py | Standalone eval on a saved checkpoint; quick loss/perplexity on new data |
| scripts/eval_harness.py | Downstream benchmarks (HellaSwag, ARC, MMLU, …) via lm-eval-harness |

All three compute through the same underlying run_eval function when they need loss; the harness path additionally converts the DCP checkpoint to HuggingFace format and hands off to lm_eval.

Path 1: eval inside the training loop

Flip on [eval] in your TOML:

[eval]
enabled              = true
interval             = 500          # steps between evals
steps                = 50           # batches per eval
dataset_path         = "/data/eval_set"
file_pattern         = "*.bin"
# OR — HuggingFace eval data:
# hf_dataset_name    = "wikitext"
# hf_dataset_config  = "wikitext-103-raw-v1"
# hf_dataset_split   = "validation"

Every interval steps, the training loop pauses the optimizer, switches the model to eval mode, iterates steps batches of held-out data, and logs eval/loss and eval/perplexity through MetricsTracker.log_eval. The stdout backend renders one compact line per call:

[step 500] eval/loss=3.8200 | eval/perplexity=45.6000
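
In sketch form, the hook in the train loop looks like this (the [eval] field names, run_eval, and MetricsTracker.log_eval come from above; cfg, eval_dataloader, metrics_tracker, and the exact call shapes are assumptions):

if cfg.eval.enabled and step % cfg.eval.interval == 0:
    model.eval()
    metrics = run_eval(model, eval_dataloader, loss_fn, device,
                       eval_steps=cfg.eval.steps)
    metrics_tracker.log_eval(step, metrics)  # one compact stdout line per call
    model.train()                            # resume training mode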

The eval dataloader is a plain DataLoader with a DistributedSampler (not a StatefulDataLoader). It is deterministic because it doesn't shuffle, and it is allowed to reset at epoch boundaries: if steps exceeds the dataset length, eval iteration catches StopIteration and re-initializes the iterator to wrap around.
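
A minimal sketch of that construction and wrap-around (eval_dataset, batch_size, and eval_steps are assumed names):

from torch.utils.data import DataLoader, DistributedSampler

# Plain DataLoader + DistributedSampler: fixed order, no shuffling.
sampler = DistributedSampler(eval_dataset, shuffle=False)
eval_loader = DataLoader(eval_dataset, batch_size=batch_size, sampler=sampler)

it = iter(eval_loader)
for _ in range(eval_steps):
    try:
        batch = next(it)
    except StopIteration:        # dataset exhausted before eval_steps batches
        it = iter(eval_loader)   # re-init and wrap around
        batch = next(it)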

Perplexity is capped

{"eval/loss": avg_loss, "eval/perplexity": math.exp(min(avg_loss, 20.0))}

exp(20) ≈ 4.85e8. When the model hasn't converged yet and the loss is huge, exp would overflow, so the clamp keeps the number finite. Treat a perplexity near 5e8 as "loss is still blowing up," not as a real perplexity.

PP integration

run_eval auto-detects pipeline parallelism via its pp_schedule argument. On PP stages, eval runs through the same schedule.step(input, target, losses) machinery as training; the last stage broadcasts the loss back to rank 0 so logging stays consistent.
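
A sketch of that flow, following the description above (the stage predicates, last_stage_rank, and the exact step call shapes are assumptions):

import torch
import torch.distributed as dist

losses = []  # the last stage fills this with per-microbatch losses
if is_first_stage:
    pp_schedule.step(inputs)
elif is_last_stage:
    pp_schedule.step(target=targets, losses=losses)
else:
    pp_schedule.step()

# Only the last stage holds a real loss; broadcast so rank 0 logs a consistent value.
loss = torch.stack(losses).mean() if is_last_stage else torch.zeros((), device=device)
dist.broadcast(loss, src=last_stage_rank)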

Path 2: standalone eval on a checkpoint

# Single GPU
uv run python scripts/eval.py configs/train/7b.toml \
    --checkpoint.load_path=checkpoints/7b/step_10000 \
    --eval.dataset_path=/data/eval_set \
    --eval.steps=100

# Multi-GPU (FSDP for large models)
uv run torchrun --nproc_per_node=4 scripts/eval.py configs/train/7b.toml \
    --checkpoint.load_path=checkpoints/7b/step_10000 \
    --eval.dataset_path=/data/eval_set

scripts/eval.py is the training script minus the optimizer:

  • Same load_config + init_distributed + build_parallel_model path, so FSDP / TP configurations work unchanged

  • Loads the checkpoint via CheckpointManager(...) with exclude_keys=["optimizer"] — model + RNG only

  • Runs run_eval(model, dataloader, loss_fn, device, eval_steps) and prints results to stdout + JSON
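
Under the hood, run_eval reduces to a standard no-grad loss loop; here is a sketch reconstructed from the signature above and the perplexity clamp from Path 1 (internals are assumptions):

import math
import torch

@torch.no_grad()
def run_eval(model, dataloader, loss_fn, device, eval_steps):
    model.eval()
    total = 0.0
    it = iter(dataloader)
    for _ in range(eval_steps):
        inputs, targets = next(it)  # wrap-around handling (see Path 1) omitted
        total += loss_fn(model(inputs.to(device)), targets.to(device)).item()
    avg_loss = total / eval_steps
    return {"eval/loss": avg_loss,
            "eval/perplexity": math.exp(min(avg_loss, 20.0))}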

Example output:

==================================================
Evaluation Results (step 10000)
==================================================
  eval/loss: 2.4321
  eval/perplexity: 11.3856
==================================================

{"step": 10000, "tokens_seen": 4194304000, "eval/loss": 2.4321, ...}

The JSON dump at the end is useful for scripting — redirect to a file and diff across checkpoints.
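
For instance, to pick the best of many dumps (a sketch; the results/step_*.json layout is an assumption, the keys are the ones shown above):

import json
from pathlib import Path

# Each file holds one redirected run; the JSON summary is its last line.
rows = [json.loads(p.read_text().splitlines()[-1])
        for p in sorted(Path("results").glob("step_*.json"))]
best = min(rows, key=lambda r: r["eval/loss"])
print(f"best: step {best['step']}, eval/loss={best['eval/loss']:.4f}")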

HuggingFace eval data path

uv run python scripts/eval.py configs/train/7b.toml \
    --checkpoint.load_path=checkpoints/7b/step_10000 \
    --eval.hf_dataset_name=wikitext \
    --eval.hf_dataset_config=wikitext-103-raw-v1

Rank 0 tokenizes the full eval split (via HuggingFaceDataset) and broadcasts packed sequences to all other ranks — fine for benchmark-sized eval sets, avoid for anything multi-GB.
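
The pattern is the usual object broadcast (a sketch; tokenize_and_pack is a hypothetical stand-in for what HuggingFaceDataset does on rank 0):

import torch.distributed as dist

payload = [None]
if dist.get_rank() == 0:
    payload[0] = tokenize_and_pack(eval_split)  # hypothetical rank-0 helper
dist.broadcast_object_list(payload, src=0)      # ships packed sequences to all ranks
packed_sequences = payload[0]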

Pipeline parallelism is not supported here

The standalone script’s model build goes through build_parallel_model, not the PP stage builder. If you need PP eval, drive it from the training loop (Path 1).

Path 3: lm-eval-harness for downstream tasks

The loss-based evals above are useful as a training signal; for downstream benchmarks you want lm-eval-harness.

# Install the extra (lm-eval is optional — not default dep)
uv add lm-eval

# Run default task suite: hellaswag, arc_easy, arc_challenge,
# winogrande, piqa, boolq
uv run python scripts/eval_harness.py \
    --checkpoint checkpoints/7b/step_10000 \
    --config    configs/train/7b.toml

# Specific tasks
uv run python scripts/eval_harness.py \
    --checkpoint checkpoints/7b/step_10000 \
    --config    configs/train/7b.toml \
    --tasks     hellaswag,mmlu,arc_easy

# Pre-converted HF model — skip DCP conversion
uv run python scripts/eval_harness.py \
    --hf-model ./exports/my_model \
    --tasks    hellaswag

scripts/eval_harness.py does three things:

  1. Converts the DCP checkpoint to HuggingFace format via dcp_to_hf() from scripts/convert_checkpoint.py, into a tempdir

  2. Calls lm_eval.simple_evaluate(model="hf", model_args=...) with the task list (sketched after this list)

  3. Prints results and optionally writes JSON via --output
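
Step 2 is the standard lm-eval programmatic entry point; a sketch (hf_dir stands for the tempdir from step 1, or a --hf-model path):

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={hf_dir}",
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
print(results["results"])  # per-task metric dict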

Flags

| Flag | Default | Purpose |
|---|---|---|
| --checkpoint | – | DCP checkpoint dir (will be converted) |
| --config | – | TOML config (required with --checkpoint to resolve the model architecture) |
| --hf-model | – | Pre-converted HF dir; skips conversion |
| --tasks | hellaswag,arc_easy,arc_challenge,winogrande,piqa,boolq | Comma-separated task list |
| --batch-size | 8 | Eval batch size |
| --num-fewshot | None (task default) | Override the few-shot count |
| --output | None | Save full JSON results |

Conversion caveat

DCP → HF conversion is a one-time cost per checkpoint — the full model is materialized on one device, keys are remapped, and weights are written as safetensors. Time scales with model size and filesystem throughput, so larger checkpoints on networked storage take longer. If you’re scanning many checkpoints with the harness, convert once manually and re-use the output with --hf-model:

uv run python scripts/convert_checkpoint.py dcp-to-hf \
    --dcp-dir checkpoints/7b/step_10000 \
    --hf-dir  exports/7b_step_10000 \
    --config  configs/model/llama_7b.toml

uv run python scripts/eval_harness.py \
    --hf-model exports/7b_step_10000 \
    --tasks hellaswag,mmlu

Picking a path

| Goal | Path |
|---|---|
| Track training signal as a loss curve | Path 1 (in-training) |
| Loss on a specific held-out set for a specific checkpoint | Path 2 (scripts/eval.py) |
| Downstream accuracy (HellaSwag / MMLU / ARC) | Path 3 (scripts/eval_harness.py) |
| Quick sanity check on small datasets | Path 2 |
| Scanning 20 checkpoints for the best MMLU | Path 3 with pre-conversion |

The three paths don't overlap in outputs: Paths 1 and 2 report eval/loss and eval/perplexity; Path 3 reports task-specific metrics (accuracy, acc_norm, pass@k). You'll often run both: loss for the monitoring signal, the harness for final reporting.

See also