Checkpointing¶
Distributed checkpoints via torch.distributed.checkpoint (DCP):
what’s saved, how resharding works, auto-resume rules, and
HuggingFace interchange.
At a glance¶
Every checkpoint lands in {config.checkpoint.dir}/step_{N}/ and
contains two kinds of state:
| File(s) | Contents | Format |
|---|---|---|
| DCP shards (`…`) | Model + optimizer state, one shard per rank | … |
| Train state (`…`) | `step`, `tokens_seen`, scheduler, RNG, extras (e.g. …) | … |
| … | Human-readable … | Plain JSON |
| `latest` symlink | Points at the most recent `step_{N}/` | Symlink |
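To make the layout concrete, here is a minimal sketch of a save that would produce the DCP shards above. The `torch.distributed.checkpoint` calls are real PyTorch (≥ 2.2) API; the model, optimizer, directory name, and single-rank gloo setup are illustrative stand-ins, not kempnerforge's actual code.

```python
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict

# Single-rank setup so the sketch runs on one machine; in real training this
# is the multi-rank process group and each rank writes its own shard.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)                      # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters())

# Take one step so the optimizer has state worth checkpointing.
model(torch.randn(2, 8)).sum().backward()
optimizer.step()

# get_state_dict returns FQN-keyed model/optimizer state dicts that DCP can
# reshard on load, even across a different world size.
model_sd, optim_sd = get_state_dict(model, optimizer)

step = 1000
dcp.save(
    {"model": model_sd, "optim": optim_sd},
    checkpoint_id=f"checkpoints/step_{step}",      # per-rank .distcp shards + .metadata
)
dist.destroy_process_group()
```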
Key modules¶
- `kempnerforge/checkpoint/manager.py` — `CheckpointManager.save()`/`load()`/`wait()`, `latest` symlink maintenance, retention cleanup.
- `kempnerforge/checkpoint/async_save.py` — `AsyncCheckpointer`: sync / async / pinned-memory modes.
- `kempnerforge/checkpoint/state.py` — `build_train_state`/`restore_train_state`, RNG capture.
- `kempnerforge/resilience/elastic.py` — `resolve_resume_path()` (checks the `latest` symlink and falls back to the highest `step_N`).
- `scripts/convert_checkpoint.py` — `dcp-to-hf` and `hf-to-dcp` CLI.
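The elastic resume rule is simple enough to sketch. Below is a hedged reimplementation of the behavior described for `resolve_resume_path()`, not kempnerforge's source: trust the `latest` symlink when it resolves to a directory, otherwise pick the `step_N` directory with the largest `N`.

```python
# Sketch of the documented resume rule (assumed structure, not the real code).
import re
from pathlib import Path
from typing import Optional


def resolve_resume_path(ckpt_dir: str) -> Optional[Path]:
    root = Path(ckpt_dir)

    # Fast path: a healthy `latest` symlink points at the newest checkpoint.
    latest = root / "latest"
    if latest.is_symlink() and latest.resolve().is_dir():
        return latest.resolve()

    # Fallback: scan for step_N directories and take the highest step.
    steps = [
        (int(m.group(1)), p)
        for p in root.glob("step_*")
        if p.is_dir() and (m := re.fullmatch(r"step_(\d+)", p.name))
    ]
    if not steps:
        return None  # fresh run: nothing to resume from
    return max(steps, key=lambda s: s[0])[1]
```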
Read next¶
New reader: DCP model + optimizer → Train state.
Resuming a job: Auto-resume first, then Resharding if you’re changing GPU count.
Exporting for inference or HuggingFace interchange: HF conversion.
Config knobs: Configuration § CheckpointConfig (search for
`interval`, `async_mode`, `keep_last_n`, `load_path`).
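For orientation, here is a sketch of what a `CheckpointConfig` carrying those knobs could look like. The field names come from the config reference above; the types, defaults, and mode strings are assumptions, so treat Configuration § CheckpointConfig as authoritative.

```python
# Illustrative only: knob names are from the config reference; defaults and
# the exact async_mode values are assumptions, not kempnerforge's.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfig:
    dir: str = "checkpoints"         # root; each save lands in {dir}/step_{N}/
    interval: int = 1000             # save every `interval` training steps
    async_mode: str = "async"        # sync / async / pinned-memory (async_save.py)
    keep_last_n: int = 3             # retention: prune all but the newest N
    load_path: Optional[str] = None  # explicit checkpoint to load; None = auto-resume
```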