Auto-resume¶
KempnerForge resumes a training run automatically on restart — no flag to pass, no manual path to point at. Every SLURM requeue, every preemption-and-restart, every “I killed the job and started it again” just works, as long as a checkpoint directory exists.
The resolution order¶
# kempnerforge/resilience/elastic.py (abridged — logging calls elided)
def resolve_resume_path(checkpoint_dir: str) -> Path | None:
    base = Path(checkpoint_dir)
    if not base.exists():
        return None

    # 1. latest symlink
    latest = base / "latest"
    if latest.exists():
        resolved = latest.resolve()
        if resolved.exists():
            return resolved

    # 2. highest-numbered step_N directory
    step_dirs = sorted(
        (d for d in base.iterdir()
         if d.is_dir() and d.name.startswith("step_") and d.name.split("_")[1].isdigit()),
        key=lambda d: int(d.name.split("_")[1]),
    )
    if step_dirs:
        return step_dirs[-1]
    return None
Two fallbacks:
1. latest symlink: the canonical pointer, updated atomically after every successful save (see Symlink updates below). Used in all healthy cases.
2. Highest step_N directory: a safety net. If the symlink is missing (disk corruption, manual rm, never created because no save has completed), the resolver scans for the highest-numbered step_N directory and uses that.
If neither path finds anything, the function returns None and
training starts from step 0.
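The resolution order can be exercised against a throwaway directory. The sketch below repeats the abridged resolver from the listing above so that it runs on its own; the `demo` helper and the step numbers are made up for illustration and are not part of KempnerForge.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def resolve_resume_path(checkpoint_dir: str) -> Path | None:
    """Copy of the abridged resolver, repeated so this sketch is self-contained."""
    base = Path(checkpoint_dir)
    if not base.exists():
        return None
    latest = base / "latest"
    if latest.exists():  # follows the link; False when the target is gone
        resolved = latest.resolve()
        if resolved.exists():
            return resolved
    step_dirs = sorted(
        (d for d in base.iterdir()
         if d.is_dir() and d.name.startswith("step_") and d.name.split("_")[1].isdigit()),
        key=lambda d: int(d.name.split("_")[1]),
    )
    return step_dirs[-1] if step_dirs else None


def demo() -> list[str | None]:
    """Hypothetical helper: records what the resolver returns at each stage."""
    seen: list[str | None] = []
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        seen.append(resolve_resume_path(str(base)))  # empty directory

        for step in (500, 1000, 10000):
            (base / f"step_{step}").mkdir()
        # No symlink yet: the scan compares step numbers as integers,
        # so step_10000 sorts after step_1000.
        seen.append(resolve_resume_path(str(base)).name)

        # Once a symlink exists it is used, even if it names an older step.
        (base / "latest").symlink_to("step_500")
        seen.append(resolve_resume_path(str(base)).name)
    return seen


results = demo()
print(results)  # [None, 'step_10000', 'step_500']
```

The last stage is worth noting: the symlink is trusted over the scan, so a hand-edited `latest` decides where training resumes.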
Where it’s called¶
scripts/train.py calls this once, right after creating the
CheckpointManager:
resume_path = resolve_resume_path(config.checkpoint.dir)
step, tokens_seen = 0, 0
if resume_path or config.checkpoint.load_path:
    step, tokens_seen, ckpt_extra_loaded = ckpt_mgr.load(
        path=str(resume_path) if resume_path else None,
        scheduler=scheduler,
    )
    if ckpt_extra_loaded.get("wandb_run_id"):
        config.metrics.wandb_run_id = ckpt_extra_loaded["wandb_run_id"]
So two things can trigger a resume:
- resume_path found: auto-resume; the user passed nothing.
- config.checkpoint.load_path set: an explicit starting point, used when the resolver finds no checkpoint to resume from. Useful for loading pretrained weights, fine-tuning from a specific checkpoint, or debugging.
If both are set, resume_path wins (auto-resume takes precedence
over the static config). This is deliberate: SLURM requeues should
always pick up where they left off, not re-load the initial
load_path every time.
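The precedence can be summarised as a small decision function. This is a sketch, not code from the repository: `choose_load_source` is a hypothetical helper, and it assumes that `ckpt_mgr.load` falls back to `config.checkpoint.load_path` when it is called with `path=None`, as the edge-case notes further down describe.

```python
from __future__ import annotations

from pathlib import Path


def choose_load_source(resume_path: Path | None, load_path: str | None) -> tuple[str, str | None]:
    """Hypothetical helper mirroring the branch in scripts/train.py.

    Returns (reason, effective_path). Assumption: when only load_path is set,
    the checkpoint manager loads from it after receiving path=None.
    """
    if resume_path:
        # Auto-resume wins, so a requeued job continues from its own checkpoints.
        return "auto-resume", str(resume_path)
    if load_path:
        return "load_path", load_path
    return "fresh-start", None


both = choose_load_source(Path("checkpoints/step_1000"), "pretrained/step_5000")
only_config = choose_load_source(None, "pretrained/step_5000")
neither = choose_load_source(None, None)
```

Here `both` picks the auto-resume path, `only_config` picks `load_path`, and `neither` starts from step 0.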
Symlink updates¶
Inside CheckpointManager.save (rank 0 only):
# manager.py
latest = self._latest_link() # <dir>/latest
tmp_link = latest.with_suffix(".tmp") # <dir>/latest.tmp
tmp_link.unlink(missing_ok=True)
tmp_link.symlink_to(ckpt_dir.name) # relative link: "step_1000"
tmp_link.rename(latest) # atomic rename
Two details that matter:
- Relative target: the symlink points at step_1000, not /abs/path/checkpoints/step_1000. Checkpoint directories stay portable when moved or bind-mounted.
- Atomic rename: tmp_link.rename(latest) is a single atomic syscall (rename(2) on POSIX). Either the old symlink is still there or the new one is, never a half-written state. Safe against crashes mid-save.
The symlink is updated after the DCP save completes (async save
futures are resolved before the next save starts), so latest only
ever points at a fully-flushed checkpoint.
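The swap can be tried in isolation. The sketch below wraps the same five calls in a hypothetical free function, `point_latest_at`, and runs it twice in a temporary directory; it assumes a POSIX filesystem where `rename` replaces an existing symlink.

```python
import os
import tempfile
from pathlib import Path


def point_latest_at(base: Path, step_dir_name: str) -> None:
    """Hypothetical standalone version of the symlink update shown above."""
    latest = base / "latest"
    tmp_link = latest.with_suffix(".tmp")  # <base>/latest.tmp
    tmp_link.unlink(missing_ok=True)       # clear a leftover from an interrupted update
    tmp_link.symlink_to(step_dir_name)     # relative target, e.g. "step_1000"
    tmp_link.rename(latest)                # replaces the old link in one rename(2) call


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "step_1000").mkdir()
    (base / "step_2000").mkdir()

    point_latest_at(base, "step_1000")
    first_target = os.readlink(base / "latest")

    point_latest_at(base, "step_2000")
    second_target = os.readlink(base / "latest")

    # The temporary link is consumed by the rename, so nothing is left behind.
    leftover_tmp = (base / "latest.tmp").is_symlink()
    resolved_name = (base / "latest").resolve().name

print(first_target, second_target, leftover_tmp, resolved_name)
```

`os.readlink` returns the stored target verbatim, which is why it shows the relative name `step_2000` rather than an absolute path.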
Retention cleanup¶
After updating the symlink, CheckpointManager._cleanup() trims the
oldest checkpoints beyond config.checkpoint.keep_last_n:
ckpt_dirs = sorted((d for d in self.base_dir.iterdir()
                    if d.is_dir() and d.name.startswith("step_")),
                   key=lambda d: int(d.name.split("_")[1]))
to_remove = ckpt_dirs[:-keep] if len(ckpt_dirs) > keep else []
for d in to_remove:
    shutil.rmtree(d)
Default keep_last_n = 3. The latest symlink always points at the
newest, never at one scheduled for removal. If you want to keep
everything, set keep_last_n to a large number — there’s no
“disable cleanup” flag (and the __post_init__ check requires
keep_last_n >= 1).
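To see which directories the slice selects, the same logic can be applied to plain names instead of paths. `select_for_removal` is a hypothetical helper written for this illustration; nothing is deleted.

```python
def select_for_removal(dir_names: list[str], keep: int) -> list[str]:
    """Hypothetical pure-function version of the retention rule shown above."""
    step_dirs = sorted(
        (name for name in dir_names if name.startswith("step_")),
        key=lambda name: int(name.split("_")[1]),  # numeric, not lexicographic
    )
    return step_dirs[:-keep] if len(step_dirs) > keep else []


names = ["step_1000", "step_200", "step_3000", "latest", "step_10000", "step_500"]

removed_default = select_for_removal(names, keep=3)
removed_large = select_for_removal(names, keep=1000)

print(removed_default)  # ['step_200', 'step_500']
print(removed_large)    # []
```

With `keep=3` the three highest steps survive; a very large `keep` removes nothing, which is the "keep everything" setting described above.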
Edge cases¶
- Empty checkpoint directory: resolve_resume_path returns None, training starts from step 0. No error.
- latest symlink points at a removed directory: the resolver checks resolved.exists() after following the link; if it doesn’t, it falls through to the highest-step_N scan. This catches the case where someone rm -rf’d a checkpoint but left the symlink.
- Corrupted step_N directory: resolve_resume_path doesn’t verify the contents. If DCP can’t load from the resolved path, ckpt_mgr.load raises and training aborts. In practice, async saves either complete or leave a truncated directory that DCP detects and errors on clearly.
- Fresh checkpoint dir, load_path set: load_path is used (the else branch). Training starts from the step in that checkpoint.
- Multi-node NFS vs local disk: the latest symlink lives on disk along with the shards. For a shared filesystem (Lustre, NFS) this just works. For local scratch, every node needs its own copy of the checkpoint directory, which KempnerForge does not manage automatically; use a shared filesystem.
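The dangling-symlink case rests on how `pathlib` treats links: `exists()` and `is_dir()` follow the link, while `is_symlink()` inspects the link itself. A minimal sketch in a temporary directory, with made-up step numbers:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "step_1000").mkdir()
    (base / "step_2000").mkdir()
    latest = base / "latest"
    latest.symlink_to("step_2000")

    healthy = latest.exists()            # target present

    (base / "step_2000").rmdir()         # simulate someone removing the checkpoint
    link_still_there = latest.is_symlink()
    target_reachable = latest.exists()   # follows the link, so now False

    # The fallback scan skips the dangling link because is_dir() also follows it.
    survivors = sorted(
        (d.name for d in base.iterdir() if d.is_dir() and d.name.startswith("step_")),
        key=lambda name: int(name.split("_")[1]),
    )

print(healthy, link_still_there, target_reachable, survivors)
```

After the removal, the link is still on disk but no longer counts as existing, so a resolver written like the one above would land on `step_1000`.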
Loading a specific step¶
To rewind to an earlier checkpoint manually:
# Re-point the latest symlink at step_5000 so resolve_resume_path picks it up
cd checkpoints
rm latest
ln -s step_5000 latest
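Or set load_path explicitly:
[checkpoint]
load_path = "checkpoints/step_5000"
load_path is consulted only when auto-resume finds nothing, so it takes effect only if config.checkpoint.dir contains no latest symlink and no step_N directories. SLURM requeues still try auto-resume first: once the new run has saved a checkpoint of its own, they resume from that instead of re-loading load_path.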
Resilience interaction¶
Auto-resume pairs with the SLURM preemption handler in
kempnerforge/resilience/:
1. SLURM sends SIGTERM on preemption.
2. SignalHandler flags shutdown; the training loop finishes the current step and saves an emergency checkpoint.
3. ckpt_mgr.wait() flushes any async save before destroy_distributed().
4. SLURM requeues the job.
5. New job starts, resolve_resume_path finds latest, training resumes from the emergency checkpoint.
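The first two steps follow a common pattern: the signal handler only sets a flag, and the loop checks it at a step boundary. The sketch below illustrates that pattern with a hypothetical `ShutdownFlag` class and a toy loop. It is not KempnerForge's `SignalHandler`, whose interface is not shown here, and it simulates the signal by calling the handler directly instead of installing it.

```python
import signal


class ShutdownFlag:
    """Hypothetical stand-in for a preemption handler (not KempnerForge's SignalHandler)."""

    def __init__(self) -> None:
        self.requested = False

    def install(self) -> None:
        # In a real job this is called once from the main thread at startup.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        # Do no work here; just record the request for the training loop.
        self.requested = True


def train(flag: ShutdownFlag, max_steps: int, preempt_at=None):
    """Toy loop: finish the current step, then save once and stop if flagged."""
    emergency_saves = []
    step = 0
    while step < max_steps:
        step += 1  # stand-in for one full training step
        if preempt_at == step:
            flag._handle(signal.SIGTERM, None)  # simulate SIGTERM arriving mid-step
        if flag.requested:
            emergency_saves.append(step)  # stand-in for the emergency checkpoint
            break
    return step, emergency_saves


preempted = train(ShutdownFlag(), max_steps=100, preempt_at=7)
uninterrupted = train(ShutdownFlag(), max_steps=100)

print(preempted)      # (7, [7])
print(uninterrupted)  # (100, [])
```

Checking the flag only after a step completes is what keeps the emergency checkpoint consistent: the saved state always corresponds to a whole step.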
See also¶
- DCP model + optimizer: the checkpoint format that ckpt_mgr.load consumes.
- Train state: what gets restored alongside the model weights.
- Resharding: what auto-resume does when the GPU count changes between save and load.
- Configuration § CheckpointConfig: load_path, keep_last_n, dir.
- Resilience: the preemption handler that drives the emergency save.