kempnerforge.checkpoint.async_save
Async checkpointing for non-blocking saves.
Uses dcp.async_save() to snapshot state to CPU and write to disk
in the background, returning control to the training loop immediately.
- Modes:
disabled: Synchronous save (simple, for debugging).
async: Standard async via dcp.async_save().
async_with_pinned_mem: Async with pinned memory staging for faster GPU→CPU.
Classes
AsyncCheckpointer | Non-blocking checkpoint saver.
- class kempnerforge.checkpoint.async_save.AsyncCheckpointer
Bases: object
Non-blocking checkpoint saver.
Wraps dcp.async_save() and manages the background save future. Each new save waits for the previous async save to complete first.
- Parameters:
mode – Checkpoint mode (disabled/async/async_with_pinned_mem).
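The "each new save waits for the previous one" behaviour can be illustrated with a minimal sketch that uses only the standard library. This is not the AsyncCheckpointer implementation: the SerializedSaver class, its wait() method, and the sleeping _write() helper are hypothetical stand-ins, with a thread-pool future playing the role of the future returned by dcp.async_save().

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional


class SerializedSaver:
    """Hypothetical stand-in: keeps at most one background save in flight."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None
        self.completed: List[str] = []

    def _write(self, checkpoint_id: str) -> None:
        # Stands in for the background disk write.
        time.sleep(0.05)
        self.completed.append(checkpoint_id)

    def wait(self) -> None:
        # Block until the in-flight save (if any) finishes.
        # result() re-raises any exception from the background write.
        if self._pending is not None:
            self._pending.result()
            self._pending = None

    def save(self, checkpoint_id: str) -> None:
        # Drain the previous save first, then return as soon as the
        # new write has been handed to the background thread.
        self.wait()
        self._pending = self._pool.submit(self._write, checkpoint_id)


saver = SerializedSaver()
saver.save("step_100")  # returns immediately
saver.save("step_200")  # waits for step_100, then returns
saver.wait()            # drain before exiting
print(saver.completed)
```

Draining the previous future before starting a new one keeps a single write in flight, so a slow disk delays the next save call rather than letting writes pile up.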
- __init__(mode=AsyncCheckpointMode.disabled)
- Parameters:
mode (AsyncCheckpointMode)
- Return type:
None
- save(state_dict, checkpoint_id, process_group=None)
Save distributed state, potentially asynchronously.
- Parameters:
state_dict (dict) – DCP-compatible state dict (model + optimizer).
checkpoint_id (str) – Checkpoint directory path.
process_group – Process group for DCP. Required for pipeline parallelism (PP), where each stage has a different state dict: pass a group scoped to the ranks within the same PP stage. None uses the default global group.
- Return type:
None
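As a rough illustration of what "a group scoped to ranks within the same PP stage" could look like, the sketch below computes the rank lists per stage in plain Python. The helper name ranks_by_pp_stage and the rank layout (pipeline parallelism as the outermost dimension, so each stage owns a contiguous block of ranks) are assumptions made for the example; the actual layout depends on how the job builds its device mesh.

```python
from typing import List


def ranks_by_pp_stage(world_size: int, pp_size: int) -> List[List[int]]:
    """Group global ranks by pipeline stage (hypothetical helper).

    Assumes PP is the outermost parallel dimension, so stage s owns the
    contiguous ranks [s * per_stage, (s + 1) * per_stage).
    """
    if pp_size <= 0 or world_size % pp_size != 0:
        raise ValueError("world_size must be a positive multiple of pp_size")
    per_stage = world_size // pp_size
    return [
        list(range(stage * per_stage, (stage + 1) * per_stage))
        for stage in range(pp_size)
    ]


groups = ranks_by_pp_stage(world_size=8, pp_size=2)
print(groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In a real job, each rank list would typically be turned into a process group with torch.distributed.new_group(ranks=...), called on every rank for every list, and the group containing the local rank would then be passed as process_group to save().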