kempnerforge.checkpoint.async_save

Async checkpointing for non-blocking saves.

Uses dcp.async_save() to snapshot state to CPU and write to disk in the background, returning control to the training loop immediately.

Modes: disabled, async, async_with_pinned_mem.

Classes

AsyncCheckpointer – Non-blocking checkpoint saver.

class kempnerforge.checkpoint.async_save.AsyncCheckpointer

Non-blocking checkpoint saver.

Wraps dcp.async_save() and manages the background save future. Each new save waits for the previous async save to complete first.

Parameters: mode – Checkpoint mode (disabled, async, or async_with_pinned_mem).
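The "each new save waits for the previous one" behavior described above can be illustrated with a minimal sketch. This is not the AsyncCheckpointer implementation: `SerializedBackgroundSaver` and `write_fn` are hypothetical names, and a standard-library thread pool stands in for dcp.async_save() so the example runs without PyTorch.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class SerializedBackgroundSaver:
    """Hypothetical stand-in that keeps at most one save in flight."""

    def __init__(self) -> None:
        # A single worker thread plays the role of the background writer.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def save(self, write_fn: Callable[[], None]) -> None:
        # Drain the previous save before launching the next one.
        self.wait()
        self._pending = self._pool.submit(write_fn)

    def wait(self) -> None:
        # Block until the pending save (if any) completes.
        if self._pending is not None:
            # result() re-raises any exception from the background write.
            self._pending.result()
            self._pending = None

    def close(self) -> None:
        self.wait()
        self._pool.shutdown()


written = []
saver = SerializedBackgroundSaver()
saver.save(lambda: written.append("step_100"))  # returns without waiting
saver.save(lambda: written.append("step_200"))  # waits for step_100 first
saver.wait()
saver.close()
```

Holding a single pending future bounds the number of in-flight snapshots to one and lets an error from a background write surface on the next save or wait call.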

__init__(mode=AsyncCheckpointMode.disabled)

save(state_dict, checkpoint_id, process_group=None)

Save distributed state, potentially asynchronously.

Parameters:

state_dict (dict) – DCP-compatible state dict (model + optimizer).
checkpoint_id (str) – Checkpoint directory path.
process_group – Process group for DCP. Required for pipeline parallelism (PP), where each stage has a different state dict: pass a group scoped to the ranks within the same PP stage. None uses the default global group.

Return type:

None
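As a sketch of what "a group scoped to ranks within the same PP stage" might look like, the hypothetical helper below partitions global ranks by stage. It assumes PP is the outermost parallel dimension, so each stage owns a contiguous block of ranks; the actual rank layout depends on how the device mesh is built and may differ.

```python
def ranks_by_pp_stage(world_size: int, pp_size: int) -> list[list[int]]:
    """Hypothetical helper: list the global ranks belonging to each PP stage.

    Assumption: PP is the outermost mesh dimension, so stage s owns the
    contiguous ranks [s * per_stage, (s + 1) * per_stage).
    """
    if pp_size <= 0 or world_size % pp_size != 0:
        raise ValueError("world_size must be a positive multiple of pp_size")
    per_stage = world_size // pp_size
    return [
        list(range(stage * per_stage, (stage + 1) * per_stage))
        for stage in range(pp_size)
    ]


# Example: 8 ranks split across 2 pipeline stages.
stage_ranks = ranks_by_pp_stage(world_size=8, pp_size=2)
```

Each rank list could then be turned into a process group with torch.distributed.new_group(ranks=...), and the group containing the current rank passed as process_group.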

Block until any pending async save completes.