kempnerforge.checkpoint.async_save

Async checkpointing for non-blocking saves.

Uses dcp.async_save() (PyTorch Distributed Checkpoint) to snapshot state to CPU and write it to disk in the background, so the training loop regains control without waiting for the disk write.

Modes:
  • disabled: Synchronous, blocking save (the simplest option; useful for debugging).

  • async: Standard async via dcp.async_save().

  • async_with_pinned_mem: Async with pinned-memory staging for faster GPU→CPU transfers.
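
The three modes above could be selected from a configuration string along the following lines. This is only a sketch: the `Mode` enum is a hypothetical stand-in for AsyncCheckpointMode (only the mode names come from this page; the member names and string values are assumptions), and `parse_mode` and `runs_in_background` are illustrative helpers, not part of kempnerforge.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for AsyncCheckpointMode (values are assumed)."""

    DISABLED = "disabled"
    ASYNC = "async"
    ASYNC_WITH_PINNED_MEM = "async_with_pinned_mem"


def parse_mode(text: str) -> Mode:
    """Map a config string to a mode, rejecting unknown values early."""
    try:
        return Mode(text.strip().lower())
    except ValueError:
        valid = ", ".join(m.value for m in Mode)
        raise ValueError(
            f"unknown checkpoint mode {text!r}; expected one of: {valid}"
        ) from None


def runs_in_background(mode: Mode) -> bool:
    """Only the disabled mode performs a synchronous save."""
    return mode is not Mode.DISABLED
```

Validating the string once at startup means a typo in the config fails immediately rather than at the first checkpoint.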

Classes

AsyncCheckpointer

Non-blocking checkpoint saver.

class kempnerforge.checkpoint.async_save.AsyncCheckpointer

Bases: object

Wraps dcp.async_save() and manages the future for the background save. Each new save first waits for the previous async save to complete, so at most one save is in flight at a time.

Parameters:

mode – Checkpoint mode: disabled, async, or async_with_pinned_mem. Defaults to disabled.

__init__(mode=AsyncCheckpointMode.disabled)
Parameters:

mode (AsyncCheckpointMode)

Return type:

None

save(state_dict, checkpoint_id, process_group=None)

Save distributed state, asynchronously unless the mode is disabled.

Parameters:
  • state_dict (dict) – DCP-compatible state dict (model + optimizer).

  • checkpoint_id (str) – Checkpoint directory path.

  • process_group – Process group for DCP. Required for pipeline parallelism (PP), where each stage has a different state dict: pass a group scoped to the ranks within the same PP stage. None uses the default global group.

Return type:

None
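
To make the process_group parameter more concrete, the sketch below computes which ranks would share a group for each PP stage. It assumes a rank layout that this page does not specify: the PP stage is the outermost dimension, so each stage owns a contiguous block of world_size / pp_size ranks. The helper `pp_stage_ranks` is hypothetical and uses no distributed runtime.

```python
from typing import List


def pp_stage_ranks(world_size: int, pp_size: int) -> List[List[int]]:
    """Return one rank list per PP stage.

    Assumes PP is the outermost mesh dimension, so stage s owns the
    contiguous ranks [s * per_stage, (s + 1) * per_stage).
    """
    if world_size <= 0 or pp_size <= 0 or world_size % pp_size != 0:
        raise ValueError("world_size must be a positive multiple of pp_size")
    per_stage = world_size // pp_size
    return [
        list(range(stage * per_stage, (stage + 1) * per_stage))
        for stage in range(pp_size)
    ]


# Example: 8 ranks split into 2 pipeline stages.
groups = pp_stage_ranks(world_size=8, pp_size=2)
```

In a real job, each rank list could be turned into a group with torch.distributed.new_group(ranks=...), which every rank must call for every group in the same order. The group containing the local rank would then be passed as process_group.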

wait()

Block until any pending async save completes.

Return type:

None

property is_pending: bool

Whether an async save is still in progress.
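
The save / wait / is_pending contract described above can be illustrated without PyTorch. The following is a minimal sketch, not the AsyncCheckpointer implementation: `SingleFlightSaver` is a hypothetical stand-in that uses a single worker thread in place of dcp.async_save(), and a shallow dict copy in place of CPU staging.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple


class SingleFlightSaver:
    """Hypothetical stand-in mimicking the save/wait/is_pending contract."""

    def __init__(self, write_fn: Callable[[dict, str], None]) -> None:
        self._write_fn = write_fn
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._future: Optional[Future] = None

    @property
    def is_pending(self) -> bool:
        return self._future is not None and not self._future.done()

    def wait(self) -> None:
        # Clear the handle first so a failed save is not waited on twice.
        future, self._future = self._future, None
        if future is not None:
            future.result()  # re-raises any error from the background write

    def save(self, state_dict: dict, checkpoint_id: str) -> None:
        self.wait()  # the previous save must finish before a new one starts
        snapshot = dict(state_dict)  # shallow copy stands in for CPU staging
        self._future = self._pool.submit(self._write_fn, snapshot, checkpoint_id)

    def close(self) -> None:
        self.wait()
        self._pool.shutdown()


def run_demo() -> Tuple[List[Tuple[str, int]], bool]:
    written: List[Tuple[str, int]] = []

    def slow_write(state: dict, checkpoint_id: str) -> None:
        time.sleep(0.05)  # simulate a slow disk write
        written.append((checkpoint_id, state["step"]))

    saver = SingleFlightSaver(slow_write)
    state = {"step": 0}
    for step in range(1, 4):
        state["step"] = step  # the training loop keeps mutating its state
        saver.save(state, f"step_{step}")
    saver.wait()  # drain the last save before exiting
    pending_after_wait = saver.is_pending
    saver.close()
    return written, pending_after_wait
```

Two points carry over to real usage: call wait() before the process exits so the final checkpoint is fully written, and expect errors from a background save to surface at the next wait() or save() rather than at the call that started it (this last behavior is a property of the sketch and an assumption about the real class).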