kempnerforge.resilience.elastic

Elastic training and SLURM integration helpers.

Provides utilities for training jobs that may be preempted, requeued, or restarted with a different number of nodes:

  • SLURM job info detection
  • Requeue detection
  • Auto-resume path resolution

Functions

get_slurm_info()

Read SLURM job information from environment variables.

is_slurm_job()

Check if we are running under SLURM.

is_slurm_requeue()

Check if this is a requeued SLURM job.

log_job_info()

Log SLURM job information (if running under SLURM).

resolve_resume_path(checkpoint_dir)

Find the latest checkpoint for auto-resume.

Classes

SLURMInfo

Information about the current SLURM job.

class kempnerforge.resilience.elastic.SLURMInfo

Bases: object

Information about the current SLURM job.

job_id: str
job_name: str
node_list: str
num_nodes: int
ntasks_per_node: int
restart_count: int
partition: str
array_task_id: str | None
property is_requeued: bool

Whether this job has been requeued (restart_count > 0).

__init__(job_id, job_name, node_list, num_nodes, ntasks_per_node, restart_count, partition, array_task_id)
Parameters:
  • job_id (str)

  • job_name (str)

  • node_list (str)

  • num_nodes (int)

  • ntasks_per_node (int)

  • restart_count (int)

  • partition (str)

  • array_task_id (str | None)

Return type:

None
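
As an illustration of the fields and the is_requeued property listed above, the following is a minimal sketch of a comparable dataclass. SLURMInfoSketch is a hypothetical stand-in written for this example, not the library's own class; only the field names, their types, and the restart_count > 0 rule are taken from the reference above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SLURMInfoSketch:
    """Hypothetical stand-in mirroring the documented SLURMInfo fields."""

    job_id: str
    job_name: str
    node_list: str
    num_nodes: int
    ntasks_per_node: int
    restart_count: int
    partition: str
    array_task_id: str | None

    @property
    def is_requeued(self) -> bool:
        # Documented rule: a job counts as requeued when restart_count > 0.
        return self.restart_count > 0


# Example values are made up for illustration.
info = SLURMInfoSketch(
    job_id="12345",
    job_name="train",
    node_list="node[01-02]",
    num_nodes=2,
    ntasks_per_node=4,
    restart_count=1,
    partition="gpu",
    array_task_id=None,
)
first_run = SLURMInfoSketch("12346", "train", "node01", 1, 4, 0, "gpu", None)
```

Here info.is_requeued evaluates to True and first_run.is_requeued to False, since only the first has a non-zero restart count.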

kempnerforge.resilience.elastic.get_slurm_info()

Read SLURM job information from environment variables.

Returns:

SLURMInfo if running under SLURM, None otherwise.

Return type:

SLURMInfo | None
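
The reference does not say which environment variables are read, so the sketch below is an assumption about how such a lookup might work. It uses standard SLURM variable names (SLURM_JOB_ID, SLURM_JOB_NODELIST, and so on), treats a missing SLURM_JOB_ID as "not running under SLURM", and returns a plain dict rather than a SLURMInfo. The function name read_slurm_env, the choice of variables, and the fallback defaults are all hypothetical.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional


def read_slurm_env(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Hypothetical helper: collect SLURM job fields from an environment mapping."""
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        # Assumption: no job ID means the process was not launched by SLURM.
        return None
    return {
        "job_id": job_id,
        "job_name": env.get("SLURM_JOB_NAME", ""),
        "node_list": env.get("SLURM_JOB_NODELIST", ""),
        "num_nodes": int(env.get("SLURM_JOB_NUM_NODES", "1")),
        "ntasks_per_node": int(env.get("SLURM_NTASKS_PER_NODE", "1")),
        "restart_count": int(env.get("SLURM_RESTART_COUNT", "0")),
        "partition": env.get("SLURM_JOB_PARTITION", ""),
        # Only present for array jobs, so it stays None otherwise.
        "array_task_id": env.get("SLURM_ARRAY_TASK_ID"),
    }


# A fake environment keeps the example runnable outside a cluster.
fake_env = {
    "SLURM_JOB_ID": "98765",
    "SLURM_JOB_NUM_NODES": "4",
    "SLURM_RESTART_COUNT": "2",
}
fields = read_slurm_env(fake_env)
```

Passing the environment as a parameter is a design choice for this sketch only: it makes the lookup easy to exercise with a fake mapping, while the default still reads the real process environment.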

kempnerforge.resilience.elastic.is_slurm_job()

Check if we are running under SLURM.

Return type:

bool

kempnerforge.resilience.elastic.is_slurm_requeue()

Check if this is a requeued SLURM job.

Uses the SLURM_RESTART_COUNT environment variable, which SLURM sets when a job is requeued.

Return type:

bool
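
A requeue check based on SLURM_RESTART_COUNT could look roughly like the sketch below. The function name is_requeue_sketch is hypothetical, and treating a missing or non-numeric value as "not requeued" is an assumption of this example rather than documented behavior.

```python
import os
from typing import Mapping


def is_requeue_sketch(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical check: True when SLURM_RESTART_COUNT parses to a value > 0."""
    raw = env.get("SLURM_RESTART_COUNT", "0")
    try:
        return int(raw) > 0
    except ValueError:
        # Assumption: an unparsable value is treated as "not requeued".
        return False


fresh = is_requeue_sketch({})
requeued = is_requeue_sketch({"SLURM_RESTART_COUNT": "3"})
malformed = is_requeue_sketch({"SLURM_RESTART_COUNT": "n/a"})
```

In this sketch fresh and malformed are False and requeued is True.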

kempnerforge.resilience.elastic.resolve_resume_path(checkpoint_dir)

Find the latest checkpoint for auto-resume.

Checks, in order:
  1. {checkpoint_dir}/latest symlink
  2. The step_N directory with the highest step number

Parameters:

checkpoint_dir (str) – Base checkpoint directory.

Returns:

Path to the latest checkpoint, or None if none found.

Return type:

Path | None
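
The two-step lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: resolve_resume_path_sketch is a hypothetical name, the step directories are assumed to match step_<digits> exactly, and a dangling latest symlink is assumed to fall through to the directory scan.

```python
from __future__ import annotations

import re
import tempfile
from pathlib import Path

_STEP_RE = re.compile(r"^step_(\d+)$")


def resolve_resume_path_sketch(checkpoint_dir: str) -> Path | None:
    """Hypothetical lookup: prefer the 'latest' symlink, else the highest step_N."""
    base = Path(checkpoint_dir)
    if not base.is_dir():
        return None

    # 1. Use {checkpoint_dir}/latest if it is a symlink whose target exists.
    latest = base / "latest"
    if latest.is_symlink() and latest.exists():
        return latest.resolve()

    # 2. Otherwise pick the step_N directory with the largest N,
    #    comparing N as an integer so step_1000 beats step_200.
    steps = []
    for child in base.iterdir():
        match = _STEP_RE.match(child.name)
        if match and child.is_dir():
            steps.append((int(match.group(1)), child))
    if not steps:
        return None
    return max(steps, key=lambda pair: pair[0])[1]


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = resolve_resume_path_sketch(tmp)
    (Path(tmp) / "step_200").mkdir()
    (Path(tmp) / "step_1000").mkdir()
    by_step_name = resolve_resume_path_sketch(tmp).name
    (Path(tmp) / "latest").symlink_to("step_200")
    by_symlink_name = resolve_resume_path_sketch(tmp).name
```

Comparing step numbers as integers matters here: a plain string sort would rank step_200 above step_1000. Once the latest symlink exists, it takes precedence over the scan even when it points at an older step.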

kempnerforge.resilience.elastic.log_job_info()

Log SLURM job information (if running under SLURM).

Return type:

None