SLURM elastic helpers

kempnerforge/resilience/elastic.py is a small pile of SLURM-environment helpers. Most of these are read-only introspection — they pull variables out of the environment and return typed data. The one with teeth is resolve_resume_path, documented on its own page; this page covers the rest.

SLURMInfo and get_slurm_info

from dataclasses import dataclass

@dataclass
class SLURMInfo:
    job_id: str
    job_name: str
    node_list: str
    num_nodes: int
    ntasks_per_node: int
    restart_count: int
    partition: str
    array_task_id: str | None

    @property
    def is_requeued(self) -> bool:
        return self.restart_count > 0

get_slurm_info() reads these from env vars and returns None if not running under SLURM (i.e. SLURM_JOB_ID is unset):

# kempnerforge/resilience/elastic.py — get_slurm_info
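# (a guard earlier in the function returns None when SLURM_JOB_ID is unset)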
SLURMInfo(
    job_id          = os.environ["SLURM_JOB_ID"],
    job_name        = os.environ.get("SLURM_JOB_NAME", ""),
    node_list       = os.environ.get("SLURM_JOB_NODELIST", ""),
    num_nodes       = int(os.environ.get("SLURM_NNODES", "1")),
    ntasks_per_node = int(os.environ.get("SLURM_NTASKS_PER_NODE", "1")),
    restart_count   = int(os.environ.get("SLURM_RESTART_COUNT", "0")),
    partition       = os.environ.get("SLURM_JOB_PARTITION", ""),
    array_task_id   = os.environ.get("SLURM_ARRAY_TASK_ID"),
)

Use it when you want to branch on SLURM-only behavior:

import logging

from kempnerforge.resilience import get_slurm_info

logger = logging.getLogger(__name__)

info = get_slurm_info()
if info is not None and info.is_requeued:
    logger.info(f"Detected requeue #{info.restart_count}")

is_slurm_job / is_slurm_requeue

Convenience booleans for the same data:

is_slurm_job()     # "SLURM_JOB_ID" in os.environ
is_slurm_requeue() # int(os.environ.get("SLURM_RESTART_COUNT", "0")) > 0

Neither calls get_slurm_info — they just check the relevant env var directly, so they’re cheap to call frequently.
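For example, as a guard you can afford on a hot path (a minimal usage sketch; the run_tag variable is just illustrative):

import logging
import os

from kempnerforge.resilience import is_slurm_job, is_slurm_requeue

logger = logging.getLogger(__name__)

if is_slurm_job():
    # safe lookup: is_slurm_job() being True means SLURM_JOB_ID is set
    run_tag = os.environ["SLURM_JOB_ID"]
if is_slurm_requeue():
    logger.info("Resuming after a SLURM requeue")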

log_job_info

Called once during training startup:

# scripts/train.py
log_job_info()

Emits one info-level log line with the SLURM job details, plus a second line if this is a requeue:

SLURM job: id=123456, name=kf-7b-adamw, nodes=4, tasks/node=4,
           partition=h200_preempt, restart_count=2
Job was requeued (restart #2) — will auto-resume

If not running under SLURM it logs "Not running under SLURM" and returns.
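Put together, the documented behavior amounts to roughly this (a sketch reconstructed from the log lines above, not the actual source):

import logging

from kempnerforge.resilience import get_slurm_info

logger = logging.getLogger(__name__)

def log_job_info_sketch() -> None:
    info = get_slurm_info()
    if info is None:
        logger.info("Not running under SLURM")
        return
    logger.info(
        f"SLURM job: id={info.job_id}, name={info.job_name}, "
        f"nodes={info.num_nodes}, tasks/node={info.ntasks_per_node}, "
        f"partition={info.partition}, restart_count={info.restart_count}"
    )
    if info.is_requeued:
        logger.info(f"Job was requeued (restart #{info.restart_count}) — will auto-resume")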

resolve_resume_path

Covered on its own page — see Checkpointing § Auto-resume. This is the function that decides whether the training run loads an existing checkpoint at startup.

Patterns these enable

“First run vs requeue” branching

info = get_slurm_info()
if info and info.is_requeued:
    # Skip LR warmup on requeue (schedule state restored from checkpoint)
    scheduler_state = "restored"
else:
    # Fresh run
    scheduler_state = "fresh"

In practice you don’t need this — scripts/train.py handles warmup restoration via the scheduler’s own state_dict. The branch is useful for one-off behaviors (e.g. “only print config diff on first run”).
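For instance, the config-diff one-off reduces to (a minimal sketch; the log message stands in for whatever one-time reporting you want):

import logging

from kempnerforge.resilience import get_slurm_info

logger = logging.getLogger(__name__)

info = get_slurm_info()
if info is None or not info.is_requeued:
    # first launch of this job (or not under SLURM at all): one-off reporting goes here
    logger.info("First run: printing config diff")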

Array job sharding

info = get_slurm_info()
if info and info.array_task_id is not None:
    shard = int(info.array_task_id)
    # Use shard to pick a config variant, data shard, etc.

scripts/train.py doesn’t read array_task_id — wire it through CLI overrides if you want per-array-task config variants.
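If you do, the launcher-side mapping is a few lines of indexing (a sketch; the variant file names are purely illustrative):

from kempnerforge.resilience import get_slurm_info

info = get_slurm_info()
shard = int(info.array_task_id) if info and info.array_task_id is not None else 0

# hypothetical: map the array task id onto one of N config variants
variants = ["variant_a.yaml", "variant_b.yaml", "variant_c.yaml"]
config_path = variants[shard % len(variants)]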

See also