kempnerforge.resilience.elastic
Elastic training and SLURM integration helpers.
Provides utilities for training jobs that may be preempted, requeued, or restarted with a different number of nodes:
SLURM job info detection
Requeue detection
Auto-resume path resolution
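As a rough illustration of what auto-resume path resolution can involve, the sketch below picks the checkpoint directory with the highest step number. The `step_<N>` directory naming and the helper name `find_latest_step_dir` are hypothetical, chosen only for this example; the actual layout and function used by this module may differ.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: one directory per checkpoint, e.g. "step_1000".
_STEP_DIR = re.compile(r"^step_(\d+)$")


def find_latest_step_dir(root: Path) -> Optional[Path]:
    """Return the step directory with the largest step number, or None."""
    if not root.is_dir():
        return None
    best = None  # (step, path) of the newest checkpoint seen so far
    for child in root.iterdir():
        match = _STEP_DIR.match(child.name)
        if match and child.is_dir():
            step = int(match.group(1))
            # Compare numerically so that step_100 beats step_20.
            if best is None or step > best[0]:
                best = (step, child)
    return None if best is None else best[1]
```

Comparing the parsed integers rather than the directory names avoids the lexicographic trap where `step_20` would sort after `step_100`.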
Functions
Read SLURM job information from environment variables.
Check if we are running under SLURM.
Check if this is a requeued SLURM job.
Log SLURM job information (if running under SLURM).
Find the latest checkpoint for auto-resume.
Classes
Information about the current SLURM job.
- class kempnerforge.resilience.elastic.SLURMInfo
Bases: object
Information about the current SLURM job.
- kempnerforge.resilience.elastic.get_slurm_info()
Read SLURM job information from environment variables.
- Returns:
SLURMInfo if running under SLURM, None otherwise.
- Return type:
SLURMInfo | None
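The "SLURMInfo or None" pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's implementation: `SlurmSnapshot`, its fields, and `read_slurm_snapshot` are hypothetical names, and the real `SLURMInfo` may carry different fields. The environment variables `SLURM_JOB_ID`, `SLURM_JOB_NUM_NODES`, and `SLURM_RESTART_COUNT` are ones SLURM sets for jobs.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SlurmSnapshot:
    """Hypothetical stand-in for SLURMInfo; the real fields may differ."""
    job_id: str
    num_nodes: int
    restart_count: int


def read_slurm_snapshot(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[SlurmSnapshot]:
    """Return a snapshot if SLURM_JOB_ID is present, otherwise None."""
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        # Assumption: no job ID means we are not running under SLURM.
        return None
    return SlurmSnapshot(
        job_id=job_id,
        num_nodes=int(env.get("SLURM_JOB_NUM_NODES", "1")),
        # Treated as 0 when the variable is absent (first run of the job).
        restart_count=int(env.get("SLURM_RESTART_COUNT", "0")),
    )
```

In this sketch a `None` result plays the role of "not a SLURM job", and a `restart_count` above zero would indicate a requeued or restarted job. Passing the environment as a parameter keeps the function easy to test without modifying `os.environ`.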
- kempnerforge.resilience.elastic.is_slurm_job()
Check if we are running under SLURM.
- Return type:
- kempnerforge.resilience.elastic.is_slurm_requeue()
Check if this is a requeued SLURM job.
Uses SLURM_RESTART_COUNT (set by SLURM on requeue).
- Return type: