# Environment variables

Every environment variable KempnerForge reads, grouped by source. Authoritative locations in the tree: `kempnerforge/distributed/setup.py`, `kempnerforge/resilience/elastic.py`, `kempnerforge/metrics/logger.py`, and the helper launch scripts under `scripts/slurm/`.
## Rank / world-size (torchrun or SLURM)

Read by `get_world_info()` in `kempnerforge/distributed/setup.py`. Each torchrun-style variable falls back to its SLURM equivalent, so the same entry point works under both launchers.
| Variable | Fallback | Purpose |
|---|---|---|
| `RANK` | `SLURM_PROCID` | Global rank (`0..world_size-1`) |
| `LOCAL_RANK` | `SLURM_LOCALID` | Rank within a node (`0..gpus_per_node-1`) |
| `WORLD_SIZE` | `SLURM_NTASKS` | Total number of ranks |
`get_world_info()` reads whichever is set and then calls `os.environ.setdefault(...)` on all three, so downstream code (PyTorch, WandB, logging) sees a torchrun-shaped environment even on `srun`-direct launches.
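The fallback-then-normalize behavior can be sketched as follows. This is a hypothetical re-implementation for illustration; the real `get_world_info()` in `kempnerforge/distributed/setup.py` may differ in details such as defaults and error handling:

```python
import os

def get_world_info():
    """Resolve rank info from torchrun vars, falling back to SLURM equivalents."""
    rank = int(os.environ.get("RANK", os.environ.get("SLURM_PROCID", "0")))
    local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0")))
    world_size = int(os.environ.get("WORLD_SIZE", os.environ.get("SLURM_NTASKS", "1")))
    # Normalize: downstream code always sees a torchrun-shaped environment,
    # even when the job was launched with `srun` directly.
    os.environ.setdefault("RANK", str(rank))
    os.environ.setdefault("LOCAL_RANK", str(local_rank))
    os.environ.setdefault("WORLD_SIZE", str(world_size))
    return rank, local_rank, world_size
```

Because `setdefault` is used for the write-back, a torchrun launch (where all three are already set) is left untouched.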
## Rendezvous (`MASTER_ADDR` / `MASTER_PORT`)
| Variable | Set by | Notes |
|---|---|---|
| `MASTER_ADDR` | user / launch script / `init_distributed()` | If unset, `init_distributed()` fills in a value (it never overwrites one you set) |
| `MASTER_PORT` | user / launch script / `init_distributed()` | If unset, derived from `SLURM_JOB_ID` |
The SLURM launch helpers (`scripts/slurm/multinode.sh`, `_run_training.sh`) export both before `srun` so every rank agrees. `init_distributed()` only fills in missing values; it never overwrites.
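The fill-but-never-overwrite contract can be illustrated with a short sketch. The function name and the exact port-derivation formula are assumptions; the point is that deriving the port deterministically from `SLURM_JOB_ID` lets every rank compute the same value independently:

```python
import os

def ensure_rendezvous(default_addr: str = "127.0.0.1") -> None:
    """Fill MASTER_ADDR / MASTER_PORT only if missing; never overwrite."""
    os.environ.setdefault("MASTER_ADDR", default_addr)
    # Derive a port deterministically from the job id so every rank, running
    # this code independently, agrees on the same value (illustrative formula).
    job_id = int(os.environ.get("SLURM_JOB_ID", "0"))
    os.environ.setdefault("MASTER_PORT", str(20000 + job_id % 10000))
```

Exporting either variable before launch wins, because `setdefault` is a no-op when the key already exists.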
## NCCL / Gloo (auto-set by `_set_nccl_env()`)

`init_distributed()` calls `_set_nccl_env()`, which detects the IB interface (`ls /sys/class/net | grep '^ib'`) and populates defaults:
| Variable | Default | Purpose |
|---|---|---|
| `NCCL_SOCKET_IFNAME` | detected IB interface (e.g. `ib0`) | NCCL bootstrap transport |
| `GLOO_SOCKET_IFNAME` | same as above | Gloo (used by DCP async checkpoint coordination) |
| `NCCL_IB_DISABLE` | `0` | Keep InfiniBand RDMA enabled |
| `NCCL_NET_GDR_LEVEL` | | GPUDirect RDMA level; enables GPU→NIC DMA |
All four use `setdefault`, so anything you export before launch wins. The launch scripts also set `NCCL_IB_GID_INDEX=3` and `NCCL_TIMEOUT=1800`; those are shell-level conventions, not read anywhere in the Python code.
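The detect-then-`setdefault` pattern can be sketched like this. To keep the sketch testable, the interface list is passed in rather than read from `/sys/class/net`, and the GPUDirect RDMA level is omitted since its default is site-specific; the real `_set_nccl_env()` will differ accordingly:

```python
import os
from typing import Iterable, Optional

def set_nccl_defaults(interfaces: Iterable[str]) -> Optional[str]:
    """Pick the first InfiniBand interface and populate NCCL/Gloo defaults.

    `interfaces` stands in for listing /sys/class/net; everything goes
    through setdefault, so values exported before launch always win.
    """
    ib = sorted(i for i in interfaces if i.startswith("ib"))
    iface = ib[0] if ib else None
    if iface is not None:
        os.environ.setdefault("NCCL_SOCKET_IFNAME", iface)   # NCCL bootstrap
        os.environ.setdefault("GLOO_SOCKET_IFNAME", iface)   # DCP coordination
        os.environ.setdefault("NCCL_IB_DISABLE", "0")        # keep IB RDMA on
    return iface
```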
## SLURM metadata (resilience)

Read by `get_slurm_context()` and `running_under_slurm()` in `kempnerforge/resilience/elastic.py`. Purely informational: used to populate logs and decide whether a job is a restart.
| Variable | Purpose |
|---|---|
| `SLURM_JOB_ID` | Job identifier; also used to derive `MASTER_PORT` |
| `SLURM_JOB_NAME` | Job name (logged) |
| `SLURM_JOB_NODELIST` | Nodelist; parsed by `get_slurm_context()` |
| `SLURM_NNODES` | Node count (logged) |
| `SLURM_NTASKS_PER_NODE` | Tasks per node (logged) |
| `SLURM_RESTART_COUNT` | Non-zero means this is a requeue (used to detect restarts) |
| `SLURM_JOB_PARTITION` | Partition name (logged) |
| `SLURM_ARRAY_TASK_ID` | Array index when applicable (logged) |
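The two checks this section feeds can be sketched in a few lines. These are hypothetical stand-ins for the real helpers in `kempnerforge/resilience/elastic.py`, which may use different signals:

```python
import os

def running_under_slurm() -> bool:
    # Presence of a job id is the usual tell (assumption about the real check).
    return "SLURM_JOB_ID" in os.environ

def is_requeued() -> bool:
    # SLURM increments SLURM_RESTART_COUNT each time a job is requeued,
    # so a non-zero value marks this run as a restart.
    return int(os.environ.get("SLURM_RESTART_COUNT", "0")) > 0
```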
## Logging

| Variable | Read in | Purpose |
|---|---|---|
| `NO_COLOR` | `kempnerforge/metrics/logger.py` | Disables ANSI color codes in logs when set to any truthy value |
| `RANK` | same | Used to prefix log lines with the current rank |
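A minimal sketch of a formatter with these two behaviors, rank prefixing and a no-color switch, might look as follows. The function name and the cyan escape are illustrative, not the project's actual logger:

```python
import logging
import os

def make_log_formatter() -> logging.Formatter:
    """Build a formatter that prefixes the rank and honors a no-color flag."""
    rank = os.environ.get("RANK", "0")
    fmt = f"[rank {rank}] %(levelname)s: %(message)s"
    # Any truthy value disables ANSI color codes; otherwise wrap the line
    # in a simple cyan escape purely as an illustration.
    if not os.environ.get("NO_COLOR"):
        fmt = "\x1b[36m" + fmt + "\x1b[0m"
    return logging.Formatter(fmt)
```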
## User-facing launch-script variables

Read only by the SLURM helpers under `scripts/slurm/`, not by any Python module:
| Variable | Default | Purpose |
|---|---|---|
| | | Per-rank log destination for … |
| | | IB interface for the NCCL/Gloo exports in the launch scripts |
| | (unset) | Passed through in … |
## Who sets what

| Source | Variables it populates |
|---|---|
| User / job script | Optional overrides; anything exported before launch wins |
| `torchrun` | `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT` |
| SLURM | `SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS`, plus the `SLURM_*` metadata above |
| `scripts/slurm/` launch scripts | Exports `MASTER_ADDR`/`MASTER_PORT`, `NCCL_IB_GID_INDEX`, `NCCL_TIMEOUT` before `srun` |
| `init_distributed()` / `_set_nccl_env()` | Fills missing `MASTER_ADDR`/`MASTER_PORT` and the NCCL/Gloo defaults |
## See also

- SLURM launch scripts: how `multinode.sh`, `_run_training.sh`, and friends assemble these variables end-to-end.
- Architecture § One-slide overview: where `RANK`/`LOCAL_RANK`/`WORLD_SIZE` feed into the mesh construction.
- Benchmarks § MoE Expert Parallelism: where `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is load-bearing.