Parallelism recipes¶
Every recipe here corresponds to a TOML preset we’ve actually run, not
a theoretical configuration. Factors multiply to the given GPU count:
dp_replicate × dp_shard × tp × pp × cp × ep == world_size (see
Validation rules § Parallelism arithmetic).
Hardware: 4× H200 141 GB per node. Intra-node communication uses NVLink; inter-node uses InfiniBand.
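A minimal sketch of that arithmetic check, assuming illustrative keyword names rather than the exact JobConfig fields:

```python
def check_parallelism(world_size: int, *, dp_replicate=1, dp_shard=1,
                      tp=1, pp=1, cp=1, ep=1) -> None:
    """All parallelism factors must multiply to the GPU count."""
    product = dp_replicate * dp_shard * tp * pp * cp * ep
    if product != world_size:
        raise ValueError(
            f"parallelism factors multiply to {product}, "
            f"but world_size is {world_size}"
        )

# 7B on 12 GPUs: TP=4 inside each node, FSDP=3 across the three nodes.
check_parallelism(12, tp=4, dp_shard=3)
```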
7B (Llama-3)¶
| GPUs | Nodes | Parallelism | Config | When to pick this |
|---|---|---|---|---|
| 1 | 1 | single GPU | | Smoke test |
| 4 | 1 | FSDP=4 | | Single-node baseline; highest MFU |
| 12 | 3 | TP=4 × FSDP=3 | | TP within node, FSDP across — uncommon GPU count |
| 16 | 4 | FSDP=16 | | Long preemptible runs |
| 16 | 4 | FSDP=16, FP8 | | FP8 compute with bf16 masters |
| 16 | 4 | FSDP=16, Muon | | Muon + z-loss + chunked CE integration test |
| 32 | 8 | FSDP=32 | | Simple multi-node baseline |
The 7B model underutilizes 32 H200s (27.6 GB / 140 GB in the benchmark) — at that scale 13B delivers higher MFU.
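For context on the MFU figures these tables quote, here is a back-of-the-envelope estimate. It assumes the common 6 × params × tokens FLOP approximation and an assumed H200 dense bf16 peak; the example throughput below is made up for illustration, not a measured number from the benchmarks.

```python
def rough_mfu(params: float, tokens_per_s: float, n_gpus: int,
              peak_flops: float = 989e12) -> float:
    """Back-of-the-envelope MFU: achieved training FLOP/s over peak.

    Uses the ~6 * params * tokens approximation for dense transformers;
    peak_flops defaults to an assumed H200 dense bf16 peak.
    """
    achieved = 6 * params * tokens_per_s
    return achieved / (peak_flops * n_gpus)

# A hypothetical 7B run at 100k tokens/s on 32 GPUs lands around 13% MFU,
# illustrating why 7B underutilizes 32 H200s.
print(f"{rough_mfu(7e9, 100_000, 32):.1%}")
```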
13B (Llama-3)¶
| GPUs | Nodes | Parallelism | Config | When to pick this |
|---|---|---|---|---|
| 8 | 2 | FSDP=8 | (inline override) | Pure FSDP when 13B params fit in ≤90 GB/GPU |
| 24 | 6 | TP=4 × FSDP=6 | | Full validation: WandB + profiling + eval |
| 32 | 8 | TP=4 × PP=2 × FSDP=4 | | Pipeline-parallel recipe — 20 layers per stage |
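The layers-per-stage figures follow from dividing the layer count by pp. A tiny sketch, assuming an even split and the layer counts implied above (40 layers for the 13B config, 80 for 70B):

```python
def layers_per_stage(n_layers: int, pp: int) -> int:
    """Even layer split across pipeline stages; requires divisibility."""
    if n_layers % pp != 0:
        raise ValueError(f"{n_layers} layers do not split evenly over pp={pp}")
    return n_layers // pp

print(layers_per_stage(40, 2))   # 13B, PP=2: 20 layers per stage
print(layers_per_stage(80, 4))   # 70B, PP=4: 20 layers per stage
```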
29B (custom)¶
| GPUs | Nodes | Parallelism | Config | When to pick this |
|---|---|---|---|---|
| 32 | 8 | TP=4 × PP=2 × FSDP=4 | | |
70B (Llama-3)¶
| GPUs | Nodes | Parallelism | Config | When to pick this |
|---|---|---|---|---|
| 32 | 8 | TP=4 × FSDP=8 | | No PP — avoids bubble when FSDP alone fits |
| 32 | 8 | TP=4 × PP=4 × FSDP=2 | | Memory-tight alternative — 20 layers per PP stage |
70b_32gpu_tp4.toml is the preferred of the two when memory allows: the PP variant adds a pipeline bubble in exchange for less FSDP sharding work. See Validation rules § Sequence length and § Tensor-parallel head divisibility — both 70B recipes rely on n_heads=64 and n_kv_heads=8 being divisible by tp=4.
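A sketch of that divisibility check (illustrative, not the validator's actual code):

```python
def check_tp_head_divisibility(n_heads: int, n_kv_heads: int, tp: int) -> None:
    """Both query and KV heads must shard evenly across TP ranks."""
    if n_heads % tp or n_kv_heads % tp:
        raise ValueError(
            f"tp={tp} must divide n_heads={n_heads} and n_kv_heads={n_kv_heads}"
        )

# 70B: 64 query heads and 8 KV heads both divide by tp=4.
check_tp_head_divisibility(64, 8, 4)
```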
MoE (Mixtral-style, 8 experts top-2)¶
| GPUs | Nodes | Parallelism | Config | When to pick this |
|---|---|---|---|---|
| 1+ | 1+ | FSDP | | Smoke test; tiny MoE |
| 8 | 2 | TP=4 × FSDP=2 | | Real dataset, full AC, saturates 2 nodes |
| 24 | 6 | TP=4 × FSDP=6 | | |
| 32 | 8 | TP=4 × EP=2 × FSDP=4 | | Expert Parallel — all-to-all across IB |
PP=1 is mandatory for MoE (routing is data-dependent and resists
pipeline-stage splitting — see the cross-section rules in
Validation rules).
ep > 1 requires num_experts > 0 and
num_experts % ep == 0 — see
§ Expert parallel.
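A sketch of those MoE constraints, again illustrative rather than the actual validator:

```python
def check_expert_parallel(num_experts: int, ep: int, pp: int = 1) -> int:
    """Check the MoE rules described above; returns experts held per EP rank."""
    if ep > 1 and num_experts <= 0:
        raise ValueError("ep > 1 requires a MoE model (num_experts > 0)")
    if num_experts > 0 and num_experts % ep:
        raise ValueError(f"num_experts={num_experts} not divisible by ep={ep}")
    if num_experts > 0 and pp != 1:
        raise ValueError("PP must stay at 1 for MoE recipes")
    return num_experts // ep

# Mixtral-style 8 experts with EP=2: each expert-parallel rank holds 4 experts.
print(check_expert_parallel(8, 2))
```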
Choosing a parallelism combination¶
The order the cross-section validator enforces (see Parallelism order) is:
1. PP — split layers across stages; set first because it rewrites the model into stage submodules.
2. TP — shard attention heads and MLP within each PP stage; needs n_heads % tp == 0 and n_kv_heads % tp == 0.
3. CP — split along the sequence dimension (not wired up yet).
4. EP — shard experts across ranks (MoE only).
5. FSDP (dp_shard) — shard the remaining params; the “fill the rest” dimension, usually dp_shard=-1 (see the sketch after this list).
6. DP replicate (dp_replicate) — multiple full copies, typically 1 until a recipe deliberately trades memory for smaller all-reduce domains.
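A sketch of how the fill-the-rest convention resolves, assuming dp_shard=-1 simply means dividing world_size by the product of the other factors:

```python
def resolve_dp_shard(world_size: int, *, dp_replicate=1, tp=1, pp=1,
                     cp=1, ep=1) -> int:
    """Resolve the 'fill the rest' dp_shard=-1 convention."""
    other = dp_replicate * tp * pp * cp * ep
    if world_size % other:
        raise ValueError(
            f"{world_size} GPUs not divisible by the other factors ({other})"
        )
    return world_size // other

# 70B on 32 GPUs with TP=4: dp_shard=-1 resolves to 8.
print(resolve_dp_shard(32, tp=4))
```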
Rules of thumb from the benchmarks:
- FSDP-only wins when it fits. At 4 GPUs / 7B, pure FSDP beats tp=4 by ~18 MFU points; TP's all-gather/reduce-scatter on every matmul dwarfs FSDP's once-per-step gradient communication.
- TP when memory forces it or n_layers blocks FSDP. 70B at 32 GPUs needs TP — pure FSDP=32 OOMs on activation memory. TP within a node (NVLink) is cheap; across nodes (IB) is slow.
- PP when even TP+FSDP can't fit. 70B with PP=4 fits with half the FSDP shards (dp_shard=2 vs dp_shard=8) at the cost of the PP bubble; only pick it when memory requires it (see the bubble sketch after this list).
- EP when expert count outgrows a single rank. At 8 experts and 32 GPUs, EP=2 halves per-rank expert storage. At higher expert counts (64+, DeepSeekMoE-scale) EP becomes load-bearing.
See also¶
Available configs — the same recipes, indexed by filename.
Benchmarks — measured throughput and MFU for the dense 7B/13B/70B and MoE EP recipes above.
Architecture § Parallelism order — the invariants the cross-section validator enforces.
Validation rules — the precise checks JobConfig.validate(world_size) runs against these combinations.