Distributed
The parallelism families — FSDP2, tensor parallelism, expert
parallelism, pipeline parallelism, FP8 — and the DeviceMesh that
composes them.
What lives where
| Page | Covers |
|---|---|
| Device mesh | Mesh construction, dimension order, sub-mesh extraction (… |
| | Composable … |
| | All-to-all dispatch + combine, unused-expert grad kludge, EP + TP + FSDP composition, gradient scaling |
| | Layer assignment, … |
Cross-cutting references
Architecture § Parallelism order: the five-step apply sequence and the reasoning behind it.
Configuration § DistributedConfig: the dataclass with `dp_replicate`, `dp_shard`, `tp`, `pp`, `cp`, `ep`, `pp_schedule`.
Configuration § Validation rules: arithmetic checks, head divisibility, FP8 + TP, MoE + PP, tie-embeddings + PP.
Reference § Parallelism recipes: which combinations work at which scales.
Reference § Benchmarks: measured MFU and the MoE per-sub-module FSDP fix.
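To give a feel for what the configuration and validation pages describe, here is a minimal sketch of such a dataclass with two of the listed checks (an arithmetic check and head divisibility). Only the field names come from the list above. The types, defaults, the `validate` helper, the `"1f1b"` schedule value, and the choice to leave `ep` out of the world-size product are all assumptions made for illustration, not the project's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistributedConfig:
    # Field names are taken from the docs; types and defaults are assumed.
    dp_replicate: int = 1
    dp_shard: int = 1
    tp: int = 1
    pp: int = 1
    cp: int = 1
    ep: int = 1
    pp_schedule: str = "1f1b"  # hypothetical default value


def validate(cfg: DistributedConfig, world_size: int, num_heads: int) -> list[str]:
    """Return a list of human-readable problems (empty means the config passes)."""
    errors = []

    # Every parallelism degree must be a positive integer.
    for name in ("dp_replicate", "dp_shard", "tp", "pp", "cp", "ep"):
        if getattr(cfg, name) < 1:
            errors.append(f"{name} must be >= 1")
    if errors:
        return errors  # later checks divide by these degrees

    # Arithmetic check. ASSUMPTION: ep reuses ranks from other dimensions,
    # so it is not multiplied into the product here.
    product = cfg.dp_replicate * cfg.dp_shard * cfg.tp * cfg.pp * cfg.cp
    if product != world_size:
        errors.append(f"mesh product {product} != world size {world_size}")

    # Head divisibility: tensor parallelism splits attention heads across ranks.
    if num_heads % cfg.tp != 0:
        errors.append(f"num_heads {num_heads} is not divisible by tp {cfg.tp}")

    return errors
```

For example, `validate(DistributedConfig(dp_shard=4, tp=2), world_size=8, num_heads=16)` returns an empty list under these assumptions, while the same config with `world_size=16` reports a mesh-product mismatch.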
Not covered here
NCCL health / liveness: the all-reduce heartbeat and NaN detection are in Resilience.
Checkpointing under PP / TP / FSDP: mesh-scoped DCP groups and resharding are in Checkpointing.
Context parallelism: `cp` is declared in config and checked in validation but has no `apply_context_parallel` yet; see Device mesh § Context parallelism.