Distributed

The parallelism families — FSDP2, tensor parallelism, expert parallelism, and pipeline parallelism — plus FP8 training and the DeviceMesh that composes them.

What lives where

  • Device mesh — mesh construction, dimension order, sub-mesh extraction (get_dp_mesh / get_tp_mesh / get_pp_mesh), example meshes from shipped configs.

  • FSDP2 — composable fully_shard(), per-block + top-level wrap, mixed-precision policy, reshard_after_forward, EP-MoE per-sub-module wrapping, activation checkpointing modes.

  • Tensor parallelism — ColwiseParallel / RowwiseParallel / SequenceParallel plan, forward / post-hooks on embeddings + attention + MLP, meta-device init, head divisibility.

  • Expert parallelism — all-to-all dispatch + combine, unused-expert grad kludge, EP + TP + FSDP composition, gradient scaling.

  • Pipeline parallelism — layer assignment, PipelineStageModule, 1F1B / GPipe / interleaved schedules, per-stage DCP checkpointing.

  • FP8 — torchao.float8 conversion, filter rules (no experts / router), enable_fsdp_float8_all_gather, TP incompatibility.
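To make dimension order and sub-mesh extraction concrete, here is a minimal plain-Python sketch of how flat ranks map onto a 3-D mesh and which ranks a sub-mesh collects. The helpers build_mesh and sub_mesh are illustrative only — the real get_dp_mesh / get_tp_mesh / get_pp_mesh accessors wrap torch.distributed.device_mesh, as the Device mesh page describes.

```python
from itertools import product
from math import prod

def build_mesh(world_size, dim_sizes, dim_names):
    # Flat rank -> named coordinates; the LAST dim varies fastest,
    # matching the row-major layout used by init_device_mesh.
    assert world_size == prod(dim_sizes)
    return {
        rank: dict(zip(dim_names, coord))
        for rank, coord in enumerate(product(*(range(s) for s in dim_sizes)))
    }

def sub_mesh(coords, rank, dim):
    # The ranks sharing every coordinate with `rank` except `dim`:
    # exactly the group a collective over `dim` would span.
    me = coords[rank]
    return [r for r, c in coords.items()
            if all(c[d] == me[d] for d in c if d != dim)]

mesh = build_mesh(8, (2, 2, 2), ("pp", "dp", "tp"))
print(sub_mesh(mesh, 0, "tp"))  # -> [0, 1]  (tp is innermost, so neighbors)
print(sub_mesh(mesh, 0, "dp"))  # -> [0, 2]
print(sub_mesh(mesh, 0, "pp"))  # -> [0, 4]  (pp is the outermost dim)
```

The same logic explains why dimension order matters: the innermost dimension groups adjacent ranks, which typically share a node and the fastest interconnect, so bandwidth-hungry groups (TP) go last.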
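The shape of the 1F1B schedule can also be sketched without any pipeline machinery: each stage runs a few warmup forwards to fill the pipeline, then alternates one forward with one backward, then drains the remaining backwards. This is a generic illustration of the schedule, not the actual PipelineStageModule implementation.

```python
def one_f_one_b(num_stages, num_microbatches, stage):
    # Warmup: enough forwards to fill the pipeline downstream of this stage.
    warmup = min(num_microbatches, num_stages - stage - 1)
    ops = [("F", m) for m in range(warmup)]
    f, b = warmup, 0
    while f < num_microbatches:      # steady state: one F, then one B
        ops += [("F", f), ("B", b)]
        f += 1
        b += 1
    while b < num_microbatches:      # cooldown: drain remaining backwards
        ops.append(("B", b))
        b += 1
    return ops

# First stage of a 4-stage pipeline, 4 microbatches:
print(one_f_one_b(4, 4, 0))
# -> [('F', 0), ('F', 1), ('F', 2), ('F', 3), ('B', 0), ('B', 1), ('B', 2), ('B', 3)]
# Last stage alternates strictly, finishing each microbatch as it arrives:
print(one_f_one_b(4, 4, 3))
# -> [('F', 0), ('B', 0), ('F', 1), ('B', 1), ('F', 2), ('B', 2), ('F', 3), ('B', 3)]
```

The contrast with GPipe is visible in the warmup count: GPipe runs all forwards before any backward on every stage, so early stages hold activations for all microbatches at once, while 1F1B caps in-flight microbatches at roughly the stage's warmup depth plus one.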
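The FP8 filter rules reduce to a predicate over fully-qualified module names, in the style of the module_filter_fn callback that torchao.float8 conversion accepts. The substrings below mirror the "no experts / router" rule from the table; the exact match logic and names are an illustrative assumption, not the shipped filter.

```python
# Modules whose fully-qualified name contains one of these substrings
# stay in high precision (illustrative; actual skip list may differ).
FP8_SKIP_SUBSTRINGS = ("experts", "router")

def fp8_module_filter(fqn: str) -> bool:
    # True  -> convert this Linear to float8 training
    # False -> leave it in bf16/fp32
    return not any(skip in fqn for skip in FP8_SKIP_SUBSTRINGS)

print(fp8_module_filter("layers.3.mlp.up_proj"))       # -> True
print(fp8_module_filter("layers.3.moe.experts.7.w1"))  # -> False
print(fp8_module_filter("layers.3.moe.router.gate"))   # -> False
```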

Cross-cutting references

Not covered here

  • NCCL health / liveness — the all-reduce heartbeat and NaN detection are in Resilience.

  • Checkpointing under PP / TP / FSDP — mesh-scoped DCP groups and resharding are in Checkpointing.

  • Context parallelism — cp is declared in config and checked in validation, but there is no apply_context_parallel yet; see Device mesh § Context parallelism.