MoE experiments¶
End-to-end workflow for running Mixture-of-Experts models in KempnerForge: pick a router, stand up a baseline, tune the balance signals, diagnose hot/cold experts, turn on Expert Parallelism when memory demands it, and know where the composition rules stop you.
This guide assumes you’ve read MoE § overview. The subsystem pages cover the mechanics; this page covers the when and why.
Start from a working config¶
Two reference configs ship with the repo. Pick whichever matches the scale you’re iterating at:
# 1-GPU sanity check (~1 min): 4 experts, top-2, alternating MoE layers
uv run python scripts/train.py configs/train/debug_moe.toml
# 32-GPU production profile: 8 experts, top-2, EP=2 + TP=4 + FSDP=4
sbatch --nodes=8 scripts/slurm/multinode.sh configs/train/moe_ep_32gpu.toml
debug_moe.toml uses dim=256, n_layers=4, num_experts=4 — small
enough to run on a laptop-scale GPU in under a minute. Use it to
validate config changes before launching a real run.
Pick a router¶
[model]
num_experts = 8
moe_top_k = 2
moe_router = "softmax_topk" # or "sigmoid_topk"
The two routers have different balance strategies — start with
softmax_topk if you’re just getting a baseline up:
| Goal | Router | Why |
|---|---|---|
| Mixtral reproduction | softmax_topk | Matches the original recipe; aux-loss coefficient well-studied (0.01) |
| First MoE run, any target | softmax_topk | Simpler: one knob (moe_aux_loss_weight) |
| DeepSeek-V3 reproduction | sigmoid_topk | Bias-based balancer, matching the paper |
| Long runs where the aux-loss coefficient is brittle | sigmoid_topk | Balance signal doesn't perturb the main loss gradient |
See Routers for the mechanics.
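If it helps to see what the softmax router actually computes, here is a minimal, self-contained sketch of softmax top-k routing in plain PyTorch — illustrative only, not KempnerForge's implementation; the gate weight and tensor names are made up:

import torch
import torch.nn.functional as F

def softmax_topk_route(x, gate_weight, top_k=2):
    # x: (tokens, dim), gate_weight: (num_experts, dim)
    logits = x @ gate_weight.T                              # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)                       # router probabilities
    weights, expert_idx = probs.topk(top_k, dim=-1)         # top-k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalized combine weights
    return weights, expert_idx, probs                       # probs also feed the aux loss

# 16 tokens, dim=256, 8 experts, top-2
w, idx, probs = softmax_topk_route(torch.randn(16, 256), torch.randn(8, 256), top_k=2)
print(w.shape, idx.shape)  # torch.Size([16, 2]) torch.Size([16, 2])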
Tune the aux loss¶
For softmax_topk, the coefficient moe_aux_loss_weight controls how
hard the router is pushed toward uniform:
[model]
moe_aux_loss_weight = 0.01 # Switch/Mixtral default — keep unless you see imbalance
0.01 (default): the Switch / Mixtral value. Start here.
0.1: over-regularizes — experts stay uniform but can’t specialize. Use only if you have a severe collapse problem and want to force exploration early.
0.001: under-regularizes. Balance drifts. Only use if you also turn on capacity factor or packed experts.
For sigmoid_topk, the coefficient has no effect by default
(aux_loss is 0). It only matters if you’ve enabled
moe_sequence_aux_loss_weight > 0; see
Aux loss § Sequence-level.
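To make the coefficient concrete, the standard Switch-style balancing term it scales looks roughly like the sketch below — hedged: KempnerForge's exact reduction over layers and micro-batches may differ, see Aux loss and balancing. Under this formulation the term is about 1.0 at perfect balance, so moe_aux_loss_weight = 0.01 adds roughly 0.01 to the loss.

import torch
import torch.nn.functional as F

def switch_aux_loss(router_probs, expert_idx, num_experts):
    # f_e: fraction of routed (token, slot) assignments that landed on expert e
    f = F.one_hot(expert_idx.reshape(-1), num_experts).float().mean(dim=0)
    # P_e: mean router probability mass assigned to expert e
    P = router_probs.mean(dim=0)
    # both are 1/E at perfect balance; hot experts push the product up
    return num_experts * (f * P).sum()

probs = torch.softmax(torch.randn(4096, 8), dim=-1)
idx = probs.topk(2, dim=-1).indices
print(float(switch_aux_loss(probs, idx, 8)))  # ~1.0 when roughly balanced
# total loss (illustrative): loss = ce_loss + moe_aux_loss_weight * aux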
Diagnose hot/cold experts¶
The training loop logs two MoE-specific metrics when num_experts > 0:
moe/aux_loss # scalar aux loss this step (dispatches to WandB/TB)
moe/expert_balance # min/max token-count ratio across (layer × expert) — 1.0 = perfectly uniform
Interpretation:
- expert_balance > 0.5 — fine. Balance is within 2× across experts; normal for the softmax router.
- expert_balance < 0.2 — one or more experts are getting ≥ 5× more tokens than the quietest. The router is specializing; check whether that's intentional.
- expert_balance → 0 — an expert got zero tokens this step. If it happens for one step, ignore it. If it persists across many logging intervals, you have a dead expert.
For per-layer detail, call Transformer.get_expert_counts() —
returns dict[layer_idx → tensor(num_experts,)]. Useful to spot
which layer is collapsing:
# In a custom script or notebook
counts = model.get_expert_counts()
for layer_idx, c in counts.items():
    pct = (c / c.sum() * 100).tolist()
    print(f"layer {layer_idx}: {[f'{p:.1f}%' for p in pct]}")
A balanced 8-expert layer prints ~12.5% per expert. A collapsing one
prints something like [45%, 30%, 10%, 5%, 5%, 3%, 2%, 0%].
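If you prefer a threshold to eyeballing percentages, the min/max ratio behind moe/expert_balance can be recomputed per layer from the same counts — a sketch reusing the counts dict above, assuming the metric is the min/max token-count ratio described earlier:

# Flag layers whose min/max token-count ratio is below the 0.2 threshold
for layer_idx, c in counts.items():
    c = c.float()
    balance = (c.min() / c.max().clamp(min=1)).item()
    if balance < 0.2:
        print(f"layer {layer_idx}: balance={balance:.2f} — possible collapsing expert")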
Recovery options, in order of what to try first:
1. Raise moe_aux_loss_weight from 0.01 → 0.02 (softmax) or turn on moe_sequence_aux_loss_weight = 0.001 (sigmoid).
2. Set moe_capacity_factor = 1.25 — caps hot experts at 25% above average, spilling overflow to the residual.
3. Enable moe_gradient_scale = true — normalizes per-expert gradient magnitudes so cold experts aren't starved of learning signal.
4. Restart from a checkpoint before the collapse with a different seed. Balance issues that show up after many thousand steps often don't reappear with a different initialization.
When to turn on EP¶
Expert Parallelism splits experts across ranks. It’s a memory-first feature: flip it on when experts dominate your parameter budget, not to go faster.
Rule of thumb — turn on EP when:
- Expert weights are > 50% of total model memory. For a 4B-total MoE with 8 experts and moe_frequency = 1, experts are typically 70-80% of parameters; EP=2 cuts that in half per rank (a quick estimator sketch follows this list).
- You're OOMing with FSDP alone. FSDP shards by tensor, but experts live as (E, dim, hidden) parameters (packed) or as a ModuleList. FSDP wraps them but can't split across the expert dimension; EP does.
- Cross-node bandwidth is adequate. EP adds two all-to-alls per MoE layer (dispatch + combine). The measured 32-GPU EP=2 profile assumes InfiniBand; commodity Ethernet interconnects have not been benchmarked. See Benchmarks § MoE Expert Parallelism for measured numbers on H200 + IB.
Constraints:
- num_experts must be divisible by ep (validated in JobConfig.__post_init__).
- ep > 1 with num_experts == 0 is rejected — EP requires MoE.
- Typical combinations: ep=2 for 8 experts, ep=4 for 8-16 experts, ep=8 for 32+ experts.
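Those checks are simple to state as code. A sketch of the equivalent logic — illustrative only; the real validation lives in JobConfig.__post_init__ (see Validation rules) and its exact messages differ:

def validate_ep(num_experts: int, ep: int) -> None:
    if ep > 1 and num_experts == 0:
        raise ValueError("ep > 1 requires an MoE model (num_experts > 0)")
    if ep > 1 and num_experts % ep != 0:
        raise ValueError(f"num_experts={num_experts} must be divisible by ep={ep}")

validate_ep(num_experts=8, ep=2)    # passes
# validate_ep(num_experts=8, ep=3)  # raises: 8 is not divisible by 3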
Minimal toggle:
[distributed]
tp = 4 # unchanged
ep = 2 # new
dp_shard = 4 # unchanged
See Expert parallelism for the all-to-all mechanics and throughput measurements.
Capacity factor: only when EP is on¶
moe_capacity_factor caps tokens per expert. It’s almost always the
wrong knob to turn on without EP:
- Without EP: capacity factor drops tokens (the residual still flows), which is wasted compute with no memory upside. Leave it at 0.0.
- With EP: dispatch all-to-all buffers are sized for the worst-case token distribution. Without a cap, one rank can overflow its buffer when many tokens happen to prefer its experts. moe_capacity_factor = 1.25 (the Switch default) bounds the buffer predictably.
[model]
moe_capacity_factor = 1.25 # only meaningful with distributed.ep > 1
Start at 1.25; raise to 1.5 if you see ~5%+ token drop in metrics, lower to 1.0 if throughput is buffer-bound. See Capacity and dispatch § When to use capacity.
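For intuition, the cap works out to a per-expert token budget along the lines of the standard Switch formula — a sketch, not the exact KempnerForge computation (rounding and per-rank accounting may differ; see Capacity and dispatch):

import math

def expert_capacity(tokens, num_experts, top_k, capacity_factor):
    # max routed assignments any single expert accepts; overflow falls back to the residual
    return math.ceil(tokens * top_k / num_experts * capacity_factor)

# 8192 tokens, 8 experts, top-2, factor 1.25 -> 2560 tokens per expert
print(expert_capacity(8192, 8, 2, 1.25))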
Packed experts: on if num_experts ≥ 16¶
[model]
moe_packed_experts = true
Replaces the ModuleList of experts with three packed
(num_experts, dim, hidden) tensors. Measured speedups from
benchmarks/moe_packed:
| Experts | Unpacked | Packed | Speedup |
|---|---|---|---|
| 8 | 48,521 tok/s | 50,972 tok/s | +5.1% |
| 16 | 26,994 tok/s | 36,860 tok/s | +36.5% |
| 64 | 1,796 tok/s | 2,204 tok/s | +22.7% |
At 8 experts the win is marginal; at 16+ it’s worth the flag. Default is off because the EP integration (slicing packed tensors on the expert axis) is newer than the unpacked path.
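The speedup comes from replacing a Python loop over per-expert Linears with batched GEMMs over the packed tensors. A minimal sketch of that shape change (illustrative sizes and names, not the moe_packed_experts code path itself):

import torch

E, dim, hidden, tokens_per_expert = 16, 256, 1024, 128

w_up = torch.randn(E, dim, hidden)            # packed: one tensor instead of a ModuleList
x = torch.randn(E, tokens_per_expert, dim)    # tokens already grouped by expert

# unpacked path: E small GEMMs launched from Python
out_loop = torch.stack([x[e] @ w_up[e] for e in range(E)])

# packed path: a single batched GEMM over the expert axis
out_bmm = torch.bmm(x, w_up)

assert torch.allclose(out_loop, out_bmm, atol=1e-4)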
Composition caveats¶
MoE + TP: fine, but TP doesn’t touch experts¶
TP applies to attention Q/K/V/O projections. In MoE layers, expert weights stay replicated across the TP group; TP gives you no memory savings on experts. Use EP for that.
# kempnerforge/distributed/tensor_parallel.py — _apply_block_tp (trimmed)
if isinstance(block.mlp, MoEMLP):
    pass  # experts replicated, TP on attention only
Sequence-parallel is also disabled for MoE blocks (boolean indexing in
expert dispatch breaks Shard(1) DTensors).
MoE + FP8: experts stay bf16¶
FP8 conversion is applied to dense Linears only. Three classes are
excluded: routed experts, shared expert, router gate. See
MoE + FP8 for the rationale. Practical consequence:
FP8 throughput lift on an MoE model is smaller than on a dense model
of the same active-parameter count, because expert-weight GEMMs run
at bf16 regardless.
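If you want to see which Linears end up converted, the exclusion amounts to a name filter along these lines — the module names here are hypothetical, not KempnerForge's actual FQNs (see MoE + FP8 for the real exclusion list):

# Hypothetical filter: True means "convert this Linear to FP8"
EXCLUDED = ("experts", "shared_expert", "router")   # illustrative substrings

def fp8_module_filter(fqn: str) -> bool:
    return not any(s in fqn for s in EXCLUDED)

print(fp8_module_filter("layers.3.attention.wq"))      # True  -> FP8
print(fp8_module_filter("layers.3.mlp.experts.0.w1"))  # False -> stays bf16
print(fp8_module_filter("layers.3.mlp.router.gate"))   # False -> stays bf16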
FP8 + EP composes fine — they’re orthogonal (EP moves experts across ranks, FP8 wouldn’t touch them anyway).
FP8 + TP does not compose — JobConfig.__post_init__ rejects it
because Float8Linear’s DTensor strategy is incomplete. See
FP8 § TP incompatibility.
MoE + PP: not supported¶
MoE + Pipeline Parallelism is not supported. MoE layers use
all-to-all communication which conflicts with the pipeline schedule.
The error is raised in JobConfig.__post_init__ when num_experts > 0 and
pp > 1. For very large MoE runs that need PP, you’d need a
schedule that interleaves all-to-all with pipeline microbatches —
not in KempnerForge today.
Minimal production recipe¶
A reasonable starting point for a 4B-total / 1.8B-active MoE on 32 H200 GPUs:
[model]
num_experts = 8
moe_top_k = 2
moe_router = "softmax_topk" # or "sigmoid_topk" for DeepSeek-V3
moe_frequency = 1
moe_aux_loss_weight = 0.01
moe_capacity_factor = 1.25 # because EP is on
moe_packed_experts = false # 8 experts: not worth it
moe_gradient_scale = false # baseline only
[distributed]
tp = 4
ep = 2
dp_shard = 4
[train]
mixed_precision = "bf16" # fp8 has limited lift on MoE; stay bf16 for this recipe
Copy from configs/train/moe_ep_32gpu.toml
if you want to run it directly.
See also¶
MoE § overview — the subsystem pages this guide pulls together.
Routers — full mechanics of the two routers.
Aux loss and balancing — coefficient tuning and bias schedules.
Capacity and dispatch — drop policy and grouped GEMM path.
Expert parallelism — EP mechanics and measured throughput.
Benchmarks § MoE Expert Parallelism — full 32-GPU measurement table.
Validation rules — the MoE cross-section config checks.