Mixture of Experts¶
KempnerForge’s MoE implementation covers Mixtral-style (softmax top-k, Switch aux loss) and DeepSeek-V3-style (sigmoid top-k with bias-based balancing) routing. Everything MoE-specific lives in kempnerforge/model/moe.py and kempnerforge/model/router.py; the distributed mechanics (all-to-all, expert parallelism) live under Distributed.
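The two routing styles differ mainly in how per-expert scores become gate weights. The sketch below is a plain-PyTorch illustration of that difference, not KempnerForge’s actual router code; the tensor shapes, the renormalization step, and the bias handling are assumptions based on the published Mixtral and DeepSeek-V3 formulations.

```python
import torch
import torch.nn.functional as F

def softmax_topk(logits: torch.Tensor, k: int):
    """Mixtral-style gating: softmax over all experts, keep the top-k,
    renormalize the kept gates so they sum to 1 per token."""
    probs = F.softmax(logits, dim=-1)               # (tokens, num_experts)
    gates, expert_idx = probs.topk(k, dim=-1)       # (tokens, k)
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return gates, expert_idx

def sigmoid_topk(logits: torch.Tensor, k: int, bias: torch.Tensor):
    """DeepSeek-V3-style gating: per-expert sigmoid affinities; a balancing
    bias (updated outside the loss) shifts only which experts are *selected*,
    while the gate values come from the unbiased affinities."""
    affinity = torch.sigmoid(logits)                # (tokens, num_experts)
    _, expert_idx = (affinity + bias).topk(k, dim=-1)
    gates = affinity.gather(-1, expert_idx)         # unbiased gate values
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return gates, expert_idx
```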
What’s in this section¶
- Routers — softmax_topk vs sigmoid_topk, registry selection, shared-expert composition.
- Aux loss and balancing — Switch-style aux loss, bias-based EMA balancing, sequence-level aux loss, per-expert gradient scaling (a minimal version of the aux loss is sketched after this list).
- Capacity and dispatch — capacity factor, token drop policy, grouped GEMM vs sequential path, packed experts.
- MoE + FP8 — which Linears get excluded from Float8 conversion and why.
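For orientation, here is a minimal sketch of the standard Switch-Transformer load-balancing loss that the balancing page builds on. It is not KempnerForge’s implementation; how top-k routing and the sequence-level variant are handled there may differ.

```python
import torch
import torch.nn.functional as F

def switch_aux_loss(router_logits: torch.Tensor,
                    expert_idx: torch.Tensor,
                    num_experts: int) -> torch.Tensor:
    """Standard Switch aux loss: num_experts * sum_i f_i * P_i, where f_i is
    the fraction of routed assignments sent to expert i and P_i is the mean
    router probability assigned to expert i."""
    probs = F.softmax(router_logits, dim=-1)                        # (tokens, num_experts)
    counts = torch.bincount(expert_idx.flatten(), minlength=num_experts)
    f = counts.float() / counts.sum()                               # dispatch fractions
    p = probs.mean(dim=0)                                           # mean router probs
    return num_experts * torch.sum(f * p)
```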
Config at a glance¶
| Field | Default | What it controls |
|---|---|---|
|  |  | Number of experts each token routes to |
|  |  | Router type (see Routers) |
| moe_frequency |  | Every Nth layer is MoE; others are dense |
|  |  | Dense expert applied to every token on top of the routed experts |
|  |  | Coefficient on the aux loss |
|  |  | Packed weight tensors instead of per-expert modules |
|  |  | Per-expert output scaling by utilization |
|  |  | Sigmoid-only: sequence-level balance penalty |
|  |  | Sigmoid-only: bias update rate schedule |
All fields live in kempnerforge/config/model.py (ModelConfig).
MoE layer placement¶
moe_frequency = 1 makes every transformer block an MoE block.
moe_frequency = 2 alternates dense / MoE layers, matching the
DeepSeek-V2/V3 recipe. The dense layers use a plain SwiGLUMLP; only
the MoE blocks hit the routing, dispatch, and aux-loss machinery described in this section.
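The placement rule itself is simple modular arithmetic. Below is a minimal sketch; the offset convention (which block index becomes the first MoE block) is an assumption, not necessarily what KempnerForge uses.

```python
def is_moe_block(layer_idx: int, moe_frequency: int) -> bool:
    """True if this transformer block should be an MoE block.
    Offset convention here: with moe_frequency = 2, blocks 1, 3, 5, ... are MoE."""
    return (layer_idx + 1) % moe_frequency == 0

# moe_frequency = 1: every block is MoE.
assert all(is_moe_block(i, 1) for i in range(8))
# moe_frequency = 2: dense / MoE alternation.
assert [is_moe_block(i, 2) for i in range(4)] == [False, True, False, True]
```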
Cross-section constraints¶
MoE + TP on experts: not supported. Expert weights stay replicated across the TP group.
MoE + PP: not supported — the config validator raises in
JobConfig.__post_init__.MoE + EP: supported.
num_expertsmust be divisible byep_world_size. See Expert parallelism.MoE + FP8: supported with exclusions. Experts, shared expert, and router gate stay bf16 — MoE + FP8.
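The PP and EP constraints reduce to two config-time checks. The sketch below restates them in plain Python; the real validation lives in JobConfig.__post_init__, and the parameter names and error messages here are assumptions.

```python
def validate_moe_parallelism(num_experts: int,
                             ep_world_size: int,
                             pp_world_size: int) -> None:
    """Illustrative restatement of the MoE cross-section constraints; the
    actual checks in JobConfig.__post_init__ use their own field names."""
    if pp_world_size > 1:
        raise ValueError("MoE + pipeline parallelism is not supported")
    if num_experts % ep_world_size != 0:
        raise ValueError(
            f"num_experts ({num_experts}) must be divisible by "
            f"ep_world_size ({ep_world_size})"
        )
```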
See also¶
- Expert parallelism — the all-to-all dispatch/combine used when ep_world_size > 1.
- MoE experiments — end-to-end workflow and diagnosis recipes.
- Validation rules — cross-section config checks (MoE + PP unsupported, num_experts % ep == 0).
- Benchmarks § MoE Expert Parallelism — measured throughput across EP sizes.