Mixture of Experts

KempnerForge’s MoE implementation covers Mixtral-style (softmax top-k, Switch aux loss) and DeepSeek-V3-style (sigmoid top-k with bias-based balancing) routing. Everything MoE-specific lives under kempnerforge/model/moe.py and kempnerforge/model/router.py; the distributed mechanics (all-to-all, expert parallelism) live under Distributed.
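
To make the distinction concrete, here is a minimal sketch of the two routing families (illustrative only, not KempnerForge's actual router code; tensor shapes and the renormalization step are assumptions):

```python
import torch
import torch.nn.functional as F

def softmax_topk(logits: torch.Tensor, k: int):
    # Mixtral-style: softmax over all experts, keep the top-k
    # probabilities, renormalize so the k combine weights sum to 1.
    probs = F.softmax(logits, dim=-1)             # (tokens, num_experts)
    weights, indices = probs.topk(k, dim=-1)      # (tokens, k)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, indices

def sigmoid_topk(logits: torch.Tensor, k: int, bias: torch.Tensor):
    # DeepSeek-V3-style: independent sigmoid affinities per expert; the
    # balancing bias shifts expert *selection* only, while the combine
    # weights come from the unbiased scores.
    scores = torch.sigmoid(logits)                # (tokens, num_experts)
    _, indices = (scores + bias).topk(k, dim=-1)  # bias: (num_experts,)
    weights = scores.gather(-1, indices)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, indices
```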

What’s in this section

  • Routers — softmax_topk vs sigmoid_topk, registry selection, shared-expert composition.

  • Aux loss and balancing — Switch-style aux loss (see the sketch after this list), bias-based EMA balancing, sequence-level aux loss, per-expert gradient scaling.

  • Capacity and dispatch — capacity factor, token drop policy, grouped GEMM vs sequential path, packed experts.

  • MoE + FP8 — which Linears get excluded from Float8 conversion and why.
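
For reference, a sketch of the Switch-style aux loss mentioned above (argument names and shapes are assumptions, not KempnerForge's API):

```python
import torch

def switch_aux_loss(router_probs: torch.Tensor,
                    expert_indices: torch.Tensor,
                    num_experts: int) -> torch.Tensor:
    # Switch-Transformer-style load-balancing loss:
    #   f_i  = fraction of routed assignments sent to expert i
    #   P_i  = mean router probability assigned to expert i
    #   loss = num_experts * sum_i f_i * P_i
    # router_probs:   (tokens, num_experts) softmax outputs
    # expert_indices: (tokens, k) top-k expert ids
    k = expert_indices.shape[-1]
    one_hot = torch.zeros_like(router_probs).scatter_(-1, expert_indices, 1.0)
    dispatch_frac = one_hot.sum(dim=0) / (one_hot.shape[0] * k)  # f_i
    mean_probs = router_probs.mean(dim=0)                        # P_i
    return num_experts * torch.sum(dispatch_frac * mean_probs)
```

The result is multiplied by moe_aux_loss_weight before being added to the main loss.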

Config at a glance

| Field | Default | What it controls |
| --- | --- | --- |
| num_experts | 0 | >0 enables MoE layers |
| moe_top_k | 2 | Number of experts each token routes to |
| moe_router | "softmax_topk" | Router type (see Routers) |
| moe_frequency | 1 | Every Nth layer is MoE; others are dense |
| moe_shared_experts | 0 | Dense expert applied to every token on top of the routed experts |
| moe_capacity_factor | 0.0 | >0 caps tokens per expert (see Capacity and dispatch) |
| moe_aux_loss_weight | 0.01 | Coefficient on aux_loss added to the main loss |
| moe_packed_experts | false | Packed weight tensors instead of a ModuleList |
| moe_gradient_scale | false | Per-expert output scaling by utilization |
| moe_sequence_aux_loss_weight | 0.0 | Sigmoid-only: sequence-level balance penalty |
| moe_bias_schedule | "constant" | Sigmoid-only: bias update rate schedule |

All fields live in kempnerforge/config/model.py (ModelConfig).
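
For example, a hypothetical MoE configuration might look like this (a sketch; the field values are illustrative and it assumes the non-MoE fields of ModelConfig have usable defaults):

```python
from kempnerforge.config.model import ModelConfig

cfg = ModelConfig(
    num_experts=8,              # >0 enables MoE layers
    moe_top_k=2,                # each token routes to 2 experts
    moe_router="sigmoid_topk",  # DeepSeek-V3-style routing
    moe_frequency=2,            # alternate dense / MoE blocks
    moe_shared_experts=1,       # one always-on expert per token
    moe_aux_loss_weight=0.01,
)
```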

MoE layer placement

moe_frequency = 1 makes every transformer block an MoE block. moe_frequency = 2 alternates dense / MoE layers, matching the DeepSeek-V2/V3 recipe. The dense layers use a plain SwiGLUMLP; only the MoE blocks hit the routing, dispatch, and aux-loss machinery described in this section.
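
One plausible reading of the placement rule as a predicate (the exact offset — whether the first MoE block sits at layer 0 or layer moe_frequency - 1 — is an assumption):

```python
def is_moe_block(layer_idx: int, num_experts: int, moe_frequency: int) -> bool:
    # num_experts == 0 disables MoE entirely; otherwise every
    # moe_frequency-th block is an MoE block.
    if num_experts <= 0:
        return False
    return (layer_idx + 1) % moe_frequency == 0

# moe_frequency = 1 -> every block is MoE
# moe_frequency = 2 -> blocks 1, 3, 5, ... are MoE, the rest dense
```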

Cross-section constraints

  • MoE + TP on experts: not supported. Expert weights stay replicated across the TP group.

  • MoE + PP: not supported — the config validator raises in JobConfig.__post_init__.

  • MoE + EP: supported. num_experts must be divisible by ep_world_size. See Expert parallelism.

  • MoE + FP8: supported with exclusions. Experts, the shared expert, and the router gate stay in bf16 — see MoE + FP8.
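
A sketch of what those checks amount to (attribute names like pp_world_size and ep_world_size are assumptions; the real validation lives in JobConfig.__post_init__):

```python
def validate_moe(cfg) -> None:
    if cfg.num_experts <= 0:
        return  # dense model: nothing to check
    if cfg.pp_world_size > 1:
        raise ValueError("MoE + pipeline parallelism is not supported")
    if cfg.ep_world_size > 1 and cfg.num_experts % cfg.ep_world_size != 0:
        raise ValueError(
            f"num_experts ({cfg.num_experts}) must be divisible by "
            f"ep_world_size ({cfg.ep_world_size})"
        )
```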
