Mixture of Experts

KempnerForge’s MoE implementation covers Mixtral-style (softmax top-k, Switch aux loss) and DeepSeek-V3-style (sigmoid top-k with bias-based balancing) routing. Everything MoE-specific lives under kempnerforge/model/moe.py and kempnerforge/model/router.py; the distributed mechanics (all-to-all, expert parallelism) live under Distributed.
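
To make the distinction concrete, here is a minimal sketch of the two routing families (illustrative only, not KempnerForge's actual router code; tensor shapes and the renormalization step are assumptions):

```python
import torch
import torch.nn.functional as F

def softmax_topk(logits: torch.Tensor, k: int):
    # Mixtral-style: softmax over all experts, keep the top-k
    # probabilities, renormalize so the k combine weights sum to 1.
    probs = F.softmax(logits, dim=-1)             # (tokens, num_experts)
    weights, indices = probs.topk(k, dim=-1)      # (tokens, k)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, indices

def sigmoid_topk(logits: torch.Tensor, k: int, bias: torch.Tensor):
    # DeepSeek-V3-style: independent sigmoid affinities per expert; the
    # balancing bias shifts expert *selection* only, while the combine
    # weights come from the unbiased scores.
    scores = torch.sigmoid(logits)                # (tokens, num_experts)
    _, indices = (scores + bias).topk(k, dim=-1)  # bias: (num_experts,)
    weights = scores.gather(-1, indices)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, indices
```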

What’s in this section

  • Routers — softmax_topk vs sigmoid_topk, registry selection, shared-expert composition.

  • Aux loss and balancing — Switch-style aux loss (see the sketch after this list), bias-based EMA balancing, sequence-level aux loss, per-expert gradient scaling.

  • Capacity and dispatch — capacity factor, token drop policy, grouped GEMM vs sequential path, packed experts.

  • MoE + FP8 — which Linears get excluded from Float8 conversion and why.
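
For reference, a sketch of the Switch-style aux loss mentioned above (argument names and shapes are assumptions, not KempnerForge's API):

```python
import torch

def switch_aux_loss(router_probs: torch.Tensor,
                    expert_indices: torch.Tensor,
                    num_experts: int) -> torch.Tensor:
    # Switch-Transformer-style load-balancing loss:
    #   f_i  = fraction of routed assignments sent to expert i
    #   P_i  = mean router probability assigned to expert i
    #   loss = num_experts * sum_i f_i * P_i
    # router_probs:   (tokens, num_experts) softmax outputs
    # expert_indices: (tokens, k) top-k expert ids
    k = expert_indices.shape[-1]
    one_hot = torch.zeros_like(router_probs).scatter_(-1, expert_indices, 1.0)
    dispatch_frac = one_hot.sum(dim=0) / (one_hot.shape[0] * k)  # f_i
    mean_probs = router_probs.mean(dim=0)                        # P_i
    return num_experts * torch.sum(dispatch_frac * mean_probs)
```

The result is multiplied by moe_aux_loss_weight before being added to the main loss.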

Config at a glance

| Field | Default | What it controls |
| --- | --- | --- |
| num_experts | 0 | >0 enables MoE layers |
| moe_top_k | 2 | Number of experts each token routes to |
| moe_router | "softmax_topk" | Router type (see Routers) |
| moe_frequency | 1 | Every Nth layer is MoE; others are dense |
| moe_shared_experts | 0 | Dense expert applied to every token on top of the routed experts |
| moe_capacity_factor | 0.0 | >0 caps tokens per expert (see Capacity and dispatch) |
| moe_aux_loss_weight | 0.01 | Coefficient on aux_loss added to the main loss |
| moe_packed_experts | false | Packed weight tensors instead of a ModuleList |
| moe_gradient_scale | false | Per-expert output scaling by utilization |
| moe_sequence_aux_loss_weight | 0.0 | Sigmoid-only: sequence-level balance penalty |
| moe_bias_schedule | "constant" | Sigmoid-only: bias update rate schedule |

All fields live in kempnerforge/config/model.py (ModelConfig).
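
For example, a hypothetical MoE configuration might look like this (a sketch; the field values are illustrative and it assumes the non-MoE fields of ModelConfig have usable defaults):

```python
from kempnerforge.config.model import ModelConfig

cfg = ModelConfig(
    num_experts=8,              # >0 enables MoE layers
    moe_top_k=2,                # each token routes to 2 experts
    moe_router="sigmoid_topk",  # DeepSeek-V3-style routing
    moe_frequency=2,            # alternate dense / MoE blocks
    moe_shared_experts=1,       # one always-on expert per token
    moe_aux_loss_weight=0.01,
)
```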

MoE layer placement

moe_frequency = 1 makes every transformer block an MoE block. moe_frequency = 2 alternates dense / MoE layers, matching the DeepSeek-V2/V3 recipe. The dense layers use a plain SwiGLUMLP; only the MoE blocks hit the routing, dispatch, and aux-loss machinery described in this section.
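
One plausible reading of the placement rule as a predicate (the exact offset — whether the first MoE block sits at layer 0 or layer moe_frequency - 1 — is an assumption):

```python
def is_moe_block(layer_idx: int, num_experts: int, moe_frequency: int) -> bool:
    # num_experts == 0 disables MoE entirely; otherwise every
    # moe_frequency-th block is an MoE block.
    if num_experts <= 0:
        return False
    return (layer_idx + 1) % moe_frequency == 0

# moe_frequency = 1 -> every block is MoE
# moe_frequency = 2 -> blocks 1, 3, 5, ... are MoE, the rest dense
```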

Cross-section constraints

  • MoE + TP on experts: not supported. Expert weights stay replicated across the TP group.

  • MoE + PP: not supported — the config validator raises in JobConfig.__post_init__.

  • MoE + EP: supported. num_experts must be divisible by ep_world_size. See Expert parallelism.

  • MoE + FP8: supported with exclusions. Experts, the shared expert, and the router gate stay in bf16 — see MoE + FP8.
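
A sketch of what those checks amount to (attribute names like pp_world_size and ep_world_size are assumptions; the real validation lives in JobConfig.__post_init__):

```python
def validate_moe(cfg) -> None:
    if cfg.num_experts <= 0:
        return  # dense model: nothing to check
    if cfg.pp_world_size > 1:
        raise ValueError("MoE + pipeline parallelism is not supported")
    if cfg.ep_world_size > 1 and cfg.num_experts % cfg.ep_world_size != 0:
        raise ValueError(
            f"num_experts ({cfg.num_experts}) must be divisible by "
            f"ep_world_size ({cfg.ep_world_size})"
        )
```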
