Routers
KempnerForge ships two routers, registered under the "router" category
of the component registry:
| Registry key | Class | Style |
|---|---|---|
| `softmax_topk` | `SoftmaxTopKRouter` | Mixtral — softmax probabilities + Switch-Transformer aux loss |
| `sigmoid_topk` | `SigmoidTopKRouter` | DeepSeek-V3 — per-expert sigmoid scores + learnable bias balancing |
Both produce the same output — a `(weights, indices)` pair, each of shape
`(num_tokens, top_k)` — so `MoEMLP` works with either. They differ in
how load balancing is maintained and what loss signal (if any) gets
added to the training loss.
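Since the interface is identical, either router can be checked the same way. A minimal sketch, assuming a `router` instance already built via the Builders section below (the sizes here are illustrative):

```python
import torch

# `router` is either SoftmaxTopKRouter or SigmoidTopKRouter (hypothetical instance).
num_tokens, top_k = 128, 2
x = torch.randn(num_tokens, 512)  # (T, dim)

weights, indices = router(x)
assert weights.shape == (num_tokens, top_k)
assert indices.shape == (num_tokens, top_k)
# After renormalization, each token's kept weights sum to 1.
assert torch.allclose(weights.sum(dim=-1), torch.ones(num_tokens))
```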
SoftmaxTopKRouter — Mixtral-style

```python
# kempnerforge/model/router.py — forward
logits = self.gate(x)                                  # (T, E)
probs = F.softmax(logits, dim=-1)                      # (T, E)
weights, indices = torch.topk(probs, k=self.top_k)     # (T, K)
weights = weights / weights.sum(dim=-1, keepdim=True)  # kept weights sum to 1
```
Each token gets softmax probabilities over all experts; the top K
are kept and renormalized. The auxiliary load-balancing loss is
Switch-Transformer’s:
L_aux = num_experts · Σ_i (f_i · P_i)
where f_i is the fraction of tokens assigned to expert i (hard
counts, detached) and P_i is the mean softmax probability for
expert i (differentiable through the gate). Gradient flows through
P_i and pushes the gate to lower scores for over-utilized experts.
self.aux_loss is updated on every forward; the training loop picks
it up via Transformer.get_moe_aux_loss() and adds
moe_aux_loss_weight · aux_loss to the main loss (see
Aux loss and balancing).
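As a concrete reference, here is a minimal sketch of that loss computed from the forward snippet's tensors (`probs`, `indices`); this illustrates the formula above, it is not the file's exact code:

```python
import torch
import torch.nn.functional as F

def switch_aux_loss(probs: torch.Tensor, indices: torch.Tensor, num_experts: int) -> torch.Tensor:
    # f_i: fraction of routed assignments that went to expert i (hard counts, no gradient)
    counts = F.one_hot(indices, num_experts).sum(dim=(0, 1)).float()  # (E,)
    f = counts / indices.numel()
    # P_i: mean softmax probability for expert i (gradient flows back to the gate)
    P = probs.mean(dim=0)  # (E,)
    return num_experts * torch.sum(f * P)
```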
**Use when:** the default for Mixtral-style runs, or when you want the balancing signal explicit in the loss. The aux-loss coefficient is well-studied territory (Switch Transformer set it to 0.01; Mixtral uses a similar value).
SigmoidTopKRouter — DeepSeek-V3 style

```python
# kempnerforge/model/router.py — forward
logits = self.gate(x)                                  # (T, E)
scores = torch.sigmoid(logits + self.expert_bias)      # (T, E), bias added pre-sigmoid
weights, indices = torch.topk(scores, k=self.top_k)    # (T, K)
weights = weights / weights.sum(dim=-1, keepdim=True)
```
Three things differ from the softmax router:
1. **Sigmoid, not softmax.** Expert scores are independent per expert — no competition for normalization. Raising expert i's score doesn't lower expert j's.
2. **`expert_bias`.** An `nn.Parameter` added to the logits before sigmoid. It is not updated by the optimizer — the training loop adjusts it manually via an EMA of per-expert utilization (see Aux loss and balancing § Bias adjustment); a sketch of this update follows below.
3. **No auxiliary loss by default.** `self.aux_loss = 0.0` unless `sequence_aux_loss_weight > 0` (an opt-in lightweight balance penalty, covered on the aux-loss page).
The claim from DeepSeek-V3 is that bias-based balancing reaches uniform expert utilization without injecting a balance-loss gradient into the main loss — so the routed experts learn a slightly cleaner signal. Empirically both approaches train; the sigmoid router is the one you pick when chasing DeepSeek-V3’s recipe specifically.
**Use when:** DeepSeek-V3 reproduction, or long MoE runs where auxiliary-loss coefficient tuning is annoying and you want the balancer off the main loss path.
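For intuition, a hypothetical sketch of that bias update (the real schedule is configured via `bias_schedule` and documented on the aux-loss page; `ema_util`, `bias_lr`, and `ema_decay` are made-up names for this illustration):

```python
import torch

@torch.no_grad()
def adjust_expert_bias(router, ema_util: torch.Tensor, bias_lr: float = 1e-3, ema_decay: float = 0.99):
    # EMA of per-expert utilization, fed by the last forward's hard counts.
    counts = router.expert_counts.float()
    ema_util.mul_(ema_decay).add_(counts / counts.sum(), alpha=1.0 - ema_decay)
    # Nudge the bias down for hot experts and up for cold ones, toward uniform utilization.
    target = 1.0 / ema_util.numel()
    router.expert_bias.add_(bias_lr * torch.sign(target - ema_util))

# Initialize once, e.g. ema_util = torch.full((num_experts,), 1.0 / num_experts),
# then call adjust_expert_bias(router, ema_util) after each optimizer step.
```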
Side effects

Both routers store two tensors as forward-time side effects:

```python
self.aux_loss: torch.Tensor       # scalar, picked up by Transformer.get_moe_aux_loss()
self.expert_counts: torch.Tensor  # (num_experts,), picked up by get_expert_counts()
```
expert_counts holds per-expert token counts (hard, detached) for the
most recent forward. It’s how hot/cold expert diagnosis works — see
MoE experiments § Hot/cold expert diagnosis.
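A quick utilization readout after a forward pass (assuming the accessor named above; how counts aggregate across layers is defined there, not here):

```python
counts = model.get_expert_counts()   # (num_experts,) hard token counts from the last forward
util = counts.float() / counts.sum()

hot, cold = util.argmax().item(), util.argmin().item()
print(f"hottest expert {hot}: {util[hot].item():.1%}; coldest expert {cold}: {util[cold].item():.1%}")
```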
Builders

Builders live in the same file and are registered at import time:

```python
# kempnerforge/model/router.py
def _build_softmax_topk(dim, num_experts, top_k): ...
def _build_sigmoid_topk(dim, num_experts, top_k, **kwargs): ...

registry.register("router", "softmax_topk", _build_softmax_topk)
registry.register("router", "sigmoid_topk", _build_sigmoid_topk)
```
Selection is config-driven:

```toml
[model]
num_experts = 8
moe_top_k = 2
moe_router = "sigmoid_topk"  # or "softmax_topk" (default)
```
`build_moe()` forwards the right kwargs to whichever router is selected (only `sigmoid_topk` consumes `sequence_aux_loss_weight` and `bias_schedule`).
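Internally, selection reduces to a registry lookup along these lines (a sketch: `registry.get` and the `cfg` field names are assumptions here, not the exact `build_moe()` code):

```python
# Hypothetical: look up the builder registered under the configured key,
# then call it with the model dimensions from the config.
builder = registry.get("router", cfg.model.moe_router)  # "softmax_topk" or "sigmoid_topk"
router = builder(dim=cfg.model.dim, num_experts=cfg.model.num_experts,
                 top_k=cfg.model.moe_top_k)
```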
Picking a router

| | `softmax_topk` | `sigmoid_topk` |
|---|---|---|
| Balance mechanism | Auxiliary loss adds a term to the main loss | EMA-driven bias adjustment; no loss term (by default) |
| Coefficient to tune | `moe_aux_loss_weight` | None by default (`sequence_aux_loss_weight` if opted in) |
| Best for | Mixtral baseline, pre-tuned aux-loss recipes | DeepSeek-V3 reproduction, minimal main-loss interference |
| Aux-loss-free | No — aux loss always non-zero | Yes by default |
| Shared experts | Works, but less common in published recipes | Standard DeepSeek pattern |
See also

- Aux loss and balancing — how the two routers' balance signals flow into training.
- Capacity and dispatch — what happens to the `(weights, indices)` tuple after routing.
- MoE + FP8 — why `router.gate` is excluded from Float8 conversion.
- Registry — how the `"router"` category fits with the other 6 registry categories.
- MoE experiments — end-to-end workflow using one or both routers.