kempnerforge.model.transformer

Transformer model for KempnerForge.

Architecture: Llama-style pre-norm transformer.

Token Embedding → [TransformerBlock × N] → Final Norm → Output Head

Design choices:
  • ModuleDict (not ModuleList) for layers: preserves fully qualified names (FQNs) for distributed checkpointing (DCP).

  • Embedding and output head are optional: either can be None on middle pipeline-parallel (PP) stages.

  • Forward is a simple loop over blocks — pipeline-parallelism friendly.
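
The first design choice can be illustrated with a minimal sketch. The `Stage` class below is hypothetical, and `nn.Linear` layers stand in for transformer blocks; this is not KempnerForge's code. It shows that when a stage owns only a slice of the layers, `ModuleDict` keys keep the state-dict names aligned with the global layer indices, while `ModuleList` renumbers its children from zero.

```python
import torch.nn as nn


class Stage(nn.Module):
    """Hypothetical pipeline stage that owns a subset of the global layers."""

    def __init__(self, layer_ids, use_dict):
        super().__init__()
        # nn.Linear is only a stand-in for a transformer block.
        blocks = {str(i): nn.Linear(4, 4) for i in layer_ids}
        if use_dict:
            # Keys are the global layer indices, so parameter names survive the split.
            self.layers = nn.ModuleDict(blocks)
        else:
            # ModuleList names its children 0, 1, ... on every stage.
            self.layers = nn.ModuleList(list(blocks.values()))


# Second stage of a 4-layer model: it owns global layers 2 and 3.
dict_keys = sorted(Stage([2, 3], use_dict=True).state_dict())
list_keys = sorted(Stage([2, 3], use_dict=False).state_dict())

print(dict_keys)  # names contain "layers.2" and "layers.3"
print(list_keys)  # names contain "layers.0" and "layers.1"
```

With the `ModuleDict` variant, a checkpoint written by one stage layout can be matched by name against a different layout, because `layers.2.weight` always refers to the same global layer.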

Classes

  • Transformer – Full transformer model built from ModelConfig.

  • TransformerBlock – Single transformer block with pre-norm architecture.

class kempnerforge.model.transformer.TransformerBlock

Bases: Module

Single transformer block with pre-norm architecture.

Structure: norm → attention → residual, norm → mlp → residual
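
The pre-norm structure above can be sketched as follows. This is an illustration under stated assumptions, not the actual TransformerBlock: `PreNormBlock` is a hypothetical name, `nn.LayerNorm` stands in for whatever norm the config selects, and plain `nn.Linear` layers stand in for the attention and MLP sublayers (the real forward also takes `rope_cos`, `rope_sin`, `kv_cache`, and `doc_ids`).

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Hypothetical pre-norm block: normalize, apply sublayer, add residual."""

    def __init__(self, dim, attention, mlp):
        super().__init__()
        self.attention_norm = nn.LayerNorm(dim)  # stand-in norm (assumption)
        self.attention = attention
        self.mlp_norm = nn.LayerNorm(dim)
        self.mlp = mlp

    def forward(self, x):
        # norm -> attention -> residual
        x = x + self.attention(self.attention_norm(x))
        # norm -> mlp -> residual
        x = x + self.mlp(self.mlp_norm(x))
        return x


dim = 16
block = PreNormBlock(dim, attention=nn.Linear(dim, dim), mlp=nn.Linear(dim, dim))
x = torch.randn(2, 5, dim)  # (batch, seq_len, dim)
y = block(x)                # same shape as x
```

Because the norm sits inside each residual branch, the skip path carries `x` through unchanged, which is the property that distinguishes pre-norm from post-norm blocks.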

__init__(config, layer_idx)
Return type:

None

forward(x, rope_cos, rope_sin, *, kv_cache=None, doc_ids=None)
Return type:

torch.Tensor

class kempnerforge.model.transformer.Transformer

Bases: Module

Full transformer model built from ModelConfig.

Embedding → TransformerBlocks → Norm → Output Head
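
As a rough sketch of how this layout maps onto pipeline stages, the hypothetical `StageModel` below makes the embedding and output head optional and runs a plain loop over its blocks. All names are illustrative, `nn.Linear` and `nn.LayerNorm` are stand-ins, and placing the final norm only on the last stage is an assumption rather than something the documentation states.

```python
import torch
import torch.nn as nn


class StageModel(nn.Module):
    """Hypothetical pipeline stage: embedding and head exist only where needed."""

    def __init__(self, vocab_size, dim, layer_ids, is_first, is_last):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, dim) if is_first else None
        # nn.Linear stands in for a transformer block; keys are global indices.
        self.layers = nn.ModuleDict({str(i): nn.Linear(dim, dim) for i in layer_ids})
        # Assumption: the final norm lives with the output head on the last stage.
        self.norm = nn.LayerNorm(dim) if is_last else None
        self.output = nn.Linear(dim, vocab_size, bias=False) if is_last else None

    def forward(self, x):
        # First stage receives token ids; later stages receive hidden states.
        h = self.tok_embeddings(x) if self.tok_embeddings is not None else x
        for layer in self.layers.values():  # simple loop over blocks
            h = layer(h)
        if self.norm is not None:
            h = self.norm(h)
        if self.output is not None:
            h = self.output(h)
        return h


vocab_size, dim = 32, 8
first = StageModel(vocab_size, dim, [0, 1], is_first=True, is_last=False)
middle = StageModel(vocab_size, dim, [2, 3], is_first=False, is_last=False)
last = StageModel(vocab_size, dim, [4, 5], is_first=False, is_last=True)

tokens = torch.randint(0, vocab_size, (2, 5))  # (batch, seq_len)
logits = last(middle(first(tokens)))           # (batch, seq_len, vocab_size)
```

Chaining the three stages on one device reproduces the full Embedding → Blocks → Norm → Output Head path; in a real pipeline each stage would live on a different rank.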

__init__(config)
Parameters:

config (ModelConfig)

Return type:

None

init_weights_and_freqs()

Initialize weights and RoPE frequencies after meta-device materialization.

Call this after model.to_empty(device=...) to fill in parameter values and compute the RoPE frequency table. It is safe to call on an already-initialized model: the call is skipped if the frequencies have already been computed.

Return type:

None
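
The meta-device workflow this method supports might look like the sketch below. `TinyModel` is a hypothetical stand-in, the initializer choices and the RoPE base of 10000 are assumptions, and the early return reflects one reading of "skips if freqs are already computed"; only `torch.device("meta")` and `Module.to_empty` are real PyTorch APIs here.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical model with deferred weight and frequency initialization."""

    def __init__(self, dim, head_dim):
        super().__init__()
        self.head_dim = head_dim
        self.proj = nn.Linear(dim, dim)
        self.freqs = None  # filled in after materialization

    def init_weights_and_freqs(self):
        if self.freqs is not None:
            return  # already initialized: do nothing
        nn.init.normal_(self.proj.weight, std=0.02)  # assumed init scheme
        nn.init.zeros_(self.proj.bias)
        device = self.proj.weight.device
        exponents = torch.arange(0, self.head_dim, 2, device=device).float() / self.head_dim
        self.freqs = 1.0 / (10000.0 ** exponents)  # standard RoPE inverse frequencies


# 1. Build on the meta device: shapes only, no memory allocated.
with torch.device("meta"):
    model = TinyModel(dim=8, head_dim=4)
was_meta = model.proj.weight.is_meta

# 2. Allocate uninitialized storage on the real device.
model = model.to_empty(device="cpu")

# 3. Fill in parameter values and frequencies.
model.init_weights_and_freqs()
```

Calling `model.init_weights_and_freqs()` a second time leaves the weights untouched in this sketch, mirroring the documented idempotence.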

set_moe_step(step, max_steps)

Set the current training step on all MoE routers for adaptive bias scheduling.

Return type:

None

get_moe_aux_loss()

Collect the auxiliary losses from all MoE layers. Returns 0 if the model is dense (has no MoE layers).

Return type:

torch.Tensor
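
A minimal sketch of this kind of aggregation, assuming each MoE layer exposes its latest auxiliary loss as an `aux_loss` attribute. That attribute, `ToyMoELayer`, `collect_aux_loss`, and the 0.01 coefficient are all hypothetical; the documentation does not say how the losses are stored or weighted.

```python
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """Hypothetical MoE layer that records an auxiliary loss."""

    def __init__(self, aux_value):
        super().__init__()
        self.aux_loss = torch.tensor(aux_value)


def collect_aux_loss(layers):
    """Sum aux losses over layers that have one; a zero scalar if none do."""
    total = torch.zeros(())
    for layer in layers:
        aux = getattr(layer, "aux_loss", None)  # dense layers have no aux_loss
        if aux is not None:
            total = total + aux
    return total


moe_total = collect_aux_loss([ToyMoELayer(0.25), nn.Linear(2, 2), ToyMoELayer(0.5)])
dense_total = collect_aux_loss([nn.Linear(2, 2)])

# Typical use: add the aux term to the main loss with a small coefficient.
ce_loss = torch.tensor(2.0)
loss = ce_loss + 0.01 * moe_total
```

Returning a zero tensor for dense models lets the training loop add the term unconditionally.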

get_expert_counts()

Collect per-layer expert utilization. Returns an empty dict if the model is dense.

Return type:

dict[int, torch.Tensor]
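
One plausible shape for the returned mapping is sketched below, assuming utilization means the number of tokens routed to each expert. The `expert_counts` helper and the routing tensors are hypothetical; how KempnerForge measures utilization is not stated here.

```python
import torch


def expert_counts(expert_indices, num_experts):
    """Count how many token slots were routed to each expert."""
    # bincount needs a 1-D tensor of non-negative integers.
    return torch.bincount(expert_indices.flatten(), minlength=num_experts)


# Hypothetical routing decisions for two MoE layers (layer index -> chosen experts).
routing = {
    1: torch.tensor([[0, 2], [2, 2]]),
    3: torch.tensor([[1, 1], [0, 3]]),
}
counts = {layer: expert_counts(idx, num_experts=4) for layer, idx in routing.items()}
```

A strongly skewed count vector (as for layer 1 here, where expert 2 receives three of four tokens) is the kind of imbalance such a diagnostic is meant to surface.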

forward(tokens, *, kv_caches=None, doc_ids=None)

Forward pass.

Parameters:
  • tokens (torch.Tensor) – Integer tensor of shape (batch, seq_len).

  • kv_caches (list[KVCache] | None) – Optional list of KVCache (one per layer) for generation. When provided, RoPE positions are offset by the current cache fill level so incremental decode tokens get correct positions.

  • doc_ids (torch.Tensor | None) – Optional per-token document IDs for packed sequences, shape (batch, seq_len). Enables block-diagonal causal attention that isolates documents within packed sequences.
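
The block-diagonal causal attention enabled by doc_ids can be sketched as a boolean mask. This is an illustrative construction, not necessarily how the model builds its mask; `block_diagonal_causal_mask` is a hypothetical helper.

```python
import torch


def block_diagonal_causal_mask(doc_ids):
    """doc_ids: (batch, seq_len) ints -> bool mask (batch, seq_len, seq_len).

    mask[b, i, j] is True when query i may attend to key j: both tokens
    belong to the same document and j is not in the future of i.
    """
    seq_len = doc_ids.shape[-1]
    same_doc = doc_ids[:, :, None] == doc_ids[:, None, :]
    pos = torch.arange(seq_len, device=doc_ids.device)
    causal = pos[:, None] >= pos[None, :]
    return same_doc & causal


# One packed sequence: a 3-token document followed by a 2-token document.
doc_ids = torch.tensor([[0, 0, 0, 1, 1]])
mask = block_diagonal_causal_mask(doc_ids)
print(mask[0].int())
```

Tokens of the second document (rows 3 and 4) cannot see the first document at all. A mask like this can be passed, with an added head dimension (`mask[:, None]`), as the boolean `attn_mask` of `torch.nn.functional.scaled_dot_product_attention`, where True marks positions that take part in attention.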

Returns:

Logits tensor of shape (batch, seq_len, vocab_size).

Return type:

torch.Tensor
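
The position offset described for kv_caches can be illustrated with a short sketch. The helper names, the way the fill level is passed in (`cache_len`), and the RoPE base of 10000 are assumptions; the real model reads the fill level from its KVCache objects.

```python
import torch


def rope_positions(seq_len, cache_len=0):
    """Absolute positions for seq_len new tokens after cache_len cached ones."""
    return torch.arange(cache_len, cache_len + seq_len)


def rope_cos_sin(positions, head_dim, theta=10000.0):
    """Standard RoPE angle tables for the given absolute positions."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(positions.float(), inv_freq)  # (len, head_dim // 2)
    return angles.cos(), angles.sin()


# Prefill: 4 prompt tokens with an empty cache get positions 0..3.
prefill_pos = rope_positions(4)

# Decode: 1 new token with 4 tokens already cached gets position 4, not 0.
decode_pos = rope_positions(1, cache_len=4)

# The decode step's table matches row 4 of a table built for the full sequence.
cos_full, sin_full = rope_cos_sin(rope_positions(5), head_dim=8)
cos_step, sin_step = rope_cos_sin(decode_pos, head_dim=8)
```

Without the offset, every incrementally decoded token would be rotated as if it sat at position 0, and attention scores against the cached keys would be wrong.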