kempnerforge.model

Model architectures for KempnerForge.

class kempnerforge.model.Transformer[source]

Bases: Module

Full transformer model built from ModelConfig.

Embedding → TransformerBlocks → Norm → Output Head

__init__(config)[source]
Parameters:

config (ModelConfig)

Return type:

None

init_weights_and_freqs()[source]

Initialize weights and RoPE frequencies after meta-device materialization.

Call this after model.to_empty(device=...) to fill in parameter values and compute the RoPE frequency table. It is safe to call on an already-initialized model: the call is skipped if the frequencies are already computed.

Return type:

None
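The meta-device workflow described above can be sketched with a toy module. This is only an illustration of the general PyTorch pattern, not the KempnerForge implementation: `TinyModel`, its `freqs` attribute, and the initialization values are hypothetical stand-ins, and only `torch.device("meta")` and `Module.to_empty` are real PyTorch calls.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for a model with a lazily computed table."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.freqs = None  # assumed placeholder for a RoPE-style table

    def init_weights_and_freqs(self) -> None:
        # Skip when the table already exists, so repeated calls are harmless.
        if self.freqs is not None:
            return
        nn.init.normal_(self.proj.weight, std=0.02)
        self.freqs = torch.arange(4, dtype=torch.float32,
                                  device=self.proj.weight.device)


# Build on the meta device: parameters have shapes but no storage.
with torch.device("meta"):
    model = TinyModel()
was_meta = model.proj.weight.is_meta

# Allocate uninitialized storage on the target device, then fill it in.
model.to_empty(device="cpu")
model.init_weights_and_freqs()

# A second call leaves the already-initialized weights untouched.
snapshot = model.proj.weight.detach().clone()
model.init_weights_and_freqs()
unchanged = torch.equal(snapshot, model.proj.weight)
```

Building on the meta device avoids allocating real memory until the target device is known, which is why initialization has to happen as a separate step afterwards.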

set_moe_step(step, max_steps)[source]

Set training step on all MoE routers for adaptive bias scheduling.

Parameters:
  • step

  • max_steps

Return type:

None

get_moe_aux_loss()[source]

Collect auxiliary losses from all MoE layers. Returns 0 if the model is dense (has no MoE layers).

Return type:

torch.Tensor

get_expert_counts()[source]

Collect per-layer expert utilization. Returns {} if the model is dense (has no MoE layers).

Return type:

dict[int, torch.Tensor]

forward(tokens, *, kv_caches=None, doc_ids=None)[source]

Forward pass.

Parameters:
  • tokens (torch.Tensor) – Integer tensor of shape (batch, seq_len).

  • kv_caches (list[KVCache] | None) – Optional list of KVCache (one per layer) for generation. When provided, RoPE positions are offset by the current cache fill level so incremental decode tokens get correct positions.

  • doc_ids (torch.Tensor | None) – Optional per-token document IDs for packed sequences, shape (batch, seq_len). Enables block-diagonal causal attention that isolates documents within packed sequences.

Returns:

Logits tensor of shape (batch, seq_len, vocab_size).

Return type:

torch.Tensor
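The two optional arguments can be illustrated in isolation. The sketch below uses plain Python lists in place of tensors and shows what a block-diagonal causal mask built from doc_ids looks like, and how positions would be offset by the cache fill level during incremental decoding. The helper names `block_causal_mask` and `decode_positions` are hypothetical and are not part of the KempnerForge API; the actual implementation may build these differently.

```python
def block_causal_mask(doc_ids: list[int]) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    A key is visible only if it is not in the future (k <= q) and belongs
    to the same document as the query.
    """
    n = len(doc_ids)
    return [
        [k <= q and doc_ids[k] == doc_ids[q] for k in range(n)]
        for q in range(n)
    ]


def decode_positions(cache_len: int, new_tokens: int) -> list[int]:
    """Positions for new tokens when cache_len tokens are already cached."""
    return list(range(cache_len, cache_len + new_tokens))


# Two documents packed into one sequence of five tokens.
mask = block_causal_mask([0, 0, 1, 1, 1])

# Prefill starts at position 0; a later single-token decode step continues
# from the number of tokens already held in the cache.
prefill_positions = decode_positions(cache_len=0, new_tokens=3)
step_positions = decode_positions(cache_len=7, new_tokens=1)
```

In the mask, the first token of the second document (index 2) attends only to itself, so nothing leaks across the document boundary even though both documents share one packed row.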

class kempnerforge.model.TransformerBlock[source]

Bases: Module

Single transformer block with pre-norm architecture.

Structure: norm → attention → residual, norm → mlp → residual
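The pre-norm structure above amounts to two residual updates, each normalizing its input before the sublayer runs. A minimal sketch on plain Python lists follows; `pre_norm_block`, the toy `rms_norm`, and the stand-in sublayers are hypothetical and only illustrate the data flow, not the actual normalization or attention used by TransformerBlock.

```python
from typing import Callable

Vector = list[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Elementwise sum, used for the residual connections."""
    return [x + y for x, y in zip(a, b)]


def pre_norm_block(x: Vector, norm1: Sublayer, attention: Sublayer,
                   norm2: Sublayer, mlp: Sublayer) -> Vector:
    # norm → attention → residual
    x = add(x, attention(norm1(x)))
    # norm → mlp → residual
    x = add(x, mlp(norm2(x)))
    return x


def rms_norm(x: Vector) -> Vector:
    """Toy normalization (an assumption; the real norm layer may differ)."""
    rms = (sum(v * v for v in x) / len(x)) ** 0.5
    return [v / (rms + 1e-6) for v in x]


def zero(x: Vector) -> Vector:
    """Stand-in sublayer that contributes nothing."""
    return [0.0] * len(x)


def identity(x: Vector) -> Vector:
    """Stand-in sublayer that passes its (normalized) input through."""
    return x


# With sublayers that output zero, the residual path returns the input as is.
passthrough = pre_norm_block([3.0, 4.0], rms_norm, zero, rms_norm, zero)

# With an identity "attention", the normalized input is added to the raw one.
shifted = pre_norm_block([3.0, 4.0], rms_norm, identity, rms_norm, zero)
```

Because normalization is applied only on the sublayer branch, the residual stream itself is never rescaled, which is the defining property of a pre-norm block.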

__init__(config, layer_idx)[source]
Parameters:
  • config

  • layer_idx

Return type:

None

forward(x, rope_cos, rope_sin, *, kv_cache=None, doc_ids=None)[source]
Parameters:
  • x

  • rope_cos

  • rope_sin

  • kv_cache

  • doc_ids

Return type:

torch.Tensor

Modules

attention

Multi-head attention with Grouped-Query Attention (GQA) support.

embedding

Token embedding and output head for KempnerForge models.

generate

Autoregressive text generation with KV-cache.

hooks

Activation extraction hooks for mechanistic interpretability.

init

Weight initialization strategies for KempnerForge models.

mlp

Feed-forward network implementations for KempnerForge models.

moe

Mixture-of-Experts feed-forward layer for KempnerForge models.

norm

Normalization layers for KempnerForge models.

position

Rotary Position Embedding (RoPE) for KempnerForge models.

router

MoE router implementations for KempnerForge models.

transformer

Transformer model for KempnerForge.