kempnerforge.model

Model architectures for KempnerForge.

class kempnerforge.model.Transformer[source]

Bases: Module

Full transformer model built from ModelConfig.

Embedding → TransformerBlocks → Norm → Output Head

__init__(config, *, vlm_config=None, num_image_tokens=0)[source]

Parameters:
  • config (ModelConfig) – Model configuration used to build the embedding, transformer blocks, norm, and output head.

  • vlm_config – Optional vision-language configuration; None builds a text-only model.

  • num_image_tokens (int) – Number of image tokens reserved for VLM inputs. Defaults to 0.

Return type:

None

init_weights_and_freqs()[source]

Initialize weights and RoPE frequencies after meta-device materialization.

Called after model.to_empty(device=...) to fill in parameter values and compute the RoPE frequency table. Safe to call on already-initialized models (it is a no-op if the frequencies are already computed).

Return type:

None
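Example (a minimal sketch: the ModelConfig construction and the target device are placeholders, but torch.device("meta") and Module.to_empty are standard PyTorch):

    import torch
    from kempnerforge.model import Transformer

    config = ModelConfig(...)  # hypothetical: fill in real ModelConfig fields

    # Build on the meta device: parameters get shapes but no storage,
    # so even very large models are cheap to construct.
    with torch.device("meta"):
        model = Transformer(config)

    # Materialize empty storage on a real device, then initialize
    # weights and the RoPE frequency table.
    model = model.to_empty(device="cuda")
    model.init_weights_and_freqs()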

set_moe_step(step, max_steps)[source]

Set the training step on all MoE routers for adaptive bias scheduling.

Parameters:
  • step (int) – Current training step.

  • max_steps (int) – Total number of training steps; used to scale the adaptive bias schedule.

Return type:

None
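Example (a sketch of keeping routers in sync with the training loop; max_steps and the dummy token batch are placeholders):

    import torch

    tokens = torch.randint(0, 32_000, (2, 128))  # assumed vocab size
    max_steps = 10_000
    for step in range(max_steps):
        model.set_moe_step(step, max_steps)  # update router bias schedules
        logits = model(tokens)               # then the usual loss/backward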

get_moe_aux_loss()[source]

Collect auxiliary losses from all MoE layers. Returns a zero tensor if the model is dense.

Return type:

torch.Tensor
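Example (a sketch of folding the auxiliary loss into the training loss; the 0.01 weight and the shifted next-token targets are illustrative assumptions, not library defaults):

    import torch
    import torch.nn.functional as F

    tokens = torch.randint(0, 32_000, (2, 128))      # assumed vocab size
    targets = torch.roll(tokens, shifts=-1, dims=1)  # next-token targets (sketch)

    logits = model(tokens)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    loss = ce + 0.01 * model.get_moe_aux_loss()      # assumed aux-loss weight
    loss.backward()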

get_expert_counts()[source]

Collect per-layer expert utilization counts. Returns an empty dict if the model is dense.

Return type:

dict[int, torch.Tensor]
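Example (a logging sketch; it assumes each tensor holds per-expert token counts for that layer, which is an assumption about the return format):

    counts = model.get_expert_counts()  # {} for dense models
    for layer_idx, layer_counts in sorted(counts.items()):
        load = layer_counts.float() / layer_counts.sum().clamp(min=1)
        print(f"layer {layer_idx}: expert load {load.tolist()}")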

forward(tokens=None, *, modality=None, kv_caches=None, doc_ids=None)[source]

Forward pass.

Exactly one of tokens or modality.inputs_embeds must be provided. The modality-injection routes (prefix_embeds, output_slice, image_features, image_mask, modality_ids) are grouped on the optional ModalityContext argument; see kempnerforge/model/modality.py for the full intra-context invariant table.

Parameters:
  • tokens (torch.Tensor | None) – Integer token ids, shape (batch, seq_len).

  • modality (ModalityContext | None) – Optional ModalityContext bundling pre-embedded inputs, prefix embeds, output slicing, image features, and modality routing tags for VLM architectures. None selects the plain text-only forward path.

  • kv_caches (list[KVCache] | None) – Optional list of KVCache (one per layer) for generation. When provided, RoPE positions are offset by the current cache fill level. Cross-arg invariant: kv_caches forbids modality.prefix_embeds, modality.output_slice, modality.image_features, and modality.modality_ids (all training-only).

  • doc_ids (torch.Tensor | None) – Optional per-token document IDs for packed sequences, shape (batch, seq_len). Enables block-diagonal causal attention that isolates documents within packed sequences.

Returns:

Logits tensor of shape (batch, out_seq_len, vocab_size), where out_seq_len equals seq_len normally, or the sliced length when modality.output_slice is set.

Return type:

torch.Tensor
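Example (two common call patterns; the vocab size and shapes are assumptions, and model is built as in the __init__ sketch above). For generation with kv_caches, see the generate module below.

    import torch

    # 1) Text-only training forward: token ids in, logits out.
    tokens = torch.randint(0, 32_000, (2, 128))  # assumed vocab size
    logits = model(tokens)                       # (2, 128, vocab_size)

    # 2) Packed sequences: doc_ids enable block-diagonal causal attention,
    #    so positions never attend across document boundaries.
    doc_ids = torch.tensor([[0] * 64 + [1] * 64,  # row 0 packs two docs
                            [0] * 128])           # row 1 is a single doc
    packed_logits = model(tokens, doc_ids=doc_ids)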

class kempnerforge.model.TransformerBlock[source]

Bases: Module

Single transformer block with pre-norm architecture.

Structure: norm → attention → residual, norm → mlp → residual
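A self-contained schematic of the pre-norm pattern (plain nn.LayerNorm, nn.MultiheadAttention, and a GELU MLP stand in for the library's own norm, GQA attention, and MLP modules; causal masking and RoPE are omitted):

    import torch.nn as nn

    class PreNormBlockSketch(nn.Module):
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                     nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            # Normalize before each sub-layer, add the residual after.
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.norm2(x))
            return x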

__init__(config, layer_idx)[source]
Parameters:
  • config (ModelConfig) – Model configuration.

  • layer_idx (int) – Index of this block within the transformer stack.

Return type:

None

forward(x, rope_cos, rope_sin, *, kv_cache=None, doc_ids=None)[source]
Parameters:
  • x (torch.Tensor) – Input hidden states.

  • rope_cos – Precomputed RoPE cosine table.

  • rope_sin – Precomputed RoPE sine table.

  • kv_cache – Optional per-layer KVCache for generation.

  • doc_ids – Optional per-token document IDs for block-diagonal attention within packed sequences.

Return type:

torch.Tensor

Modules

adapter

Vision-to-LLM adapter modules.

attention

Multi-head attention with Grouped-Query Attention (GQA) support.

cross_attention

Cross-attention block for VLM Cross-Attention architecture.

embedding

Token embedding and output head for KempnerForge models.

generate

Autoregressive text generation with KV-cache.

hooks

Activation extraction hooks for mechanistic interpretability.

init

Weight initialization strategies for KempnerForge models.

mlp

Feed-forward network implementations for KempnerForge models.

modality

Modality-injection container for Transformer.forward.

moe

Mixture-of-Experts feed-forward layer for KempnerForge models.

mot

Mixture-of-Transformers (MoT) operator and block.

norm

Normalization layers for KempnerForge models.

position

Rotary Position Embedding (RoPE) for KempnerForge models.

router

MoE router implementations for KempnerForge models.

transformer

Transformer model for KempnerForge.

vision

Vision encoders for VLM training.

vlm

Vision-language model wrapper.