kempnerforge.model¶
Model architectures for KempnerForge.
- class kempnerforge.model.Transformer[source]¶
Bases: Module
Full transformer model built from ModelConfig.
Embedding → TransformerBlocks → Norm → Output Head
- __init__(config, *, vlm_config=None, num_image_tokens=0)[source]¶
- Parameters:
config (ModelConfig)
vlm_config (VLMConfig | None)
num_image_tokens (int)
- Return type:
None
- init_weights_and_freqs()[source]¶
Initialize weights and RoPE frequencies after meta-device materialization.
Called after model.to_empty(device=...) to fill in parameter values and compute the RoPE frequency table. Safe to call on already-initialized models (skips if freqs are already computed).
- Return type:
None
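The meta-device workflow this method supports can be sketched with a plain torch module standing in for Transformer. This is illustrative only: the normal init used below is a placeholder assumption, not KempnerForge's actual weight-initialization scheme.

```python
import torch
import torch.nn as nn

# Allocate on the meta device: shapes and dtypes only, no real storage.
with torch.device("meta"):
    model = nn.Linear(8, 4)

# Materialize uninitialized storage on a real device, then fill in values,
# analogous to calling init_weights_and_freqs() after model.to_empty(...).
model = model.to_empty(device="cpu")
with torch.no_grad():
    for p in model.parameters():
        nn.init.normal_(p, std=0.02)  # placeholder init, not the real scheme

assert not model.weight.is_meta  # parameters now live on a real device
```

The point of the two-step dance is that the meta allocation is free, so very large models can be sharded or moved before any memory is actually committed.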
- set_moe_step(step, max_steps)[source]¶
Set training step on all MoE routers for adaptive bias scheduling.
- get_moe_aux_loss()[source]¶
Collect auxiliary losses from all MoE layers. Returns 0 if dense.
- get_expert_counts()[source]¶
Collect per-layer expert utilization. Returns {} if dense.
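A dense-vs-MoE aggregation in the spirit of get_moe_aux_loss() can be sketched as follows. The aux_loss attribute is a hypothetical stand-in for whatever the routers actually expose; this is not the KempnerForge implementation.

```python
import torch
import torch.nn as nn

def collect_aux_loss(blocks: nn.ModuleList) -> torch.Tensor:
    """Sum router auxiliary losses across layers; return 0 for a dense model."""
    # aux_loss is an assumed attribute name for illustration only.
    losses = [m.aux_loss for m in blocks if hasattr(m, "aux_loss")]
    if not losses:
        return torch.tensor(0.0)  # dense: no MoE layers contribute
    return torch.stack(losses).sum()
```

Returning a zero tensor (rather than None) when no MoE layers exist lets the training loop add the auxiliary term unconditionally.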
- forward(tokens=None, *, modality=None, kv_caches=None, doc_ids=None)[source]¶
Forward pass.
Exactly one of tokens or modality.inputs_embeds must be provided. Modality-injection routes (prefix_embeds, output_slice, image_features, image_mask, modality_ids) are grouped on the optional ModalityContext arg; see kempnerforge/model/modality.py for the full intra-context invariant table.
- Parameters:
tokens (torch.Tensor | None) – Integer token ids, shape (batch, seq_len).
modality (ModalityContext | None) – Optional ModalityContext bundling pre-embedded inputs, prefix embeds, output slicing, image features, and modality routing tags for VLM architectures. None means the pure text-only forward.
kv_caches (list[KVCache] | None) – Optional list of KVCache (one per layer) for generation. When provided, RoPE positions are offset by the current cache fill level. Cross-arg invariant: kv_caches forbids modality.prefix_embeds, modality.output_slice, modality.image_features, and modality.modality_ids (all training-only).
doc_ids (torch.Tensor | None) – Optional per-token document IDs for packed sequences, shape (batch, seq_len). Enables block-diagonal causal attention that isolates documents within packed sequences.
- Returns:
Logits tensor of shape (batch, out_seq_len, vocab_size), where out_seq_len == seq_len normally, or the sliced length when modality.output_slice is set.
- Return type:
torch.Tensor
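The block-diagonal causal attention that doc_ids enables can be made concrete with a small mask builder. This is an illustrative helper, not the KempnerForge implementation: a query position may attend to a key position only if the key comes no later and carries the same document ID.

```python
import torch

def block_diag_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (batch, seq, seq) mask: True where query may attend to key.

    Combines the usual causal (lower-triangular) constraint with a
    same-document constraint, isolating documents in packed sequences.
    """
    same_doc = doc_ids.unsqueeze(-1) == doc_ids.unsqueeze(-2)  # (b, s, s)
    seq = doc_ids.size(-1)
    causal = torch.tril(
        torch.ones(seq, seq, dtype=torch.bool, device=doc_ids.device)
    )
    return same_doc & causal

# Two documents packed into one 4-token sequence: [doc 0, doc 0, doc 1, doc 1]
doc_ids = torch.tensor([[0, 0, 1, 1]])
mask = block_diag_causal_mask(doc_ids)
assert not mask[0, 2, 0]  # doc 1's first token cannot see doc 0
assert mask[0, 3, 2]      # but doc 1's second token sees its own first token
```

Without the same-document term this reduces to an ordinary causal mask, which is why doc_ids can be omitted for unpacked batches.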
- class kempnerforge.model.TransformerBlock[source]¶
Bases: Module
Single transformer block with pre-norm architecture.
Structure: norm → attention → residual, norm → mlp → residual
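The structure above can be sketched as a minimal pre-norm block using stock torch modules. This is a generic sketch of the pre-norm pattern, not the KempnerForge TransformerBlock (which adds RoPE, GQA, KV-cache, and doc-id masking).

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm block: norm → attention → residual, norm → mlp → residual."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # norm → attn → residual
        x = x + self.mlp(self.norm2(x))                    # norm → mlp → residual
        return x
```

Normalizing before each sublayer (rather than after, as in post-norm) keeps the residual stream unnormalized, which tends to stabilize training of deep stacks.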
- __init__(config, layer_idx)[source]¶
- Parameters:
config (ModelConfig)
layer_idx (int)
- Return type:
None
- forward(x, rope_cos, rope_sin, *, kv_cache=None, doc_ids=None)[source]¶
- Parameters:
x (torch.Tensor)
rope_cos (torch.Tensor)
rope_sin (torch.Tensor)
kv_cache (KVCache | None)
doc_ids (torch.Tensor | None)
- Return type:
torch.Tensor
Modules¶
- Vision-to-LLM adapter modules.
- Multi-head attention with Grouped-Query Attention (GQA) support.
- Cross-attention block for the VLM cross-attention architecture.
- Token embedding and output head for KempnerForge models.
- Autoregressive text generation with KV-cache.
- Activation extraction hooks for mechanistic interpretability.
- Weight initialization strategies for KempnerForge models.
- Feed-forward network implementations for KempnerForge models.
- Modality-injection container (ModalityContext).
- Mixture-of-Experts feed-forward layer for KempnerForge models.
- Mixture-of-Transformers (MoT) operator and block.
- Normalization layers for KempnerForge models.
- Rotary Position Embedding (RoPE) for KempnerForge models.
- MoE router implementations for KempnerForge models.
- Transformer model for KempnerForge.
- Vision encoders for VLM training.
- Vision-language model wrapper.