kempnerforge.config.model
Model architecture configuration.
Classes
- NormType
- Activation
- ModelConfig: Architecture hyperparameters for a transformer model.
- class kempnerforge.config.model.NormType
Bases: StrEnum
- rmsnorm = 'rmsnorm'
- layernorm = 'layernorm'
- __new__(value)
- class kempnerforge.config.model.Activation
Bases: StrEnum
- silu = 'silu'
- gelu = 'gelu'
- relu = 'relu'
- __new__(value)
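Because both enums derive from StrEnum, their members behave like plain strings, which is why a value such as 'rmsnorm' in a config file can be turned into a member and compared directly. The sketch below re-declares the two enums locally so it runs without kempnerforge installed; the fallback StrEnum class for Python versions before 3.11 is an assumption added for portability, not part of the library.

```python
try:
    from enum import StrEnum  # available in Python 3.11+
except ImportError:  # fallback for older interpreters (assumption, not library code)
    from enum import Enum

    class StrEnum(str, Enum):
        def __str__(self) -> str:
            return str(self.value)


# Local mirrors of the documented enums, declared here only for illustration.
class NormType(StrEnum):
    rmsnorm = "rmsnorm"
    layernorm = "layernorm"


class Activation(StrEnum):
    silu = "silu"
    gelu = "gelu"
    relu = "relu"


# Look a member up by its string value, e.g. a value read from a config file.
norm = NormType("rmsnorm")
act = Activation("gelu")

# Members compare equal to their string values.
norm_matches_string = norm == "rmsnorm"

# An unknown value raises ValueError instead of being silently accepted.
try:
    Activation("swish")
    rejected_unknown = False
except ValueError:
    rejected_unknown = True
```

Looking a value up through the enum constructor catches a misspelled option at configuration time.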
- class kempnerforge.config.model.ModelConfig
Bases: object
Architecture hyperparameters for a transformer model.
- activation: Activation = 'silu'
FFN hidden dimension, rounded to the nearest multiple of 256 for hardware efficiency.
Per-expert FFN hidden dim = computed_ffn_hidden_dim * moe_expert_ffn_multiplier, rounded to a multiple of 16 for tensor-core alignment.
With the default multiplier of 1.0 this equals computed_ffn_hidden_dim (no behavior change); set it to 0.5 for fine-grained experts so that top-2 routing matches the dense FFN’s activated FLOPs (2 * F/2 = F).
Applies to routed and shared experts wherever they are built (build_moe and MoMa’s ExpertChoiceMoE).
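The per-expert sizing rule can be checked with a little arithmetic. The sketch below is not the library's code: the helper names are hypothetical, the rounding direction (nearest) is an assumption because the text only says "rounded to a multiple of 16", and the dense width of 11008 is an illustrative value chosen because it is already a multiple of 256.

```python
def round_to_multiple(value: float, multiple: int) -> int:
    """Round value to the nearest positive multiple of `multiple` (assumed direction)."""
    return max(multiple, int(value / multiple + 0.5) * multiple)


def per_expert_ffn_hidden_dim(
    computed_ffn_hidden_dim: int,
    moe_expert_ffn_multiplier: float = 1.0,
) -> int:
    """Hypothetical helper: scale the dense FFN width, then align it to 16."""
    scaled = computed_ffn_hidden_dim * moe_expert_ffn_multiplier
    return round_to_multiple(scaled, 16)


dense_dim = 11008  # illustrative dense FFN width, a multiple of 256

# Multiplier 1.0: a multiple of 256 is also a multiple of 16, so nothing changes.
full_expert_dim = per_expert_ffn_hidden_dim(dense_dim)

# Multiplier 0.5: each expert is half as wide as the dense FFN.
half_expert_dim = per_expert_ffn_hidden_dim(dense_dim, 0.5)

# With top-2 routing, two half-width experts activate the same width as the dense FFN.
activated_width_top2 = 2 * half_expert_dim
```

Here the half-width experts come out at 5504, so two active experts cover 11008 hidden units per token, the same as the dense FFN.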
- property num_params_estimate: int
Rough total parameter count estimate (excluding embedding if tied).
For MoE models, counts all expert parameters (total, not active).
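To give a sense of what such an estimate involves, here is a minimal sketch of a parameter count for a Llama-style decoder. It is not the formula used by num_params_estimate, which this page does not show. The assumptions are: a gated FFN with three weight matrices, two norm weight vectors per layer plus a final norm, no bias terms, n_kv_heads falling back to n_heads when it is None, and one linear router per MoE layer. The function name and the FFN width of 11008 are illustrative.

```python
from typing import Optional


def estimate_params(
    dim: int,
    n_layers: int,
    n_heads: int,
    ffn_hidden_dim: int,
    vocab_size: int,
    n_kv_heads: Optional[int] = None,
    tie_embeddings: bool = False,
    num_experts: int = 0,
) -> int:
    """Rough total parameter count under the assumptions stated above."""
    kv_heads = n_heads if n_kv_heads is None else n_kv_heads  # assumed fallback
    kv_dim = kv_heads * (dim // n_heads)

    attn = 2 * dim * dim + 2 * dim * kv_dim  # q and o projections, k and v projections
    ffn = 3 * dim * ffn_hidden_dim           # gated FFN: gate, up, down (assumption)
    if num_experts > 0:
        # Count every expert (total, not active) plus a linear router.
        ffn = num_experts * ffn + dim * num_experts
    norms = 2 * dim                          # two norm weight vectors per layer

    embed = vocab_size * dim
    output = 0 if tie_embeddings else vocab_size * dim  # tied weights counted once
    final_norm = dim
    return n_layers * (attn + ffn + norms) + embed + output + final_norm


# Defaults from the constructor signature, with an illustrative FFN width.
dense_total = estimate_params(4096, 32, 32, 11008, 32000)
tied_total = estimate_params(4096, 32, 32, 11008, 32000, tie_embeddings=True)
moe_total = estimate_params(4096, 32, 32, 11008, 32000, num_experts=8)
```

Under these assumptions the dense configuration comes to 6,738,415,616 parameters, and tying the embeddings removes one vocab_size * dim matrix from the total.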
- __init__(dim=4096, n_layers=32, n_heads=32, n_kv_heads=None, vocab_size=32000, ffn_dim_multiplier=1.0, ffn_hidden_dim=None, norm_type=NormType.rmsnorm, norm_eps=1e-05, activation=Activation.silu, max_seq_len=2048, rope_theta=10000.0, tie_embeddings=False, qk_norm=False, init_std=0.02, model_type='transformer', sdpa_backend='auto', num_experts=0, moe_top_k=2, moe_frequency=1, moe_router='softmax_topk', moe_shared_experts=0, moe_aux_loss_weight=0.01, moe_capacity_factor=0.0, moe_sequence_aux_loss_weight=0.0, moe_gradient_scale=False, moe_bias_schedule='constant', moe_packed_experts=False, moe_expert_ffn_multiplier=1.0, moe_router_z_loss_weight=0.0)
- Parameters:
dim (int)
n_layers (int)
n_heads (int)
n_kv_heads (int | None)
vocab_size (int)
ffn_dim_multiplier (float)
ffn_hidden_dim (int | None)
norm_type (NormType)
norm_eps (float)
activation (Activation)
max_seq_len (int)
rope_theta (float)
tie_embeddings (bool)
qk_norm (bool)
init_std (float)
model_type (str)
sdpa_backend (str)
num_experts (int)
moe_top_k (int)
moe_frequency (int)
moe_router (str)
moe_shared_experts (int)
moe_aux_loss_weight (float)
moe_capacity_factor (float)
moe_sequence_aux_loss_weight (float)
moe_gradient_scale (bool)
moe_bias_schedule (str)
moe_packed_experts (bool)
moe_expert_ffn_multiplier (float)
moe_router_z_loss_weight (float)
- Return type:
None
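Every parameter has a default, so a configuration is normally built by overriding only the fields that differ. The sketch below illustrates that pattern with a hypothetical stand-in, ModelConfigSketch, which copies a handful of field names and defaults from the signature above so that the example runs without kempnerforge installed. Treating the real class as a dataclass is an assumption, and the from_overrides helper is hypothetical.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class ModelConfigSketch:
    """Hypothetical stand-in with a subset of ModelConfig's fields and defaults."""
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = 32000
    num_experts: int = 0
    moe_top_k: int = 2
    moe_expert_ffn_multiplier: float = 1.0


def from_overrides(overrides: dict) -> ModelConfigSketch:
    """Build a config from a dict, rejecting keys that are not known fields."""
    known = {f.name for f in fields(ModelConfigSketch)}
    unknown = set(overrides) - known
    if unknown:
        raise TypeError(f"unknown config keys: {sorted(unknown)}")
    return ModelConfigSketch(**overrides)


# All defaults: a dense model.
dense = ModelConfigSketch()

# Derive an MoE variant without mutating the dense config.
moe = replace(dense, num_experts=8, moe_expert_ffn_multiplier=0.5)

# Build a smaller model from a plain dict, e.g. one parsed from a config file.
small = from_overrides({"dim": 1024, "n_layers": 12, "n_heads": 16})
```

Rejecting unknown keys up front means a misspelled field name raises an error instead of silently leaving the default in place.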