kempnerforge.config.vlm¶
VLM (vision-language model) configuration.
VLMConfig carries the arch-level knobs of the vision-language
model: which architecture to wire (arch), the fixed text padding
length, and the freeze policy. The vision encoder and adapter are
described by sibling top-level sections (VisionEncoderConfig in
config/vision.py, AdapterConfig in config/adapter.py).
In TOML, [vlm] is a top-level section, parallel to [model],
[vision_encoder], and [adapter]. When [vlm] is absent the
job is a pure text run.
Architecture is a discriminated union on the arch field:
"joint_decoder"image tokens prepended to the text sequence."cross_attention"image K/V flows in via separate cross-attention blocks at a configurable cadence."mot"Mixture-of-Transformers: per-modality Q/K/V/O + per- modality FFN at every layer, single global self-attention.
Each arch gets its own VLMConfig subclass, registered via
registry.register_vlm_config. The TOML loader dispatches on
arch to instantiate the right subclass; programmatic callers use
VLMConfig.for_arch(arch_name, **fields).
FreezeSpec / FreezeStage are consumed by
kempnerforge/training/freeze.py.
Classes

CrossAttentionConfig   Cross-Attention: image K/V flows into separate cross-attention blocks inserted at a configurable cadence.
FreezeSpec             A single freeze directive.
FreezeStage            A freeze directive that applies from start_step onward.
JointDecoderConfig     Joint-Decoder: image tokens prepended to the text sequence.
MoTConfig              Mixture-of-Transformers: per-modality Q/K/V/O projections + per-modality FFN at every layer; single global self-attention mixes all modality streams (Liang et al. 2024, Algorithm 1).
VLMConfig              Base VLM configuration.
- class kempnerforge.config.vlm.FreezeSpec[source]¶
Bases: object
A single freeze directive.
module is an alias (a key in a pattern map such as DEFAULT_MODULE_PATTERNS) or a raw fnmatch pattern matched against fully-qualified parameter names.
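The alias-or-raw-pattern resolution can be sketched with stdlib fnmatch. The alias map contents and helper names below are hypothetical, for illustration only.

```python
import fnmatch

# Hypothetical alias map; the real DEFAULT_MODULE_PATTERNS lives in kempnerforge.
DEFAULT_MODULE_PATTERNS = {
    "vision_encoder": ["vision_encoder.*"],
    "adapter": ["adapter.*"],
}

def resolve_patterns(module: str) -> list[str]:
    # A known alias expands via the pattern map; anything else is treated
    # as a raw fnmatch pattern over fully-qualified parameter names.
    return DEFAULT_MODULE_PATTERNS.get(module, [module])

def match_params(module: str, param_names: list[str]) -> list[str]:
    pats = resolve_patterns(module)
    return [n for n in param_names if any(fnmatch.fnmatch(n, p) for p in pats)]

names = [
    "vision_encoder.blocks.0.attn.qkv.weight",
    "adapter.proj.weight",
    "transformer.layers.3.ffn.w1.weight",
]
match_params("vision_encoder", names)       # alias: matches the encoder param
match_params("transformer.layers.*", names)  # raw fnmatch pattern
```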
- class kempnerforge.config.vlm.FreezeStage[source]¶
Bases: object
A freeze directive that applies from start_step onward.
Used for staged training recipes where the trainable subset changes across training phases. The list of stages on VLMConfig is expected to be in strictly increasing start_step order.
- specs: tuple[FreezeSpec, ...]¶
- __init__(start_step, specs)¶
- Parameters:
start_step (int)
specs (tuple[FreezeSpec, ...])
- Return type:
None
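A staged schedule can be sketched with local stand-ins for the two dataclasses. The FreezeSpec field shape (a boolean second field) is an assumption inferred from the FreezeSpec("mot", True) example later in this section; the stage contents are illustrative.

```python
from dataclasses import dataclass

# Local mimics of the real FreezeSpec / FreezeStage, for illustration.
@dataclass(frozen=True)
class FreezeSpec:
    module: str
    freeze: bool  # assumed field, mirroring FreezeSpec("mot", True)

@dataclass(frozen=True)
class FreezeStage:
    start_step: int
    specs: tuple[FreezeSpec, ...]

schedule = [
    # Stage 1: freeze backbone and encoder, i.e. train the adapter only.
    FreezeStage(0, (FreezeSpec("transformer", True),
                    FreezeSpec("vision_encoder", True))),
    # Stage 2: unfreeze the text backbone at step 10_000.
    FreezeStage(10_000, (FreezeSpec("vision_encoder", True),)),
]

# The list must be in strictly increasing start_step order.
steps = [s.start_step for s in schedule]
assert steps == sorted(steps) and len(set(steps)) == len(steps)
```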
- class kempnerforge.config.vlm.VLMConfig[source]¶
Bases: object
Base VLM configuration.
Subclasses register themselves via @registry.register_vlm_config and override the arch field's default. Use VLMConfig.for_arch(arch_name, **fields) to construct programmatically; the TOML loader dispatches on arch automatically.
Field summary (full per-field docs are picked up from autodoc):
- arch: VLM architecture discriminator. Subclasses set this via field default; direct construction with an arch name not backed by a registered subclass raises.
- max_text_len: fixed text padding length used by VLMCollator. Enforces rank-consistent batches under FSDP2.
- freeze: static freeze specs applied once at build time.
- freeze_schedule: step-boundary freeze transitions.
- module_patterns: map of module alias ("transformer", "vision_encoder", "adapter", plus arch-specific additions) to fnmatch pattern list.
- freeze: list[FreezeSpec]¶
- freeze_schedule: list[FreezeStage]¶
- residual_stream_image_tokens(num_tokens)[source]¶
Number of image tokens this arch places in the residual stream.
Used by JobConfig to validate that model.max_seq_len and train.seq_len are large enough to fit residual_stream_image_tokens + max_text_len along the attention sequence dimension.
Joint-Decoder / MoT: num_tokens (image tokens prepended to text).
Cross-Attention: 0 (residual stream is text-only; image features flow side-channel into the CA blocks).
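The cross-check described above can be sketched as a small validator. The function and argument names are illustrative, not the real JobConfig API.

```python
def check_seq_len(max_seq_len: int, image_tokens_in_stream: int,
                  max_text_len: int) -> None:
    # Mirrors the validation described above: the attention sequence must
    # fit the arch's residual-stream image tokens plus the padded text.
    needed = image_tokens_in_stream + max_text_len
    if max_seq_len < needed:
        raise ValueError(f"max_seq_len={max_seq_len} < required {needed}")

# Joint-Decoder / MoT: image tokens count toward the sequence dimension.
check_seq_len(max_seq_len=4096, image_tokens_in_stream=576, max_text_len=2048)

# Cross-Attention: residual_stream_image_tokens(...) returns 0, so only
# max_text_len is enforced.
check_seq_len(max_seq_len=2048, image_tokens_in_stream=0, max_text_len=2048)
```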
- classmethod for_arch(arch, **kwargs)[source]¶
Resolve arch to a registered subclass and instantiate.
- Raises:
ValueError – arch is not registered.
NotImplementedError – arch is reserved (in _RESERVED_ARCHS); this matches the loader's semantics so the error type is independent of construction site.
- Parameters:
arch (str)
- Return type:
VLMConfig
Example
>>> cfg = VLMConfig.for_arch(
...     "cross_attention",
...     max_text_len=2048,
...     cross_attention_every_n_layers=4,
... )
- class kempnerforge.config.vlm.JointDecoderConfig[source]¶
Bases: VLMConfig
Joint-Decoder: image tokens prepended to the text sequence.
No additional fields beyond VLMConfig. The arch is wired through VLMWrapper + ModalityContext.prefix_embeds + output_slice.
- class kempnerforge.config.vlm.CrossAttentionConfig[source]¶
Bases: VLMConfig
Cross-Attention: image K/V flows into separate cross-attention blocks inserted at a configurable cadence.
The CA-specific module alias "cross_attention" is added to module_patterns so freeze targeting works out of the box.
- residual_stream_image_tokens(num_tokens)[source]¶
Cross-Attention does not extend the residual stream.
Image features flow as K/V into separate CrossAttentionBlocks; the residual itself carries text only, so the seq_len cross-check skips num_tokens and just enforces seq_len >= max_text_len. The num_tokens argument is accepted for signature parity with the base method but ignored.
- resolved_heads(model_n_heads)[source]¶
Resolve zero-defaults against the text backbone’s head count.
Returns (n_heads, n_kv_heads) such that the CrossAttentionBlock constructor never observes 0.
Resolution rule:
n_heads = self.cross_attention_n_heads or model_n_heads
n_kv_heads = self.cross_attention_n_kv_heads or n_heads
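The resolution rule is just a pair of zero-as-inherit fallbacks; a minimal free-function sketch of the same logic:

```python
def resolved_heads(cross_attention_n_heads: int,
                   cross_attention_n_kv_heads: int,
                   model_n_heads: int) -> tuple[int, int]:
    # 0 means "inherit": fall back to the text backbone's head count,
    # and n_kv_heads falls back to the resolved n_heads.
    n_heads = cross_attention_n_heads or model_n_heads
    n_kv_heads = cross_attention_n_kv_heads or n_heads
    return n_heads, n_kv_heads

resolved_heads(0, 0, 32)   # -> (32, 32): both inherited
resolved_heads(16, 0, 32)  # -> (16, 16): kv follows the CA override
resolved_heads(16, 4, 32)  # -> (16, 4): full override
```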
- __init__(arch='cross_attention', max_text_len=512, freeze=<factory>, freeze_schedule=<factory>, module_patterns=<factory>, cross_attention_every_n_layers=4, cross_attention_n_heads=0, cross_attention_n_kv_heads=0)¶
- class kempnerforge.config.vlm.MoTConfig[source]¶
Bases: VLMConfig
Mixture-of-Transformers: per-modality Q/K/V/O projections + per-modality FFN at every layer; single global self-attention mixes all modality streams (Liang et al. 2024, Algorithm 1).
Image tokens are prepended to the text sequence in the residual stream (image-then-text concat order). modality_ids tags every position with its source modality; the operator routes per-token through the per-modality projection / FFN copy for that position.
The MoT-specific module alias "mot" is added to module_patterns so freeze targeting works out of the box: FreezeSpec("mot", True) freezes the per-modality main stack (transformer.layers.*) without touching the embedding / output head / final norms.
- residual_stream_image_tokens(num_tokens)[source]¶
MoT prepends num_tokens image tokens to the text sequence (same residual-stream layout as Joint-Decoder).
- resolved_image_heads(model_n_heads, model_n_kv_heads=0)[source]¶
Resolve zero-defaults against the text backbone’s head counts.
Returns (n_heads, n_kv_heads) such that the operator's per-modality projection sizes are never built from 0.
Resolution rule:
n_heads = self.mot_image_n_heads or model_n_heads
n_kv_heads = self.mot_image_n_kv_heads or model_n_kv_heads or n_heads
v1 note: the global-SDPA design requires equal head counts across modalities; Transformer.__init__ asserts the resolved tuple matches the text backbone (raising on a per-modality override). The fields are present so a future per-modality relaxation can land without a config-shape change.
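The same zero-as-inherit chain, with the extra model_n_kv_heads middle term, as a free-function sketch (the v1 equal-head-count constraint is shown as a plain assert here; the real check lives in Transformer.__init__):

```python
def resolved_image_heads(mot_image_n_heads: int,
                         mot_image_n_kv_heads: int,
                         model_n_heads: int,
                         model_n_kv_heads: int = 0) -> tuple[int, int]:
    # 0 means "inherit from the text backbone"; kv falls back first to the
    # backbone's kv count, then to the resolved n_heads.
    n_heads = mot_image_n_heads or model_n_heads
    n_kv_heads = mot_image_n_kv_heads or model_n_kv_heads or n_heads
    return n_heads, n_kv_heads

resolved_image_heads(0, 0, 32, 8)  # -> (32, 8): defaults inherit the backbone

# v1 constraint sketch: the resolved tuple must equal the text backbone's,
# so any per-modality override would trip this assertion.
assert resolved_image_heads(0, 0, 32, 8) == (32, 8)
```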
- __init__(arch='mot', max_text_len=512, freeze=<factory>, freeze_schedule=<factory>, module_patterns=<factory>, mot_modalities=('image', 'text'), mot_image_n_heads=0, mot_image_n_kv_heads=0, mot_warm_start_from_text=False, mot_warm_start_path='')¶
- Parameters:
- Return type:
None