kempnerforge.model.adapter
Vision-to-LLM adapter modules (the “connector”).
The adapter projects vision features (shape (B, num_tokens, feature_dim))
into the LLM embedding space (shape (B, out_tokens, model.dim)). It sits
between the vision encoder and the transformer in VLMWrapper.
Two families:
- Projection adapters keep the token count (out_tokens == num_tokens): mlp_2layer (default, the canonical LLaVA-family 2-layer MLP) and linear (a single nn.Linear, an ablation baseline).
- Pooling adapters reduce the token count by pooling the square patch grid before projecting: avgpool (window-average, the cheapest reducer) and attentional_pool (Molmo2-style per-window multi-head attention with the window mean as query). Pooling is what makes many-frame video fit the sequence budget: a 27×27 SigLIP grid (729 tokens) pools to 81 tokens at a 3×3 window.
Every adapter is a VisionAdapter exposing output_num_tokens(n_in) so the
build path can size the residual stream (and MoT’s positional split) without a
dry-run forward. Adapters register themselves under the adapter registry
category.
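Functions
| | Dispatch to the registered adapter builder. |
| pooled_token_count | Token count out of a window×window pool over a square patch grid. |
Classes
| AttentionalPoolAdapter | Attentional pooling connector (Molmo2 §3.1). |
| AvgPoolAdapter | Average-pool a square patch grid by a window, then project. |
| LinearAdapter | Single nn.Linear from image-feature dim to LLM embedding dim. |
| MLP2LayerAdapter | 2-layer MLP from image-feature dim to LLM embedding dim. |
| VisionAdapter | Base class for vision→LLM adapters (the connector). |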
- kempnerforge.model.adapter.pooled_token_count(num_input_tokens, window, *, require_divisible=False)
Token count out of a window×window pool over a square patch grid.
A vision encoder emits num_input_tokens patch tokens laid out on a square grid × grid map (grid = sqrt(num_input_tokens)). Pooling with a window × window kernel and ceil edges yields ceil(grid/window) ** 2 tokens; edge windows that do not fill the kernel pool only the patches they cover (Molmo2 §A: “the bottom and far-right image patches are pooled with a reduced number of patches”).
Connectors that cannot pool ragged edges (require_divisible=True, e.g. attentional_pool) raise when grid is not divisible by window, so a ragged config is rejected at config/build time rather than failing deterministically in forward at the first step.
This is the single source of truth for the post-pool count: it must equal the pooling adapters’ actual forward output length, because the build path uses it to size MoT’s positional split.
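The counting rule above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the name pooled_token_count_sketch, the ValueError exception type, and the rejection of non-square token counts are all assumptions made for the example.

```python
import math


def pooled_token_count_sketch(num_input_tokens, window, *, require_divisible=False):
    """Hypothetical re-statement of the documented rule: ceil(grid / window) ** 2."""
    grid = math.isqrt(num_input_tokens)
    # Assumption: a non-square token count is treated as an error.
    if grid * grid != num_input_tokens:
        raise ValueError(f"{num_input_tokens} tokens do not form a square grid")
    if window < 1:
        raise ValueError("window must be a positive integer")
    # Connectors that cannot pool ragged edges reject the config up front.
    if require_divisible and grid % window != 0:
        raise ValueError(f"grid {grid} is not divisible by window {window}")
    per_side = -(-grid // window)  # integer ceil(grid / window)
    return per_side * per_side


# The 27x27 grid from the module overview, pooled with a 3x3 window.
tokens_3x3 = pooled_token_count_sketch(729, 3)
# A 2x2 window leaves a ragged edge: ceil(27 / 2) = 14 windows per side.
tokens_2x2 = pooled_token_count_sketch(729, 2)
```

With a 3×3 window the 729-token grid yields 81 tokens, matching the figure quoted in the overview; with a 2×2 window the ragged edge rounds up to 14 windows per side, or 196 tokens.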
- class kempnerforge.model.adapter.VisionAdapter
Bases: Module
Base class for vision→LLM adapters (the connector).
Contract: forward maps (B, N, in_dim) -> (B, M, out_dim) where M == output_num_tokens(N). Projection adapters keep M == N; pooling adapters reduce it. output_num_tokens lets the build path size the residual stream and MoT’s positional split without a dry-run forward, and must agree exactly with the forward output length.
- output_num_tokens(num_input_tokens)
Tokens emitted per image given num_input_tokens patch tokens in.
Identity by default (projection adapters); pooling adapters override.
- forward(x)
- Parameters:
x (torch.Tensor)
- Return type:
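The VisionAdapter contract (forward output length equals output_num_tokens) can be checked mechanically. The sketch below is illustrative only: VisionAdapterSketch, LinearSketch, and check_contract are hypothetical stand-ins, not kempnerforge classes, and the constructor arguments are assumed.

```python
import torch
from torch import nn


class VisionAdapterSketch(nn.Module):
    """Hypothetical stand-in for the base class: identity token count by default."""

    def output_num_tokens(self, num_input_tokens: int) -> int:
        return num_input_tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class LinearSketch(VisionAdapterSketch):
    """Projection adapter: a single linear map that keeps the token count."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)  # (B, N, in_dim) -> (B, N, out_dim)


def check_contract(adapter, num_input_tokens: int, in_dim: int, batch: int = 2) -> bool:
    """Return True when the forward length matches output_num_tokens."""
    x = torch.randn(batch, num_input_tokens, in_dim)
    with torch.no_grad():
        y = adapter(x)
    return y.shape[1] == adapter.output_num_tokens(num_input_tokens)


contract_ok = check_contract(LinearSketch(8, 16), num_input_tokens=9, in_dim=8)
```

A check of this kind is only a test-time aid; the point of output_num_tokens is that the build path can obtain the count without running forward at all.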
- class kempnerforge.model.adapter.MLP2LayerAdapter
Bases: VisionAdapter
2-layer MLP from image-feature dim to LLM embedding dim.
Architecture: Linear(in_dim, hidden) -> activation -> Linear(hidden, out_dim). hidden_dim=None defaults to out_dim. Keeps the token count.
reset_parameters is provided so callers that materialize adapters from meta can re-initialize weights with the standard Linear defaults.
- reset_parameters()
Re-run nn.Linear default init on both projections.
Used after to_empty(device=...) on a meta-device build.
- Return type:
None
- forward(x)
- Parameters:
x (torch.Tensor)
- Return type:
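The meta-device workflow described for MLP2LayerAdapter can be sketched as follows. MLP2LayerSketch is a hypothetical class written for this example; the GELU activation, the layer names, and the constructor signature are assumptions, since the reference above only says "activation".

```python
import torch
from torch import nn


class MLP2LayerSketch(nn.Module):
    """Hypothetical Linear -> activation -> Linear connector."""

    def __init__(self, in_dim, out_dim, hidden_dim=None, device=None):
        super().__init__()
        hidden = out_dim if hidden_dim is None else hidden_dim  # hidden defaults to out_dim
        self.fc1 = nn.Linear(in_dim, hidden, device=device)
        self.act = nn.GELU()  # assumption: the real activation is not stated here
        self.fc2 = nn.Linear(hidden, out_dim, device=device)

    def reset_parameters(self):
        # Re-run the nn.Linear default init on both projections.
        self.fc1.reset_parameters()
        self.fc2.reset_parameters()

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))  # token count is unchanged


# Build on the meta device (no storage allocated), then materialize.
adapter = MLP2LayerSketch(32, 16, device="meta")
adapter = adapter.to_empty(device="cpu")  # allocates uninitialized storage
adapter.reset_parameters()                # overwrite it with a proper init
out = adapter(torch.randn(2, 9, 32))
```

Without the reset_parameters call, the weights after to_empty hold arbitrary memory contents, which is why the method is exposed.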
- class kempnerforge.model.adapter.LinearAdapter
Bases: VisionAdapter
Single nn.Linear from image-feature dim to LLM embedding dim.
No activation, no hidden layer. Keeps the token count. Useful as an ablation baseline against MLP2LayerAdapter.
- forward(x)
- Parameters:
x (torch.Tensor)
- Return type:
- class kempnerforge.model.adapter.AvgPoolAdapter
Bases: VisionAdapter
Average-pool a square patch grid by a window, then project.
(B, N, in_dim) patch tokens (N == grid**2) are averaged over window × window spatial windows (ceil edges; partial edge windows average only the real patches they cover), giving (B, M, in_dim) with M == ceil(grid/window)**2, then a Linear maps in_dim -> out_dim.
The cheapest token-count reducer (LLaVA-NeXT / sibling-repo style). window is overridable per forward call so one connector can pool images (e.g. 2×2) and video frames (3×3) with the same projection weights.
- output_num_tokens(num_input_tokens)
Tokens emitted per image given num_input_tokens patch tokens in.
Identity by default (projection adapters); pooling adapters override.
- forward(x, pool_window=None)
- Parameters:
x (torch.Tensor)
pool_window (int | None)
- Return type:
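The ceil-edge averaging described for AvgPoolAdapter (partial edge windows average only the real patches) can be illustrated with a padded pool divided by a padded mask. This is a sketch under assumptions, not the adapter's code: avgpool_tokens_sketch is a hypothetical helper, tokens are assumed to be laid out row-major on the grid, and the final Linear projection is left out.

```python
import math

import torch
import torch.nn.functional as F


def avgpool_tokens_sketch(x: torch.Tensor, window: int) -> torch.Tensor:
    """(B, grid**2, D) -> (B, ceil(grid / window)**2, D), ignoring padding in the mean."""
    b, n, d = x.shape
    grid = math.isqrt(n)
    if grid * grid != n:
        raise ValueError(f"{n} tokens do not form a square grid")
    pad = (-grid) % window  # zero rows/cols needed to fill the last window
    # Assumption: tokens are stored row-major over the patch grid.
    fmap = x.transpose(1, 2).reshape(b, d, grid, grid)
    mask = torch.ones(1, 1, grid, grid, dtype=x.dtype, device=x.device)
    fmap = F.pad(fmap, (0, pad, 0, pad))  # pad right and bottom edges
    mask = F.pad(mask, (0, pad, 0, pad))
    mean_with_pad = F.avg_pool2d(fmap, window)  # mean that counts zero padding
    real_fraction = F.avg_pool2d(mask, window)  # share of real patches per window
    pooled = mean_with_pad / real_fraction      # mean over real patches only
    return pooled.flatten(2).transpose(1, 2)


# 3x3 grid holding the values 0..8, pooled with a ragged 2x2 window.
demo = avgpool_tokens_sketch(torch.arange(9.0).reshape(1, 9, 1), window=2)
```

In the 3×3 demo the four windows cover {0, 1, 3, 4}, {2, 5}, {6, 7}, and {8}, so the pooled values are 2, 3.5, 6.5, and 8. Because the pooling step itself has no parameters, the same projection weights can follow either a 2×2 or a 3×3 pool.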
- class kempnerforge.model.adapter.AttentionalPoolAdapter
Bases: VisionAdapter
Attentional pooling connector (Molmo2 §3.1).
For each window × window patch window, a multi-head attention layer pools the window’s patches into one vector, using the mean of the window’s patches as the query and the patches themselves as keys/values; the result is projected in_dim -> out_dim. Output length is ceil(grid/window)**2.
window is overridable per forward call (shared params across image 2×2 and video 3×3 pooling, per the paper). v1 requires the grid be divisible by the window (no ragged edge windows); ragged attentional pooling is a follow-up.
- output_num_tokens(num_input_tokens)
Tokens emitted per image given num_input_tokens patch tokens in.
Identity by default (projection adapters); pooling adapters override.
- forward(x, pool_window=None)
- Parameters:
x (torch.Tensor)
pool_window (int | None)
- Return type:
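The per-window attention pooling described for AttentionalPoolAdapter might look roughly like the sketch below. It is an illustration under assumptions, not the adapter's implementation: AttentionalPoolSketch, its constructor arguments, the use of nn.MultiheadAttention, the row-major token layout, and the ValueError on ragged grids are all choices made for the example.

```python
import math

import torch
from torch import nn


class AttentionalPoolSketch(nn.Module):
    """Hypothetical connector: window mean as query, window patches as keys/values."""

    def __init__(self, in_dim: int, out_dim: int, window: int = 2, num_heads: int = 4):
        super().__init__()
        self.window = window
        # The attention weights do not depend on the window size, so one module
        # can serve both 2x2 and 3x3 pooling.
        self.attn = nn.MultiheadAttention(in_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, pool_window=None) -> torch.Tensor:
        w = self.window if pool_window is None else pool_window
        b, n, d = x.shape
        grid = math.isqrt(n)
        if grid * grid != n or grid % w != 0:
            raise ValueError(f"{n} tokens cannot be split into {w}x{w} windows")
        o = grid // w  # windows per side
        # (B, N, D) -> (B * o * o, w * w, D): one row of patches per window.
        windows = (
            x.reshape(b, o, w, o, w, d)
            .permute(0, 1, 3, 2, 4, 5)
            .reshape(b * o * o, w * w, d)
        )
        query = windows.mean(dim=1, keepdim=True)  # (B * o * o, 1, D)
        pooled, _ = self.attn(query, windows, windows, need_weights=False)
        return self.proj(pooled.reshape(b, o * o, d))


torch.manual_seed(0)
pool = AttentionalPoolSketch(in_dim=16, out_dim=32)
x = torch.randn(2, 36, 16)       # 6x6 patch grid
y_2x2 = pool(x)                  # default 2x2 window -> 9 tokens
y_3x3 = pool(x, pool_window=3)   # same weights, 3x3 window -> 4 tokens
```

Rejecting ragged grids in this sketch mirrors the v1 restriction stated above; handling partial windows would require masking the missing keys.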