kempnerforge.model.adapter

Vision-to-LLM adapter modules (the “connector”).

The adapter projects vision features (shape (B, num_tokens, feature_dim)) into the LLM embedding space (shape (B, out_tokens, model.dim)). It sits between the vision encoder and the transformer in VLMWrapper.

Two families:

  • Projection adapters keep the token count (out_tokens == num_tokens): mlp_2layer (default, the canonical LLaVA-family 2-layer MLP) and linear (single nn.Linear, an ablation baseline).

  • Pooling adapters reduce the token count by pooling the square patch grid before projecting: avgpool (window-average, the cheapest reducer) and attentional_pool (Molmo2-style per-window multi-head attention with the window mean as query). Pooling is what makes many-frame video fit the sequence budget: a 27×27 SigLIP grid (729 tokens) pools to 81 tokens at a 3×3 window.

Every adapter is a VisionAdapter exposing output_num_tokens(n_in) so the build path can size the residual stream (and MoT’s positional split) without a dry-run forward. Adapters register themselves under the adapter registry category.
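The sequence-budget point above can be made concrete with a little arithmetic. The sketch below is not part of the library: tokens_per_frame is a hypothetical helper, and the 8192-token vision budget is an assumed figure chosen only for illustration. The 27×27 grid is the SigLIP example from the text.

```python
# Hypothetical helper and budget; only the 27x27 grid comes from the text above.
GRID = 27
VISION_TOKEN_BUDGET = 8192  # assumption, not a library default


def tokens_per_frame(grid: int, window: int) -> int:
    """Tokens left after a window x window pool with ceil edges."""
    side = (grid + window - 1) // window  # integer ceil(grid / window)
    return side * side


for window in (1, 2, 3):
    per_frame = tokens_per_frame(GRID, window)
    max_frames = VISION_TOKEN_BUDGET // per_frame
    print(f"window={window}: {per_frame} tokens/frame, {max_frames} frames fit")
```

Under these assumptions, an unpooled frame costs 729 tokens (11 frames fit), a 2×2 window gives 196 tokens (41 frames), and a 3×3 window gives 81 tokens (101 frames).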

Functions

build_adapter(adapter_config, in_dim, out_dim)

Dispatch to the registered adapter builder.

pooled_token_count(num_input_tokens, window, *)

Token count out of a window×window pool over a square patch grid.

Classes

AttentionalPoolAdapter

Attentional pooling connector (Molmo2 §3.1).

AvgPoolAdapter

Average-pool a square patch grid by a window, then project.

LinearAdapter

Single nn.Linear from image-feature dim to LLM embedding dim.

MLP2LayerAdapter

2-layer MLP from image-feature dim to LLM embedding dim.

VisionAdapter

Base class for vision→LLM adapters (the connector).

kempnerforge.model.adapter.pooled_token_count(num_input_tokens, window, *, require_divisible=False)[source]

Token count out of a window×window pool over a square patch grid.

A vision encoder emits num_input_tokens patch tokens laid out on a square grid × grid map (grid = sqrt(num_input_tokens)). Pooling with a window × window kernel and ceil edges yields ceil(grid/window) ** 2 tokens; edge windows that do not fill the kernel pool only the patches they cover (Molmo2 §A: “the bottom and far-right image patches are pooled with a reduced number of patches”).

For connectors that cannot pool ragged edges (require_divisible=True, e.g. attentional_pool), the function raises when grid is not divisible by window, so a ragged config is rejected at config/build time instead of failing in forward on the first step.

This is the single source of truth for the post-pool count: it must equal the pooling adapters’ actual forward output length, because the build path uses it to size MoT’s positional split.

Parameters:
  • num_input_tokens (int)

  • window (int)

  • require_divisible (bool)

Return type:

int
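A minimal sketch of the counting rule described for pooled_token_count, written from the prose above rather than from the library source. The name pooled_token_count_sketch, the ValueError exception type, and the rejection of non-square token counts are assumptions; the documented function may behave differently in those respects.

```python
import math


def pooled_token_count_sketch(
    num_input_tokens: int, window: int, *, require_divisible: bool = False
) -> int:
    """Token count after a window x window pool over a square patch grid."""
    if window < 1:
        raise ValueError("window must be a positive integer")

    # Recover the grid side; rejecting non-square counts is an assumption here.
    grid = math.isqrt(num_input_tokens)
    if grid * grid != num_input_tokens:
        raise ValueError(f"{num_input_tokens} tokens do not form a square grid")

    # Connectors that cannot pool ragged edges reject the config up front.
    if require_divisible and grid % window != 0:
        raise ValueError(f"grid {grid} is not divisible by window {window}")

    side = -(-grid // window)  # integer ceil(grid / window)
    return side * side


print(pooled_token_count_sketch(729, 3))  # 27x27 grid, 3x3 window
print(pooled_token_count_sketch(729, 2))  # ragged edge: 27 is not divisible by 2
```

With a 27×27 grid, the 3×3 window yields 81 tokens and the 2×2 window yields 196 (14×14, with a ragged last row and column). Passing require_divisible=True with the 2×2 window raises in this sketch.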

class kempnerforge.model.adapter.VisionAdapter[source]

Bases: Module

Base class for vision→LLM adapters (the connector).

Contract: forward maps (B, N, in_dim) -> (B, M, out_dim) where M == output_num_tokens(N). Projection adapters keep M == N; pooling adapters reduce it. output_num_tokens lets the build path size the residual stream and MoT’s positional split without a dry-run forward, and must agree exactly with the forward output length.

output_num_tokens(num_input_tokens)[source]

Tokens emitted per image given num_input_tokens patch tokens in.

Identity by default (projection adapters); pooling adapters override.

Parameters:

num_input_tokens (int)

Return type:

int

forward(x)[source]
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

class kempnerforge.model.adapter.MLP2LayerAdapter[source]

Bases: VisionAdapter

2-layer MLP from image-feature dim to LLM embedding dim.

Architecture: Linear(in_dim, hidden) -> activation -> Linear(hidden, out_dim). hidden_dim=None defaults to out_dim. Keeps the token count.

reset_parameters is provided so that callers who materialize adapters from the meta device can re-initialize the weights with the standard nn.Linear defaults.

__init__(in_dim, out_dim, hidden_dim=None, activation='gelu')[source]
Parameters:
  • in_dim (int)

  • out_dim (int)

  • hidden_dim (int | None)

  • activation (str)

Return type:

None

reset_parameters()[source]

Re-run nn.Linear default init on both projections.

Used after to_empty(device=...) on a meta-device build.

Return type:

None

forward(x)[source]
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

class kempnerforge.model.adapter.LinearAdapter[source]

Bases: VisionAdapter

Single nn.Linear from image-feature dim to LLM embedding dim.

No activation, no hidden layer. Keeps the token count. Useful as an ablation baseline against MLP2LayerAdapter.

__init__(in_dim, out_dim)[source]
Parameters:
  • in_dim (int)

  • out_dim (int)

Return type:

None

reset_parameters()[source]
Return type:

None

forward(x)[source]
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

class kempnerforge.model.adapter.AvgPoolAdapter[source]

Bases: VisionAdapter

Average-pool a square patch grid by a window, then project.

(B, N, in_dim) patch tokens (N == grid**2) are averaged over window × window spatial windows (ceil edges; partial edge windows average only the real patches they cover), giving (B, M, in_dim) with M == ceil(grid/window)**2, then a Linear maps in_dim -> out_dim.

The cheapest token-count reducer (LLaVA-NeXT / sibling-repo style). window is overridable per forward call so one connector can pool images (e.g. 2×2) and video frames (3×3) with the same projection weights.
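The ceil-edge averaging described above can be illustrated without any tensor library. In this sketch each patch is a single float standing in for a feature vector, the final in_dim -> out_dim projection is omitted, and avg_pool_grid is a hypothetical name, not the adapter's implementation.

```python
def avg_pool_grid(grid_values: list[list[float]], window: int) -> list[list[float]]:
    """Average a square grid over window x window blocks with ceil edges.

    Edge blocks that run past the grid average only the patches they cover.
    """
    grid = len(grid_values)
    out_side = -(-grid // window)  # integer ceil(grid / window)
    pooled = []
    for oy in range(out_side):
        row = []
        for ox in range(out_side):
            # Clamp the block to the grid so no padding enters the average.
            ys = range(oy * window, min((oy + 1) * window, grid))
            xs = range(ox * window, min((ox + 1) * window, grid))
            cells = [grid_values[y][x] for y in ys for x in xs]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


grid3 = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
pooled = avg_pool_grid(grid3, 2)
print(pooled)  # [[3.0, 4.5], [7.5, 9.0]]
```

The 3×3 grid pools to 2×2: the top-left block averages four patches, the two edge blocks average two each, and the corner block is a single patch passed through unchanged.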

__init__(in_dim, out_dim, pool_window=2)[source]
Parameters:
  • in_dim (int)

  • out_dim (int)

  • pool_window (int)

Return type:

None

reset_parameters()[source]
Return type:

None

output_num_tokens(num_input_tokens)[source]

Tokens emitted per image given num_input_tokens patch tokens in.

Identity by default in VisionAdapter (projection adapters); this pooling adapter overrides it to return ceil(grid/window)**2.

Parameters:

num_input_tokens (int)

Return type:

int

forward(x, pool_window=None)[source]
Parameters:
  • x (torch.Tensor)

  • pool_window (int | None)

Return type:

torch.Tensor

class kempnerforge.model.adapter.AttentionalPoolAdapter[source]

Bases: VisionAdapter

Attentional pooling connector (Molmo2 §3.1).

For each window × window block of patches, a multi-head attention layer pools the block into one vector, using the mean of the block’s patches as the query and the patches themselves as keys/values; the result is projected in_dim -> out_dim. Output length is ceil(grid/window)**2.

window is overridable per forward call (shared params across image 2×2 and video 3×3 pooling, per the paper). v1 requires the grid be divisible by the window (no ragged edge windows); ragged attentional pooling is a follow-up.
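To show the mean-as-query idea in isolation, here is a deliberately simplified sketch for a single window. It assumes one attention head, no learned query/key/value weights, and no output projection, so it is not the adapter's actual computation; attention_pool_window is a hypothetical name.

```python
import math


def attention_pool_window(patches: list[list[float]]) -> tuple[list[float], list[float]]:
    """Pool one window of patch vectors into a single vector.

    The query is the mean of the patches; keys and values are the patches.
    Returns the pooled vector and the attention weights.
    """
    n, d = len(patches), len(patches[0])
    query = [sum(p[i] for p in patches) / n for i in range(d)]

    # Scaled dot-product scores between the mean query and each patch.
    scores = [sum(q * k for q, k in zip(query, p)) / math.sqrt(d) for p in patches]

    # Numerically stable softmax over the patches in the window.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    pooled = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(d)]
    return pooled, weights


# A 2x2 window flattened to four patches with 2-dim features.
window_patches = [[2.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
pooled_vec, attn = attention_pool_window(window_patches)
print(pooled_vec, attn)
```

Unlike plain averaging, patches that align with the window mean receive larger weights, so the pooled vector leans toward the dominant content of the window.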

__init__(in_dim, out_dim, pool_window=2, pool_heads=16)[source]
Parameters:
  • in_dim (int)

  • out_dim (int)

  • pool_window (int)

  • pool_heads (int)

Return type:

None

reset_parameters()[source]
Return type:

None

output_num_tokens(num_input_tokens)[source]

Tokens emitted per image given num_input_tokens patch tokens in.

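Identity by default in VisionAdapter (projection adapters); this pooling adapter overrides it to return ceil(grid/window)**2, and requires the grid to be divisible by the window.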

Parameters:

num_input_tokens (int)

Return type:

int

forward(x, pool_window=None)[source]
Parameters:
  • x (torch.Tensor)

  • pool_window (int | None)

Return type:

torch.Tensor

kempnerforge.model.adapter.build_adapter(adapter_config, in_dim, out_dim)[source]

Dispatch to the registered adapter builder.

Parameters:
  • adapter_config – AdapterConfig (or compatible object exposing type and extra_kwargs()).

  • in_dim (int) – Source feature dim (the vision encoder’s feature_dim).

  • out_dim (int) – Target embedding dim (the transformer’s dim).

Returns:

A VisionAdapter with signature (B, N, in_dim) -> (B, M, out_dim), where M == adapter.output_num_tokens(N).

Return type:

VisionAdapter
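The dispatch step can be pictured as a name-to-builder lookup. The sketch below is a generic registry pattern, not the library's registry: _ADAPTER_BUILDERS, register_adapter, SketchAdapterConfig, and build_adapter_sketch are all hypothetical names, and the builders return plain dicts instead of modules so the example runs without PyTorch. Only the config's type attribute and extra_kwargs() method are taken from the documentation above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical registry mapping an adapter type name to its builder.
_ADAPTER_BUILDERS: dict[str, Callable[..., Any]] = {}


def register_adapter(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that records a builder under the given type name."""
    def decorator(builder: Callable[..., Any]) -> Callable[..., Any]:
        _ADAPTER_BUILDERS[name] = builder
        return builder
    return decorator


@dataclass
class SketchAdapterConfig:
    """Stand-in for a config exposing `type` and `extra_kwargs()`."""
    type: str = "mlp_2layer"
    kwargs: dict[str, Any] = field(default_factory=dict)

    def extra_kwargs(self) -> dict[str, Any]:
        return dict(self.kwargs)


@register_adapter("linear")
def _build_linear(in_dim: int, out_dim: int) -> dict[str, Any]:
    return {"kind": "linear", "in_dim": in_dim, "out_dim": out_dim}


@register_adapter("avgpool")
def _build_avgpool(in_dim: int, out_dim: int, pool_window: int = 2) -> dict[str, Any]:
    return {"kind": "avgpool", "in_dim": in_dim, "out_dim": out_dim,
            "pool_window": pool_window}


def build_adapter_sketch(adapter_config: Any, in_dim: int, out_dim: int) -> Any:
    """Look up the builder by config type and forward the extra kwargs."""
    try:
        builder = _ADAPTER_BUILDERS[adapter_config.type]
    except KeyError:
        known = ", ".join(sorted(_ADAPTER_BUILDERS))
        raise ValueError(
            f"unknown adapter type {adapter_config.type!r}; known: {known}"
        ) from None
    return builder(in_dim, out_dim, **adapter_config.extra_kwargs())


cfg = SketchAdapterConfig(type="avgpool", kwargs={"pool_window": 3})
adapter = build_adapter_sketch(cfg, in_dim=1152, out_dim=4096)
print(adapter)
```

Keeping adapter-specific options behind extra_kwargs() means the dispatcher needs no knowledge of any one adapter's constructor; the in_dim and out_dim values here are arbitrary illustrative sizes.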