kempnerforge.model.vision

Vision encoders for VLM training.

A vision encoder maps (B, 3, H, W) pixel values to a sequence of patch tokens of shape (B, num_tokens, feature_dim), which the VLM adapter projects into the language-model embedding space.

Encoders register themselves via registry.register_vision_encoder. Currently shipped:

  • random — small deterministic stub for tests and smoke configs. No network access required. Produces reproducible noise for a given seed.

  • siglip2 / clip — thin wrappers around HuggingFace AutoModel.from_pretrained. The HF imports are deferred so the module is importable on machines without the transformers package, and failures are surfaced with a clear message.
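The deferred-import behavior described above can be sketched as follows. This is an illustrative pattern, not the actual kempnerforge implementation; load_hf_encoder is a hypothetical name:

```python
def load_hf_encoder(model_name):
    """Load a HuggingFace vision encoder, deferring the transformers import.

    The import happens inside the function so this module stays importable
    on machines without the transformers package installed.
    """
    try:
        from transformers import AutoModel  # deferred: only needed here
    except ImportError as exc:
        # Surface the failure with a clear, actionable message.
        raise ImportError(
            f"Loading '{model_name}' requires the 'transformers' package; "
            f"install it with `pip install transformers`."
        ) from exc
    return AutoModel.from_pretrained(model_name)
```

Because the import sits inside the loader, configs that only use the random stub never touch transformers at all.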

Classes

RandomVisionEncoder

Deterministic random-token stub.

VisionEncoder

Base class for vision encoders.

class kempnerforge.model.vision.VisionEncoder[source]

Bases: Module

Base class for vision encoders.

Subclasses must set feature_dim and num_tokens before returning from __init__ and implement forward(pixel_values) to produce a (B, num_tokens, feature_dim) tensor.

feature_dim: int
num_tokens: int
forward(pixel_values)[source]
Parameters:

pixel_values (torch.Tensor)

Return type:

torch.Tensor
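A minimal subclass satisfying the contract above might look like this. It subclasses nn.Module directly as a stand-in for VisionEncoder, and the class name and patchify scheme are illustrative, not part of kempnerforge:

```python
import torch
import torch.nn as nn


class PatchProjectionEncoder(nn.Module):  # stand-in for VisionEncoder
    """Toy encoder: split the image into non-overlapping patches and
    project each flattened patch to feature_dim.

    Per the base-class contract, feature_dim and num_tokens are set
    before __init__ returns, and forward() produces a
    (B, num_tokens, feature_dim) tensor.
    """

    def __init__(self, image_size=224, patch=16, feature_dim=768):
        super().__init__()
        self.patch = patch
        self.feature_dim = feature_dim
        self.num_tokens = (image_size // patch) ** 2
        self.proj = nn.Linear(3 * patch * patch, feature_dim)

    def forward(self, pixel_values):
        B = pixel_values.shape[0]
        p = self.patch
        # (B, 3, H, W) -> (B, 3, H/p, W/p, p, p) via non-overlapping patches
        x = pixel_values.unfold(2, p, p).unfold(3, p, p)
        # -> (B, num_tokens, 3*p*p), then project to feature_dim
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_tokens, -1)
        return self.proj(x)  # (B, num_tokens, feature_dim)
```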

class kempnerforge.model.vision.RandomVisionEncoder[source]

Bases: VisionEncoder

Deterministic random-token stub.

The output is computed from a hash of pixel_values.sum(), so the same image produces the same tokens across calls. Because the result does not depend on model weights, the stub works under FSDP2 without sharding a real encoder.

Used in tests and the vlm_debug.toml smoke config.

__init__(num_tokens=16, feature_dim=768, seed=0)[source]
Parameters:
  • num_tokens (int)

  • feature_dim (int)

  • seed (int)

Return type:

None

forward(pixel_values)[source]
Parameters:

pixel_values (torch.Tensor)

Return type:

torch.Tensor
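The determinism scheme described above can be sketched as follows. This is an assumption-laden illustration of the technique (seed a generator from the image contents), not the actual RandomVisionEncoder code; random_tokens is a hypothetical helper:

```python
import torch


def random_tokens(pixel_values, num_tokens=16, feature_dim=768, seed=0):
    """Produce deterministic random tokens for a given image.

    The generator seed is derived from pixel_values.sum() combined with
    the configured seed, so identical inputs yield identical tokens
    across calls, independent of any model weights.
    """
    # Quantize the pixel sum to an int so it hashes stably.
    content = int(pixel_values.sum().item() * 1e6)
    g = torch.Generator().manual_seed(hash((seed, content)) % (2**63))
    B = pixel_values.shape[0]
    return torch.randn(B, num_tokens, feature_dim, generator=g)
```

Note the hash covers the whole batch here for brevity; a per-image variant would hash each (3, H, W) slice separately.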