kempnerforge.model.vision

Vision encoders for VLM training.

A vision encoder maps (B, 3, H, W) pixel values to a sequence of patch tokens of shape (B, num_tokens, feature_dim), which the VLM adapter projects into the language-model embedding space.

Encoders register themselves via registry.register_vision_encoder. Currently shipped:

  • random — small deterministic stub for tests and smoke configs. No network access required. Produces reproducible noise for a given seed.

  • siglip2 / clip — thin wrappers around HuggingFace AutoModel.from_pretrained. The HF imports are deferred so the module is importable on machines without the transformers package, and failures are surfaced with a clear message.
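The deferred-import behavior described above can be sketched as follows. This is an illustrative pattern, not the actual kempnerforge implementation; load_hf_encoder is a hypothetical name:

```python
def load_hf_encoder(model_name):
    """Load a HuggingFace vision encoder, deferring the transformers import.

    The import happens inside the function so this module stays importable
    on machines without the transformers package installed.
    """
    try:
        from transformers import AutoModel  # deferred: only needed here
    except ImportError as exc:
        # Surface the failure with a clear, actionable message.
        raise ImportError(
            f"Loading '{model_name}' requires the 'transformers' package; "
            f"install it with `pip install transformers`."
        ) from exc
    return AutoModel.from_pretrained(model_name)
```

Because the import sits inside the loader, configs that only use the random stub never touch transformers at all.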

Classes

RandomVisionEncoder

Deterministic random-token stub.

VisionEncoder

Base class for vision encoders.

class kempnerforge.model.vision.VisionEncoder[source]

Bases: Module

Base class for vision encoders.

Subclasses must set feature_dim and num_tokens before returning from __init__ and implement forward(pixel_values) to produce a (B, num_tokens, feature_dim) tensor.

feature_dim: int
num_tokens: int
forward(pixel_values)[source]
Parameters:

pixel_values (torch.Tensor)

Return type:

torch.Tensor
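A minimal subclass satisfying the contract above might look like this. It subclasses nn.Module directly as a stand-in for VisionEncoder, and the class name and patchify scheme are illustrative, not part of kempnerforge:

```python
import torch
import torch.nn as nn


class PatchProjectionEncoder(nn.Module):  # stand-in for VisionEncoder
    """Toy encoder: split the image into non-overlapping patches and
    project each flattened patch to feature_dim.

    Per the base-class contract, feature_dim and num_tokens are set
    before __init__ returns, and forward() produces a
    (B, num_tokens, feature_dim) tensor.
    """

    def __init__(self, image_size=224, patch=16, feature_dim=768):
        super().__init__()
        self.patch = patch
        self.feature_dim = feature_dim
        self.num_tokens = (image_size // patch) ** 2
        self.proj = nn.Linear(3 * patch * patch, feature_dim)

    def forward(self, pixel_values):
        B = pixel_values.shape[0]
        p = self.patch
        # (B, 3, H, W) -> (B, 3, H/p, W/p, p, p) via non-overlapping patches
        x = pixel_values.unfold(2, p, p).unfold(3, p, p)
        # -> (B, num_tokens, 3*p*p), then project to feature_dim
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_tokens, -1)
        return self.proj(x)  # (B, num_tokens, feature_dim)
```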

class kempnerforge.model.vision.RandomVisionEncoder[source]

Bases: VisionEncoder

Deterministic random-token stub.

The output is computed from a hash of pixel_values.sum(), so the same image produces the same tokens across calls. Because the result does not depend on model weights, the stub works under FSDP2 without sharding a real encoder.

Used in tests and the vlm_debug.toml smoke config.

__init__(num_tokens=16, feature_dim=768, seed=0)[source]
Parameters:
  • num_tokens (int)

  • feature_dim (int)

  • seed (int)

Return type:

None

forward(pixel_values)[source]
Parameters:

pixel_values (torch.Tensor)

Return type:

torch.Tensor
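The determinism scheme described above can be sketched as follows. This is an assumption-laden illustration of the technique (seed a generator from the image contents), not the actual RandomVisionEncoder code; random_tokens is a hypothetical helper:

```python
import torch


def random_tokens(pixel_values, num_tokens=16, feature_dim=768, seed=0):
    """Produce deterministic random tokens for a given image.

    The generator seed is derived from pixel_values.sum() combined with
    the configured seed, so identical inputs yield identical tokens
    across calls, independent of any model weights.
    """
    # Quantize the pixel sum to an int so it hashes stably.
    content = int(pixel_values.sum().item() * 1e6)
    g = torch.Generator().manual_seed(hash((seed, content)) % (2**63))
    B = pixel_values.shape[0]
    return torch.randn(B, num_tokens, feature_dim, generator=g)
```

Note the hash covers the whole batch here for brevity; a per-image variant would hash each (3, H, W) slice separately.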