kempnerforge.model.vision¶
Vision encoders for VLM training.
A vision encoder turns (B, 3, H, W) pixel values into a bag of
(B, num_tokens, feature_dim) patch tokens that the VLM adapter maps
into the language-model embedding space.
Encoders register themselves via registry.register_vision_encoder.
Currently shipped:
- random — small deterministic stub for tests and smoke configs. No network access required. Produces reproducible noise for a given seed.
- siglip2 / clip — thin wrappers around Hugging Face AutoModel.from_pretrained. The HF imports are deferred so the module is importable on machines without the transformers package, and failures are surfaced with a clear message.
Classes
- RandomVisionEncoder — Deterministic random-token stub.
- VisionEncoder — Base class for vision encoders.
- class kempnerforge.model.vision.VisionEncoder[source]¶
Bases: Module

Base class for vision encoders.

Subclasses must set feature_dim and num_tokens before returning from __init__, and implement forward(pixel_values) to produce a (B, num_tokens, feature_dim) tensor.

- forward(pixel_values)[source]¶
- Parameters:
pixel_values (torch.Tensor)
- Return type:
torch.Tensor
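To make the subclass contract concrete, here is a minimal sketch of an encoder satisfying it. The base class and the FlattenPatchEncoder subclass below are illustrative stand-ins (the real VisionEncoder lives in kempnerforge.model.vision and may differ in detail): feature_dim and num_tokens are set before __init__ returns, and forward maps (B, 3, H, W) to (B, num_tokens, feature_dim).

```python
import torch
from torch import nn


class VisionEncoder(nn.Module):
    """Sketch of the documented base-class contract (not the shipped class)."""

    feature_dim: int
    num_tokens: int

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class FlattenPatchEncoder(VisionEncoder):
    """Toy subclass: flatten non-overlapping patches into tokens."""

    def __init__(self, patch_size: int = 16, image_size: int = 224) -> None:
        super().__init__()
        self.patch_size = patch_size
        # Contract: both attributes are set before __init__ returns.
        self.feature_dim = 3 * patch_size * patch_size
        self.num_tokens = (image_size // patch_size) ** 2

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, 3*p*p, num_tokens) via unfold, then swap dims.
        patches = nn.functional.unfold(
            pixel_values, kernel_size=self.patch_size, stride=self.patch_size
        )
        return patches.transpose(1, 2)  # (B, num_tokens, feature_dim)


enc = FlattenPatchEncoder()
out = enc(torch.zeros(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 196, 768])
```

A 224-pixel image with 16-pixel patches yields 14 × 14 = 196 tokens of dimension 3 · 16 · 16 = 768, which is the shape the VLM adapter consumes.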
- class kempnerforge.model.vision.RandomVisionEncoder[source]¶
Bases: VisionEncoder

Deterministic random-token stub.

The output is computed from a hash of pixel_values.sum(), so the same image produces the same tokens across calls; it is independent of model weights, so it works under FSDP2 without sharding a real encoder.

Used in tests and the vlm_debug.toml smoke config.

- forward(pixel_values)[source]¶
- Parameters:
pixel_values (torch.Tensor)
- Return type:
torch.Tensor
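The hash-seeding behavior described above can be sketched as follows. This is an illustrative re-creation, not the shipped implementation: the ToyRandomVisionEncoder below derives a seed from pixel_values.sum(), so identical images yield identical tokens and no trainable parameters exist to shard.

```python
import torch
from torch import nn


class ToyRandomVisionEncoder(nn.Module):
    """Illustrative stub: deterministic noise tokens, no trainable weights."""

    def __init__(self, num_tokens: int = 16, feature_dim: int = 32) -> None:
        super().__init__()
        self.num_tokens = num_tokens
        self.feature_dim = feature_dim

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Derive a seed from the image contents; same pixels -> same seed.
        seed = hash(round(pixel_values.sum().item(), 6)) % (2**31)
        gen = torch.Generator().manual_seed(seed)
        batch = pixel_values.shape[0]
        # Fresh generator each call, so repeated calls replay the same noise.
        return torch.randn(
            batch, self.num_tokens, self.feature_dim, generator=gen
        )


enc = ToyRandomVisionEncoder()
img = torch.ones(1, 3, 8, 8)
print(torch.equal(enc(img), enc(img)))  # True
```

Because the tokens depend only on the input hash, the stub behaves identically on every rank, which is why no real encoder needs to be wrapped or sharded under FSDP2.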