kempnerforge.config.vision

Vision-encoder configuration.

VisionEncoderConfig selects and parameterizes the vision encoder that the VLMWrapper composes alongside the text backbone and adapter. It is a top-level section in TOML ([vision_encoder]), sibling to [model], [adapter], and [vlm].

Field summary:

  • type selects the encoder by registry key (see registry.register_vision_encoder). Defaults to "random", which is intended for tests; production configs set a real encoder key such as "siglip2" or "clip".

  • path is the HF Hub id or local path passed to the encoder builder. Empty string is accepted for stub encoders ("random").

  • feature_dim is the output feature dimension of the encoder. 0 means “infer from the encoder at build time”.

  • num_tokens is the number of image tokens the encoder produces per image. 0 means “infer at build time”. When > 0, it is cross-checked against model.max_seq_len at config time inside JobConfig.__post_init__.

Classes

VisionEncoderConfig

Configuration for the vision encoder component of a VLM.

class kempnerforge.config.vision.VisionEncoderConfig

Bases: object

Configuration for the vision encoder component of a VLM.

type: str = 'random'
path: str = ''
feature_dim: int = 0
num_tokens: int = 0
__init__(type='random', path='', feature_dim=0, num_tokens=0)
Parameters:
  • type (str)
  • path (str)
  • feature_dim (int)
  • num_tokens (int)
Return type:

None
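
The constructor defaults and the "0 means infer" convention can be sketched as follows. This is a minimal illustration that uses a stand-in dataclass with the same fields. The resolve helper and the inferred value of 512 are hypothetical and only show how a build step might treat 0 as a sentinel; they are not part of the documented API.

```python
from dataclasses import dataclass


# Stand-in with the documented signature:
# __init__(type='random', path='', feature_dim=0, num_tokens=0)
@dataclass
class VisionEncoderConfigSketch:
    type: str = "random"
    path: str = ""
    feature_dim: int = 0
    num_tokens: int = 0


def resolve(configured: int, inferred: int) -> int:
    """Hypothetical helper: keep an explicit value, otherwise use the inferred one."""
    return configured if configured > 0 else inferred


default_cfg = VisionEncoderConfigSketch()  # stub encoder, everything inferred
pinned_cfg = VisionEncoderConfigSketch(type="clip", feature_dim=768)

# 512 stands in for whatever the encoder would report at build time.
default_dim = resolve(default_cfg.feature_dim, inferred=512)
pinned_dim = resolve(pinned_cfg.feature_dim, inferred=512)
```

Here default_dim falls back to the inferred value, while pinned_dim keeps the explicitly configured 768.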