kempnerforge.config.vision

Vision-encoder configuration.

VisionEncoderConfig selects and parameterizes the vision encoder that the VLMWrapper composes alongside the text backbone and adapter. It is a top-level section in TOML ([vision_encoder]), sibling to [model], [adapter], and [vlm].

Field summary:

type selects the encoder by registry key (see registry.register_vision_encoder). Defaults to "random" for tests; production configs set a real encoder key such as "siglip2" or "clip".
path is the HF Hub id or local path passed to the encoder builder. An empty string is accepted for stub encoders ("random").
feature_dim is the output feature dimension of the encoder. 0 means “infer from the encoder at build time”.
num_tokens is the number of image tokens the encoder produces per image. 0 means “infer at build time”. When > 0, it is cross-checked against model.max_seq_len at config time inside JobConfig.__post_init__.
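The config-time cross-check on num_tokens could look roughly like the sketch below. VisionEncoderConfigSketch is an illustrative stand-in that mirrors the documented fields and defaults, and check_num_tokens is a hypothetical helper. The exact rule enforced by JobConfig.__post_init__ is not given on this page, so the comparison (num_tokens must not exceed max_seq_len) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VisionEncoderConfigSketch:
    """Illustrative stand-in mirroring the documented fields and defaults."""

    type: str = "random"
    path: str = ""
    feature_dim: int = 0
    num_tokens: int = 0


def check_num_tokens(vision: VisionEncoderConfigSketch, max_seq_len: int) -> None:
    """Hypothetical version of the check described for JobConfig.__post_init__."""
    # 0 means "infer at build time", so there is nothing to check yet.
    if vision.num_tokens == 0:
        return
    # ASSUMPTION: image tokens alone must fit in the text model's sequence.
    if vision.num_tokens > max_seq_len:
        raise ValueError(
            f"vision_encoder.num_tokens ({vision.num_tokens}) exceeds "
            f"model.max_seq_len ({max_seq_len})"
        )


# The default stub config skips the check entirely.
stub = VisionEncoderConfigSketch()
check_num_tokens(stub, max_seq_len=2048)

# A pinned token count larger than the sequence length is rejected.
pinned = VisionEncoderConfigSketch(type="siglip2", num_tokens=4096)
message = ""
try:
    check_num_tokens(pinned, max_seq_len=2048)
except ValueError as err:
    message = str(err)
print(message)
```

Running the check at config time rather than at build time means a mismatched pair of values fails before any encoder weights are loaded.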

Classes

VisionEncoderConfig

Configuration for the vision encoder component of a VLM.

class kempnerforge.config.vision.VisionEncoderConfig

Bases: object

Configuration for the vision encoder component of a VLM.

type: str = 'random'

path: str = ''

feature_dim: int = 0

num_tokens: int = 0

__init__(type='random', path='', feature_dim=0, num_tokens=0)

Parameters:

type (str)
path (str)
feature_dim (int)
num_tokens (int)

Return type:

None
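The 0 defaults for feature_dim and num_tokens in the signature above act as sentinels for "infer at build time". A minimal sketch of how a builder might resolve them is shown here; resolve_or_infer and the inferred values are hypothetical and do not come from kempnerforge.

```python
def resolve_or_infer(configured: int, inferred: int) -> int:
    """Prefer an explicitly configured value; fall back to the inferred one on 0."""
    return configured if configured > 0 else inferred


# Hypothetical values an encoder might report once it has been built.
inferred_feature_dim = 768
inferred_num_tokens = 196

# feature_dim left at its default of 0, so the inferred value is used.
feature_dim = resolve_or_infer(0, inferred_feature_dim)

# num_tokens set explicitly, so the configured value wins.
num_tokens = resolve_or_infer(256, inferred_num_tokens)

print(feature_dim, num_tokens)
```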