kempnerforge.config.video
Video input configuration.
VideoConfig is the [video] top-level section. When present, the job
trains on a video dataset through the VLM wrapper: a clip is decoded into an
ordered set of frames, each preprocessed like an image and fed to the vision
encoder. The section is a sibling of [vision_encoder] / [adapter] /
[vlm] and requires [vlm] to be set.
Frame-sampling defaults follow the Molmo2 paper (sample fps frames per second,
include the first and last frame, cap at max_frames). max_frames is the
per-clip frame budget; the number of visual tokens it implies
(max_frames * tokens_per_frame) feeds the residual-stream / sequence-length
math once the model consumes video.
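The sampling rule and token arithmetic above can be sketched as follows. This is an illustration under stated assumptions, not the library's "uniform" policy or the exact Molmo2 procedure: the function names, the native_fps argument, the uniform re-spacing used when the cap is hit, and the 14-pixel patch size used to derive tokens_per_frame are all hypothetical. Padding short clips up to min_frames is not modelled.

```python
def sample_frame_indices(num_frames, native_fps, fps=2.0, max_frames=16):
    """Pick frame indices at roughly `fps`, keeping the first and last frame.

    If that yields more than `max_frames` indices, fall back to `max_frames`
    evenly spaced indices that still include both endpoints.
    """
    if num_frames <= 0:
        return []
    if num_frames == 1 or max_frames == 1:
        return [0]
    last = num_frames - 1
    # Source frames between consecutive samples (never less than one frame).
    step = max(native_fps / fps, 1.0)
    count = int(last // step) + 1
    indices = [round(i * step) for i in range(count)]
    if indices[-1] != last:
        indices.append(last)  # always include the final frame
    if len(indices) > max_frames:
        # Over budget: re-space uniformly across the whole clip.
        indices = [round(i * last / (max_frames - 1)) for i in range(max_frames)]
    return indices


def visual_token_budget(max_frames, tokens_per_frame):
    """Upper bound on visual tokens contributed by one clip."""
    return max_frames * tokens_per_frame


# A 3-second clip at 30 fps stays under the cap.
short_clip = sample_frame_indices(num_frames=90, native_fps=30)

# A 100-second clip at 30 fps exceeds the cap and is re-spaced to 16 frames.
long_clip = sample_frame_indices(num_frames=3000, native_fps=30)

# Assumption: a ViT-style encoder with 14-pixel patches on 224-pixel frames.
tokens_per_frame = (224 // 14) ** 2
budget = visual_token_budget(max_frames=16, tokens_per_frame=tokens_per_frame)
```

The budget is the worst case per clip; clips that sample fewer than max_frames frames contribute proportionally fewer visual tokens.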
Classes
VideoConfig | Video dataset location and frame-sampling knobs.
- class kempnerforge.config.video.VideoConfig
Bases: object
Video dataset location and frame-sampling knobs.
- Fields:
  data_root: Root directory of the on-disk video dataset.
  dataset_type: Registry key for the dataset builder ("webvid" by default).
  dataset_name: On-disk corpus name within a style (e.g. "webvid-10M").
  sampling_policy: Registry key for the frame-sampling policy ("uniform").
  split: Which split to read ("train" or "validation").
  max_samples: Cap the manifest to this many examples (0 = all).
  max_frames: Maximum frames sampled per clip (the per-clip budget).
  min_frames: Minimum frames sampled per clip; short clips pad up to this.
  fps: Target sampling rate in frames per second (Molmo2 uses 2).
  frame_size: Square pixel size each frame is resized to.
  prompt: Optional instruction prepended to the target text, masked from loss.
- __init__(data_root='', dataset_type='webvid', dataset_name='webvid-10M', sampling_policy='uniform', split='train', max_samples=0, max_frames=16, min_frames=4, fps=2.0, frame_size=224, prompt='')
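To show how the defaults in the signature above fit together without importing the package, here is a stand-in that mirrors the same fields and default values. VideoConfigSketch and check_frame_bounds are hypothetical names; whether the real VideoConfig is a dataclass, is frozen, or validates its fields this way is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VideoConfigSketch:
    """Stand-in mirroring the fields and defaults of the documented __init__."""

    data_root: str = ""
    dataset_type: str = "webvid"
    dataset_name: str = "webvid-10M"
    sampling_policy: str = "uniform"
    split: str = "train"
    max_samples: int = 0  # 0 means use the whole manifest
    max_frames: int = 16
    min_frames: int = 4
    fps: float = 2.0
    frame_size: int = 224
    prompt: str = ""


def check_frame_bounds(cfg: VideoConfigSketch) -> None:
    """Hypothetical sanity check on the frame-sampling knobs."""
    if cfg.min_frames > cfg.max_frames:
        raise ValueError("min_frames must not exceed max_frames")
    if cfg.fps <= 0:
        raise ValueError("fps must be positive")


default_cfg = VideoConfigSketch()

# Override a few fields for a validation run; the path is a placeholder.
val_cfg = replace(
    default_cfg,
    data_root="/path/to/videos",
    split="validation",
    max_frames=32,
)
check_frame_bounds(val_cfg)
```

With the real class, the equivalent call would presumably pass the same keyword arguments to VideoConfig directly.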