kempnerforge.data.video_dataset

Video dataset and collator for the VLM video path (WebVid-style layout).

WebVidVideoDataset reads a WebVid-style on-disk corpus: per-partition CSV manifests (columns videoid and name, where name holds the caption) plus .mp4 files laid out under raw/videos/<split>/. It produces the video analogue of the single-image VLMSample:

pixel_values: (F, 3, H, W) float tensor, where F = max_frames. Each frame is resized and normalized exactly like the image path; clips that yield fewer than F real frames are zero-padded.
frame_mask: (F,) bool — True for real frames, False for padding.
input_ids / labels: (T,) int64, right-padded to max_text_len, with -100 on pad/prompt positions. A clip that fails to decode contributes no loss (all labels -100) so noisy data never crashes training.
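The text fields above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name build_text_fields, the decode_ok flag, the token ids, and the choice to truncate over-long text are all assumptions made for illustration, and any label shifting the model may apply is not shown.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_text_fields(prompt_ids, caption_ids, max_text_len, pad_id, decode_ok=True):
    """Hypothetical helper: right-pad to max_text_len and mask prompt/pad labels."""
    ids = (prompt_ids + caption_ids)[:max_text_len]  # truncation is an assumption
    n_real = len(ids)
    n_prompt = min(len(prompt_ids), n_real)
    n_pad = max_text_len - n_real

    input_ids = ids + [pad_id] * n_pad
    # Only caption tokens keep their ids as labels; prompt and pad get -100.
    labels = [IGNORE_INDEX] * n_prompt + ids[n_prompt:] + [IGNORE_INDEX] * n_pad

    if not decode_ok:
        # A clip that failed to decode contributes no loss at all.
        labels = [IGNORE_INDEX] * max_text_len
    return input_ids, labels


# Made-up token ids: a 2-token prompt followed by a 3-token caption.
input_ids, labels = build_text_fields([11, 12], [21, 22, 23], max_text_len=8, pad_id=0)
_, failed_labels = build_text_fields([11, 12], [21, 22, 23], 8, 0, decode_ok=False)
```

Keeping the sample in the batch with all labels at -100, rather than raising, leaves the batch shape unchanged while removing the clip's contribution to the loss.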

VideoCollator stacks samples into a fixed-shape batch (pixel_values: (B, F, 3, H, W), frame_mask: (B, F)) so every DP rank sees identical shapes under FSDP2.

Frame decoding lives in video_io.decode_video_frames and is imported at module scope so tests can substitute a stub; av itself is imported lazily inside the decoder.

Functions

build_video_dataset(video_config, ...)

Build the video dataset selected by video_config.dataset_type.

Classes

`VideoCollator`	Stack video samples into a fixed-shape batch.
`VideoDataset`	Base for video-caption datasets feeding the VLM video path.
`WebVidVideoDataset`	Map-style WebVid-style video-caption dataset for VLM training.

class kempnerforge.data.video_dataset.VideoDataset

Bases: Dataset

Base for video-caption datasets feeding the VLM video path.

A subclass is a map-style Dataset whose __getitem__ returns the sample dict VideoCollator batches:

pixel_values: (F, 3, H, W) float32 (F = max_frames, zero-padded).
frame_mask: (F,) bool (True for real frames).
input_ids / labels: (T,) int64, padded to max_text_len with -100 on pad/prompt positions.
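As a rough illustration of the pixel_values / frame_mask contract, the sketch below zero-pads a short clip to max_frames. The helper name pad_frames is hypothetical, and truncating clips longer than max_frames is an assumption; the source only states that short clips are zero-padded.

```python
import torch


def pad_frames(frames: torch.Tensor, max_frames: int):
    """Hypothetical helper: pad a (n, 3, H, W) clip to (max_frames, 3, H, W)."""
    n = min(frames.shape[0], max_frames)  # truncation of long clips is an assumption
    pixel_values = torch.zeros((max_frames, *frames.shape[1:]), dtype=torch.float32)
    pixel_values[:n] = frames[:n].to(torch.float32)

    frame_mask = torch.zeros(max_frames, dtype=torch.bool)
    frame_mask[:n] = True  # True marks real frames, False marks padding
    return pixel_values, frame_mask


# A clip that decoded to only 3 frames, padded to F = 5.
clip = torch.rand(3, 3, 8, 8)
pixel_values, frame_mask = pad_frames(clip, max_frames=5)
```

The mask lets downstream code ignore padded frames without the tensor shape ever depending on how many frames a clip actually yielded.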

Register a new dataset style with @registry.register_video_dataset and select it via [video].dataset_type; build_video_dataset dispatches through the registry. WebVidVideoDataset is the WebVid-style layout (per-partition CSV manifests + prefix-nested .mp4 files); other styles (HuggingFace video sets, flat folders, alternate manifests) are follow-ups.

class kempnerforge.data.video_dataset.WebVidVideoDataset

Bases: VideoDataset

Map-style WebVid-style video-caption dataset for VLM training.

Parameters:

data_root – Dataset root (contains raw/<dataset_name>/data and raw/videos).
split – "train" or "validation".
tokenizer_path – HF tokenizer id or local path.
max_text_len – Fixed-length text pad target.
max_frames / min_frames / fps – Frame-sampling knobs (see video_io).
frame_size – Square pixel size per frame.
max_samples – Cap the manifest (0 = all).
prompt – Optional instruction prepended and masked from the loss.
image_mean / image_std – Per-channel normalization (SigLIP defaults).

__init__(data_root, split, tokenizer_path, max_text_len, *, max_frames, min_frames, fps, frame_size=224, max_samples=0, prompt='', dataset_name='webvid-10M', sampling_policy='uniform', image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5))

Parameters:

data_root (str)
split (str)
tokenizer_path (str)
max_text_len (int)
max_frames (int)
min_frames (int)
fps (float)
frame_size (int)
max_samples (int)
prompt (str)
dataset_name (str)
sampling_policy (str)
image_mean (tuple[float, float, float])
image_std (tuple[float, float, float])

Return type:

None

class kempnerforge.data.video_dataset.VideoCollator

Bases: object

Stack video samples into a fixed-shape batch.

Output keys:

pixel_values: (B, F, 3, H, W) float32.
frame_mask: (B, F) bool (True = real frame).
input_ids: (B, max_text_len) int64.
labels: (B, max_text_len) int64 with -100 on pad/prompt.

Text is always padded to max_text_len (never batch-max) so DP ranks see identical shapes under FSDP2, matching VLMCollator.
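A minimal sketch of fixed-shape collation follows. SketchVideoCollator is a hypothetical stand-in that mirrors the documented constructor arguments (pad_id, max_text_len); the real class may differ in details, for example in whether it pads text itself or expects already-padded samples.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100


class SketchVideoCollator:
    """Hypothetical stand-in: stacks samples and pads text to a fixed length."""

    def __init__(self, pad_id: int, max_text_len: int):
        self.pad_id = pad_id
        self.max_text_len = max_text_len

    def _pad_text(self, t: torch.Tensor, value: int) -> torch.Tensor:
        t = t[: self.max_text_len]
        # Always pad to max_text_len, never to the longest item in the batch.
        return F.pad(t, (0, self.max_text_len - t.shape[0]), value=value)

    def __call__(self, samples):
        return {
            "pixel_values": torch.stack([s["pixel_values"] for s in samples]),
            "frame_mask": torch.stack([s["frame_mask"] for s in samples]),
            "input_ids": torch.stack(
                [self._pad_text(s["input_ids"], self.pad_id) for s in samples]
            ),
            "labels": torch.stack(
                [self._pad_text(s["labels"], IGNORE_INDEX) for s in samples]
            ),
        }


def make_sample(n_tokens, n_real_frames, max_frames=4, size=8):
    # Synthetic sample with made-up token ids 1..n_tokens.
    frame_mask = torch.zeros(max_frames, dtype=torch.bool)
    frame_mask[:n_real_frames] = True
    ids = torch.arange(1, n_tokens + 1, dtype=torch.int64)
    return {
        "pixel_values": torch.zeros(max_frames, 3, size, size),
        "frame_mask": frame_mask,
        "input_ids": ids,
        "labels": ids.clone(),
    }


collate = SketchVideoCollator(pad_id=0, max_text_len=6)
batch = collate([make_sample(3, 2), make_sample(5, 4)])
```

Because the text width comes from max_text_len rather than from the batch contents, two ranks holding different samples still produce tensors of the same shape. An object like this could be passed as collate_fn to torch.utils.data.DataLoader.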

__init__(pad_id, max_text_len)

Parameters:

pad_id (int)
max_text_len (int)

Return type:

None

kempnerforge.data.video_dataset.build_video_dataset(video_config, tokenizer_path, max_text_len)

Build the video dataset selected by video_config.dataset_type.

Dispatches through the video_dataset registry, so adding a new dataset style takes one builder decorated with @registry.register_video_dataset plus a config string. The config is duck-typed to avoid a data->config import cycle.

Parameters:

video_config (Any)
tokenizer_path (str)
max_text_len (int)

Return type:

VideoDataset