kempnerforge.data.video_dataset

Video dataset and collator for the VLM video path (WebVid-style layout).

WebVidVideoDataset reads a WebVid-style on-disk corpus — per-partition CSV manifests (videoid, name = caption) plus .mp4 files laid out under raw/videos/<split>/ — and produces the video analogue of the single-image VLMSample:

  • pixel_values: (F, 3, H, W) float tensor — F = max_frames frames, each resized/normalized exactly like the image path. Clips that yield fewer than F real frames are zero-padded.

  • frame_mask: (F,) bool — True for real frames, False for padding.

  • input_ids / labels: (T,) int64, right-padded to max_text_len, with -100 on pad/prompt positions. A clip that fails to decode contributes no loss (all labels -100) so noisy data never crashes training.

VideoCollator stacks samples into a fixed-shape batch (pixel_values: (B, F, 3, H, W), frame_mask: (B, F)) so every DP rank sees identical shapes under FSDP2.

Frame decoding lives in video_io.decode_video_frames and is imported at module scope so tests can substitute a stub; av itself is imported lazily inside the decoder.

Functions

build_video_dataset(video_config, ...)

Build the video dataset selected by video_config.dataset_type.

Classes

VideoCollator

Stack video samples into a fixed-shape batch.

VideoDataset

Base for video-caption datasets feeding the VLM video path.

WebVidVideoDataset

Map-style WebVid-style video-caption dataset for VLM training.

class kempnerforge.data.video_dataset.VideoDataset[source]

Bases: Dataset

Base for video-caption datasets feeding the VLM video path.

A subclass is a map-style Dataset whose __getitem__ returns the sample dict VideoCollator batches:

  • pixel_values: (F, 3, H, W) float32 (F = max_frames, zero-padded).

  • frame_mask: (F,) bool (True for real frames).

  • input_ids / labels: (T,) int64, padded to max_text_len with -100 on pad/prompt positions.

Register a new dataset style with @registry.register_video_dataset and select it via [video].dataset_type; build_video_dataset dispatches through the registry. WebVidVideoDataset is the WebVid-style layout (per-partition CSV manifests + prefix-nested .mp4 files); other styles (HuggingFace video sets, flat folders, alternate manifests) are follow-ups.

class kempnerforge.data.video_dataset.WebVidVideoDataset[source]

Bases: VideoDataset

Map-style WebVid-style video-caption dataset for VLM training.

Parameters:
  • data_root – Dataset root (contains raw/<dataset_name>/data and raw/videos).

  • split"train" or "validation".

  • tokenizer_path – HF tokenizer id or local path.

  • max_text_len – Fixed-length text pad target.

  • fps (max_frames / min_frames /) – Frame-sampling knobs (see video_io).

  • frame_size – Square pixel size per frame.

  • max_samples – Cap the manifest (0 = all).

  • prompt – Optional instruction prepended and masked from the loss.

  • image_std (image_mean /) – Per-channel normalization (SigLIP defaults).

__init__(data_root, split, tokenizer_path, max_text_len, *, max_frames, min_frames, fps, frame_size=224, max_samples=0, prompt='', dataset_name='webvid-10M', sampling_policy='uniform', image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5))[source]
Parameters:
Return type:

None

class kempnerforge.data.video_dataset.VideoCollator[source]

Bases: object

Stack video samples into a fixed-shape batch.

Output keys:
  • pixel_values: (B, F, 3, H, W) float32.

  • frame_mask: (B, F) bool (True = real frame).

  • input_ids: (B, max_text_len) int64.

  • labels: (B, max_text_len) int64 with -100 on pad/prompt.

Text is always padded to max_text_len (never batch-max) so DP ranks see identical shapes under FSDP2, matching VLMCollator.

__init__(pad_id, max_text_len)[source]
Parameters:
  • pad_id (int)

  • max_text_len (int)

Return type:

None

kempnerforge.data.video_dataset.build_video_dataset(video_config, tokenizer_path, max_text_len)[source]

Build the video dataset selected by video_config.dataset_type.

Dispatches through the video_dataset registry, so a new dataset style is one @registry.register_video_dataset builder + a config string. The config is duck-typed to avoid a data->config import cycle.

Parameters:
  • video_config (Any)

  • tokenizer_path (str)

  • max_text_len (int)

Return type:

VideoDataset