kempnerforge.data.video_dataset
Video dataset and collator for the VLM video path (WebVid-style layout).
WebVidVideoDataset reads a WebVid-style on-disk corpus — per-partition CSV
manifests (videoid, name = caption) plus .mp4 files laid out under
raw/videos/<split>/ — and produces the video analogue of the single-image
VLMSample:
- pixel_values: (F, 3, H, W) float tensor: F = max_frames frames, each resized/normalized exactly like the image path. Clips that yield fewer than F real frames are zero-padded.
- frame_mask: (F,) bool: True for real frames, False for padding.
- input_ids / labels: (T,) int64, right-padded to max_text_len, with -100 on pad/prompt positions. A clip that fails to decode contributes no loss (all labels -100), so noisy data never crashes training.
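The masking convention described here can be sketched in plain Python lists. This is an illustration under stated assumptions, not the module's code: `build_text_fields`, `build_frame_mask`, `pad_id`, and the truncation rule are hypothetical names and choices introduced for the example. Only the `-100` masking of pad/prompt positions, the all-`-100` labels for a failed decode, and the `True`/`False` frame mask come from the description.

```python
IGNORE_INDEX = -100  # label value that contributes no loss


def build_text_fields(prompt_ids, caption_ids, max_text_len, pad_id=0, decode_ok=True):
    """Return (input_ids, labels), both right-padded to max_text_len.

    Hypothetical helper: names and the truncation rule are assumptions.
    """
    ids = (list(prompt_ids) + list(caption_ids))[:max_text_len]
    n_prompt = min(len(prompt_ids), max_text_len)
    # Prompt positions are masked; caption positions are supervised.
    labels = [IGNORE_INDEX] * n_prompt + ids[n_prompt:]
    n_pad = max_text_len - len(ids)
    input_ids = ids + [pad_id] * n_pad
    labels = labels + [IGNORE_INDEX] * n_pad
    if not decode_ok:
        # A clip that failed to decode contributes no loss at all.
        labels = [IGNORE_INDEX] * max_text_len
    return input_ids, labels


def build_frame_mask(num_real_frames, max_frames):
    """True for real frames, False for zero-padded slots."""
    n = min(num_real_frames, max_frames)
    return [True] * n + [False] * (max_frames - n)
```

Because every sample ends up with length `max_text_len` and `max_frames`, a batch of such samples can be stacked without any per-batch padding logic.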
VideoCollator stacks samples into a fixed-shape batch
(pixel_values: (B, F, 3, H, W), frame_mask: (B, F)) so every DP rank
sees identical shapes under FSDP2.
Frame decoding lives in video_io.decode_video_frames and is imported at
module scope so tests can substitute a stub; av itself is imported lazily
inside the decoder.
Functions
build_video_dataset | Build the video dataset selected by video_config.dataset_type.
Classes
VideoCollator | Stack video samples into a fixed-shape batch.
VideoDataset | Base for video-caption datasets feeding the VLM video path.
WebVidVideoDataset | Map-style WebVid-style video-caption dataset for VLM training.
- class kempnerforge.data.video_dataset.VideoDataset
Bases: Dataset
Base for video-caption datasets feeding the VLM video path.
A subclass is a map-style Dataset whose __getitem__ returns the sample dict that VideoCollator batches:
- pixel_values: (F, 3, H, W) float32 (F = max_frames, zero-padded).
- frame_mask: (F,) bool (True for real frames).
- input_ids / labels: (T,) int64, padded to max_text_len with -100 on pad/prompt positions.
Register a new dataset style with @registry.register_video_dataset and select it via [video].dataset_type; build_video_dataset dispatches through the registry. WebVidVideoDataset implements the WebVid-style layout (per-partition CSV manifests + prefix-nested .mp4 files); other styles (HuggingFace video sets, flat folders, alternate manifests) are follow-ups.
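The register-then-dispatch pattern can be sketched with a plain dictionary. Everything below is a hypothetical stand-in: the real `registry.register_video_dataset` may take different arguments, and the `"flat_folder"` style, the `data_root` config field, and the tuple return value exist only for illustration.

```python
from types import SimpleNamespace

_VIDEO_DATASETS = {}  # hypothetical registry: dataset_type string -> builder


def register_video_dataset(name):
    """Decorator that records a builder under `name` (assumed signature)."""
    def decorator(builder):
        if name in _VIDEO_DATASETS:
            raise ValueError(f"video dataset {name!r} already registered")
        _VIDEO_DATASETS[name] = builder
        return builder
    return decorator


@register_video_dataset("flat_folder")
def build_flat_folder(video_config, tokenizer_path, max_text_len):
    # Placeholder return value; a real builder would return a VideoDataset.
    return ("flat_folder", video_config.data_root, tokenizer_path, max_text_len)


def build_video_dataset(video_config, tokenizer_path, max_text_len):
    """Look up the builder named by video_config.dataset_type and call it."""
    try:
        builder = _VIDEO_DATASETS[video_config.dataset_type]
    except KeyError:
        raise ValueError(f"unknown dataset_type {video_config.dataset_type!r}") from None
    return builder(video_config, tokenizer_path, max_text_len)


# Any object exposing the needed attributes works as the config (duck typing).
cfg = SimpleNamespace(dataset_type="flat_folder", data_root="/data")
dataset = build_video_dataset(cfg, "tok", 64)
```

Since the dispatcher only reads attributes off the config object, the data package never has to import the config classes.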
- class kempnerforge.data.video_dataset.WebVidVideoDataset
Bases: VideoDataset
Map-style WebVid-style video-caption dataset for VLM training.
- Parameters:
data_root – Dataset root (contains raw/<dataset_name>/data and raw/videos).
split – "train" or "validation".
tokenizer_path – HF tokenizer id or local path.
max_text_len – Fixed-length text pad target.
max_frames / min_frames / fps – Frame-sampling knobs (see video_io).
frame_size – Square pixel size per frame.
max_samples – Cap the manifest (0 = all).
prompt – Optional instruction prepended and masked from the loss.
image_mean / image_std – Per-channel normalization (SigLIP defaults).
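A minimal sketch of reading one manifest with the `max_samples` cap, using only the standard library. `read_manifest` is a hypothetical helper; it assumes a header row containing `videoid` and `name` columns and says nothing about how the real dataset merges partitions or resolves .mp4 paths.

```python
import csv
import io


def read_manifest(csv_file, max_samples=0):
    """Return [(videoid, caption), ...] from a WebVid-style CSV manifest.

    Assumes a header row with `videoid` and `name` columns; other columns
    are ignored. max_samples=0 keeps every row.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append((row["videoid"], row["name"]))
        if max_samples and len(rows) >= max_samples:
            break
    return rows


# In-memory manifest with made-up rows, standing in for a partition CSV.
demo = io.StringIO("videoid,name\n101,a dog runs\n102,waves at sunset\n103,city traffic\n")
all_rows = read_manifest(demo)
demo.seek(0)
capped = read_manifest(demo, max_samples=2)
```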
- class kempnerforge.data.video_dataset.VideoCollator
Bases: object
Stack video samples into a fixed-shape batch.
- Output keys:
- pixel_values: (B, F, 3, H, W) float32.
- frame_mask: (B, F) bool (True = real frame).
- input_ids: (B, max_text_len) int64.
- labels: (B, max_text_len) int64 with -100 on pad/prompt.
Text is always padded to max_text_len (never batch-max) so DP ranks see identical shapes under FSDP2, matching VLMCollator.
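The fixed-shape guarantee can be illustrated with a list-based collate function. This is a sketch, not the VideoCollator code: `collate` and `make_sample` are hypothetical, `pixel_values` is left out for brevity, and nested lists stand in for tensors.

```python
def collate(samples, max_frames, max_text_len):
    """Stack per-sample lists key-wise after checking the fixed shapes."""
    expected = {
        "frame_mask": max_frames,
        "input_ids": max_text_len,
        "labels": max_text_len,
    }
    for i, sample in enumerate(samples):
        for key, length in expected.items():
            if len(sample[key]) != length:
                raise ValueError(
                    f"sample {i}: {key} has length {len(sample[key])}, expected {length}"
                )
    return {key: [list(s[key]) for s in samples] for key in expected}


def make_sample(n_frames, n_tokens, max_frames=4, max_text_len=6):
    # Builds an already-padded sample, as the dataset is described to do.
    return {
        "frame_mask": [True] * n_frames + [False] * (max_frames - n_frames),
        "input_ids": list(range(1, n_tokens + 1)) + [0] * (max_text_len - n_tokens),
        "labels": list(range(1, n_tokens + 1)) + [-100] * (max_text_len - n_tokens),
    }


# Two ranks draw clips of different real lengths but produce the same shapes.
rank0 = collate([make_sample(4, 5), make_sample(2, 3)], max_frames=4, max_text_len=6)
rank1 = collate([make_sample(1, 1), make_sample(3, 6)], max_frames=4, max_text_len=6)
```

Padding to the batch maximum instead would let the two ranks end up with different text lengths, which is the situation the fixed `max_text_len` target avoids.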
- kempnerforge.data.video_dataset.build_video_dataset(video_config, tokenizer_path, max_text_len)
Build the video dataset selected by video_config.dataset_type.
Dispatches through the video_dataset registry, so adding a new dataset style takes one @registry.register_video_dataset builder plus a config string. The config is duck-typed to avoid a data->config import cycle.
- Parameters:
- Return type: