kempnerforge.data.video_io

Video frame sampling and decoding for the VLM video path.

A clip is reduced to an ordered set of still frames that the VLM pipeline treats like a sequence of images. Two concerns live here:

  1. sample_timestamps: which timestamps to sample. This is the policy from the Molmo2 paper (§3.1, §A): sample at a target frame rate fps, cap the total at max_frames (uniformly subsampling longer clips), and always include the first and last frame. Sampling is expressed in seconds rather than frame indices so it is robust to variable-fps video. This function is pure (no decoder dependency) and unit-tested directly.

  2. decode_video_frames: how to read those frames. Decoding uses PyAV (av), whose manylinux wheel bundles FFmpeg, so neither a system FFmpeg nor matching CUDA libraries are required (torchcodec needs both). av is imported lazily so this module imports cleanly without it; only actual decoding requires the package.
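The lazy-import pattern mentioned in item 2 can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name lazy_import and the error message are hypothetical, and the real module may simply place the import statement inside the decoding function.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str) -> ModuleType:
    """Import a module only when it is first needed (hypothetical helper).

    Deferring the import keeps the enclosing module importable even when
    the optional dependency is not installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a hint that names the missing optional dependency.
        raise ImportError(
            f"{module_name} is required for this operation but is not installed"
        ) from exc


# Inside a decoding function one would call, for example:
#     av = lazy_import("av")
# so that only callers that actually decode video need PyAV.
```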

Returned frames are PIL.Image objects so the caller can reuse the exact image preprocessing (_pil_to_tensor) used on the single-image path.

Functions

decode_video_frames(path, *, fps, ...[, ...])

Decode a clip into a list of sampled PIL.Image frames (RGB).

sample_timestamps(duration_s, fps, ...)

Timestamps (seconds) to sample from a clip of length duration_s.

kempnerforge.data.video_io.sample_timestamps(duration_s, fps, min_frames, max_frames)

Timestamps (seconds) to sample from a clip of length duration_s.

Policy (Molmo2 §3.1/§A): aim for fps frames per second, clamp the count to [min_frames, max_frames], and lay the samples out uniformly over [0, duration_s] so the first frame (0.0) and last frame (duration_s) are always included. A non-positive duration (unknown or instantaneous) yields a single timestamp at the start.

Returns a strictly increasing list of length in [1, max_frames].

Parameters:
Return type:

list[float]
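The policy described above can be sketched as a small pure function. This is an illustration under stated assumptions rather than the library's implementation: the name sample_timestamps_sketch is hypothetical, the target count is assumed to be floor(duration_s * fps) + 1, and a clamped count of 1 is assumed to return only the start timestamp.

```python
def sample_timestamps_sketch(
    duration_s: float, fps: float, min_frames: int, max_frames: int
) -> list[float]:
    """Uniform timestamps over [0, duration_s], count clamped to the bounds."""
    # Unknown or instantaneous clip: a single timestamp at the start.
    if duration_s <= 0:
        return [0.0]

    # ASSUMPTION: the target count is one frame per 1/fps seconds plus the
    # frame at t = 0. The real rounding rule is not stated in the docs.
    n = int(duration_s * fps) + 1

    # Clamp to [min_frames, max_frames], never going below one frame.
    n = max(1, min(max(n, min_frames), max_frames))

    # ASSUMPTION: with a single frame, only the start can be returned.
    if n == 1:
        return [0.0]

    # Lay samples out uniformly; pin the last one to duration_s exactly so
    # floating-point error cannot push it off the end of the clip.
    step = duration_s / (n - 1)
    return [i * step for i in range(n - 1)] + [duration_s]
```

For a 100-second clip at fps=2.0 with max_frames=8, the sketch caps the 201 candidate samples at 8 uniformly spaced timestamps running from 0.0 to 100.0.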

kempnerforge.data.video_io.decode_video_frames(path, *, fps, min_frames, max_frames, sampling_policy='uniform')

Decode a clip into a list of sampled PIL.Image frames (RGB).

Frames are chosen by the registered sampling_policy (default "uniform" = sample_timestamps) and read in a single decode pass: each target timestamp is mapped to the first decoded frame at or after it (timestamps past the last frame map to the last frame, so the final frame is always returned). The returned list has length equal to the number of sampled timestamps (<= max_frames), or is empty when the file has no decodable video stream.
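The timestamp-to-frame mapping described above can be illustrated without a decoder by working on a list of frame presentation times. This is a sketch under assumptions: the function name map_targets_to_frames is hypothetical, both lists are assumed to be sorted in ascending order, and the real code operates on decoded frames rather than a precomputed list of times.

```python
def map_targets_to_frames(
    frame_times: list[float], targets: list[float]
) -> list[int]:
    """For each target time, pick the index of the first frame at or after it.

    Targets past the last frame map to the last frame. Both inputs are
    assumed to be sorted ascending, so one forward pass is enough.
    """
    if not frame_times:
        # No decodable frames: nothing can be returned.
        return []

    picked: list[int] = []
    last = len(frame_times) - 1
    i = 0
    for t in targets:
        # Advance the shared cursor; it never moves backwards, which is
        # what makes this a single pass over the frames.
        while i < last and frame_times[i] < t:
            i += 1
        picked.append(i)
    return picked
```

With frame times [0.0, 0.5, 1.0, 1.5] and targets [0.0, 0.6, 2.0], the sketch selects indices [0, 2, 3]: the target 0.6 falls between frames and takes the next one, and 2.0 lies past the end and takes the last frame.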

Raises whatever av raises on a missing/corrupt file; callers that train over noisy data should catch and substitute an empty clip.
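A caller-side wrapper for noisy data might look like the following. It is a sketch, not part of kempnerforge: the name decode_or_empty is hypothetical, the decoder is passed in as a parameter so the example runs without PyAV, and catching the broad Exception class is an assumption made because the documentation does not list specific exception types.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def decode_or_empty(
    decode_fn: Callable[..., list[Any]], path: str, **kwargs: Any
) -> list[Any]:
    """Call decode_fn(path, **kwargs); return an empty clip on failure."""
    try:
        return decode_fn(path, **kwargs)
    except Exception as exc:  # ASSUMPTION: exact av exception types unknown
        # Log and continue so one bad file does not stop a training run.
        logger.warning("Could not decode %s: %s", path, exc)
        return []
```

In a data loader, decode_fn would be decode_video_frames, with fps, min_frames, and max_frames passed as keyword arguments.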

Parameters:
Return type:

list[PILImage]