Train on video

The VLM path ingests video through the same wrapper, connectors, and fusion archs as images — a clip is just an ordered set of frames. This guide covers the data layout, the [video] config, the frame-sampling policy, and how all four archs consume a clip.

The model’s view of a clip

A clip of F frames becomes F × P′ visual tokens:

  1. Sample F frames from the video by timestamp (target fps, uniform, first and last frame always kept).

  2. Encode each frame with the frozen vision tower (e.g. SigLIP2), fold the frame axis into the batch so B×F frames run through the encoder once.

  3. Pool + project each frame with the connector — an avgpool or attentional_pool adapter reduces a grid×grid patch map to P′ = ceil(grid/window)² tokens per frame (e.g. SigLIP2 @224/patch16 → 14×14 → 49 tokens at pool_window=2).

  4. Fuse the resulting (B, F·P′, dim) visual tokens into the backbone the same way images are fused — so all four archs work unchanged:

    • joint_decoder / mot / moma: the F·P′ tokens prepend the text in the residual stream and are trimmed before the LM head.

    • cross_attention: the F·P′ tokens flow as K/V into the cross-attention blocks; the residual stays text-only (so it fits more frames per max_seq_len).

Temporal order is carried by frame order (sequential positions). Per-frame timestamp tokens and grounding outputs are a separate follow-up (see below).

Token budget

For the residual-stream archs (JD / MoT / MoMa):

max_frames × tokens_per_frame + max_text_len  ≤  model.max_seq_len

e.g. 8 frames × 49 + 64 text = 456 ≤ 576. Cross-attention only needs max_text_len max_seq_len (visual tokens are K/V, not in the residual). The build- and config-time checks enforce this and fail before any GPU work.

Configure it

A video run adds a [video] section (sibling of [vision_encoder] / [adapter] / [vlm]) and a token-reducing connector. See configs/train/vlm_video_webvid.toml for a complete example; the key parts:

[adapter]
type = "avgpool"          # or "attentional_pool"; pools patches per frame
pool_window = 2           # 14×14 grid -> 7×7 = 49 tokens/frame

[vlm]
arch = "joint_decoder"    # also: cross_attention | mot | moma

[video]
data_root = "/path/to/webvid-10m"
dataset_type = "webvid"      # registry key; add styles via @registry.register_video_dataset
dataset_name = "webvid-10M"  # corpus dir under raw/<dataset_name>/data (WebVid style)
sampling_policy = "uniform"  # registry key; the frame-sampling policy
split = "train"              # "train" | "validation"
fps = 2.0                    # target sampling rate
max_frames = 8               # per-clip frame budget
min_frames = 4
frame_size = 224
max_samples = 0              # 0 = full manifest; set small for a smoke

The dataset side is pluggable: dataset_type selects a builder from the video_dataset registry ("webvid" ships; other styles — HuggingFace video sets, flat folders, alternate manifests — register as small follow-ups and are selected here), and sampling_policy selects a registered frame-sampling policy ("uniform" = the Molmo2 default). The WebVid corpus directory is parameterized by dataset_name, so any WebVid-style dataset works, not just webvid-10M: CSV manifests under raw/<dataset_name>/data/<split>/partitions/ and .mp4 files under raw/videos/<split>/.

Decoding uses PyAV, an optional dependency (its wheel bundles FFmpeg, so no system FFmpeg is required): install it with uv sync --group video. It is imported lazily, so the package imports without av and only actual decoding requires it.

Launch

# 4-GPU video training (Joint-Decoder)
uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/vlm_video_webvid.toml

# Quick smoke: no SigLIP download, a few clips, few steps
uv run torchrun --nproc_per_node=2 scripts/train.py configs/train/vlm_video_webvid.toml \
    --vision_encoder.type=random --vision_encoder.num_tokens=196 \
    --vision_encoder.feature_dim=768 --video.max_samples=256 --train.max_steps=20

To switch arch, change [vlm].arch in the config — everything else (frame sampling, connector, dataset) is identical. (arch is resolved at config-load time, so it is set in the TOML, not via a --vlm.arch= CLI override.)

Constraints and follow-ups

  • Causal attention; no per-frame timestamps yet — temporal order is frame order. Per-frame timestamp tokens + grounding (<points>/<tracks> outputs with point-F1 / track-J&F eval) are a follow-up.

  • Padded frames are masked from attention — short/undecodable clips pad to max_frames with blank frames, and the frame_mask is consumed so real tokens never attend to padded-frame visual tokens (MoMa also drops them from expert-choice routing); a NaN guard keeps an all-padded clip finite. It is a pure mask (no new checkpoint keys); image/text keep the FlashAttention-2 path. For the image-prefix arches (Joint-Decoder/MoT/MoMa), video self-attention always takes the explicit-mask SDPA path (FA2 disabled, a (B,1,S,S) mask built) even for fully-decoded clips — a deliberate compile/DP-friendly trade-off; recovering FA2 / FlexAttention is a follow-up. (Cross-Attention keeps FA2 on its text self-attention; it masks padded image K/V in the cross-attention blocks instead.) Remaining: MoT configured with an MoE FFN still routes padded tokens through the shared MoE (a “generic token-validity in MoE” follow-up).

  • Fixed F per batch keeps tensor shapes static (for torch.compile and DP-rank consistency); variable-length clips arrive with VLM sequence packing.

  • Long-context (many frames) is blocked on context-parallel being wired.