kempnerforge.data.vlm_dataset
VLM dataset and collator (Joint-Decoder).
HuggingFaceVLMDataset wraps a HuggingFace image-text dataset and
produces the VLMSample contract:
- pixel_values: (3, H, W) float tensor, resized to image_size and normalized with the provided mean/std.
- input_ids: (T,) int64 tensor, right-padded to max_text_len.
- labels: (T,) int64 tensor matching input_ids, with -100 on padding positions and (optionally) on prompt positions when prompt_field is set.
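For reference, one sample satisfying this contract looks roughly like the following. This is a minimal sketch: the shapes and the -100 convention come from the description above, while the dict layout and all concrete values are placeholders.

```python
import torch

max_text_len = 64  # fixed pad target passed to the dataset

# Sketch of one VLMSample as described above; values are placeholders.
sample = {
    "pixel_values": torch.rand(3, 224, 224),                         # normalized image
    "input_ids": torch.zeros(max_text_len, dtype=torch.int64),       # right-padded token ids
    "labels": torch.full((max_text_len,), -100, dtype=torch.int64),  # -100 on pad/prompt
}
```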
VLMCollator stacks a list of samples into a batch. All batches are padded to the same fixed max_text_len regardless of batch content, so different ranks see identical tensor shapes (no NCCL desync under FSDP2). The collator also emits image_positions, a (B,) tensor of zeros; this slot is reserved for a future multi-image extension and is unused by the Joint-Decoder wrapper today.
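The fixed-shape stacking amounts to something like the sketch below. This illustrates the described behavior, not the actual implementation; collate_fixed is a hypothetical name.

```python
import torch

def collate_fixed(samples, max_text_len):
    """Sketch of the fixed-shape stacking described above (not the real code)."""
    # Samples are already padded to max_text_len by the dataset,
    # so stacking alone yields a constant batch shape on every rank.
    assert all(s["input_ids"].numel() == max_text_len for s in samples)
    return {
        "pixel_values": torch.stack([s["pixel_values"] for s in samples]),  # (B, 3, H, W)
        "input_ids": torch.stack([s["input_ids"] for s in samples]),        # (B, max_text_len)
        "labels": torch.stack([s["labels"] for s in samples]),              # (B, max_text_len)
        # Reserved for a future multi-image extension; all zeros today.
        "image_positions": torch.zeros(len(samples), dtype=torch.long),     # (B,)
    }
```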
The HF datasets and transformers packages are imported lazily
so this module is safe to import without them (e.g. in unit tests that
don’t exercise the dataset path).
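The lazy-import pattern referred to here is the usual function-local import; a sketch under that assumption (the helper name is hypothetical, and the actual module may structure this differently):

```python
def _load_hf_dataset(name: str, split: str):
    # Importing inside the function keeps `datasets` optional:
    # `import kempnerforge.data.vlm_dataset` succeeds without it installed.
    from datasets import load_dataset
    return load_dataset(name, split=split)
```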
Classes
- HuggingFaceVLMDataset: Map-style HF image-text dataset for Joint-Decoder training.
- VLMCollator: Stack VLM samples into a fixed-length batch.
- class kempnerforge.data.vlm_dataset.HuggingFaceVLMDataset
  Bases: Dataset
  Map-style HF image-text dataset for Joint-Decoder training.
- Parameters:
  - dataset_name – HF dataset name (e.g. "sayakpaul/coco-30-val-2014") or a local directory written by datasets.save_to_disk.
  - split – Dataset split.
  - image_field – Column name for the PIL image.
  - text_field – Column name for the caption / target text.
  - tokenizer_path – HF tokenizer id or local path.
  - max_text_len – Fixed-length pad target; passed to the collator.
  - prompt_field – Optional column name for a prompt that should NOT receive loss (e.g. the instruction in an instruction-tuned dataset). Prompt tokens get labels=-100.
  - image_size – Target square image size. Default 224.
  - image_mean / image_std – Normalization stats. Defaults match SigLIP's (0.5, 0.5, 0.5).
  - dataset_config – HF dataset config name, if required.
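Putting the parameters together, construction looks roughly like this. The keyword names follow the parameter list above; the column names, split, and tokenizer id are placeholders, and dict-style sample access is an assumption.

```python
from kempnerforge.data.vlm_dataset import HuggingFaceVLMDataset

ds = HuggingFaceVLMDataset(
    dataset_name="sayakpaul/coco-30-val-2014",  # or a datasets.save_to_disk directory
    split="validation",                         # placeholder split
    image_field="image",                        # placeholder column names
    text_field="caption",
    tokenizer_path="google/siglip-base-patch16-224",  # placeholder tokenizer id
    max_text_len=64,
    image_size=224,
)
sample = ds[0]  # one VLMSample, as described at the top of this page
```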
- class kempnerforge.data.vlm_dataset.VLMCollator
  Bases: object
  Stack VLM samples into a fixed-length batch.
- Output keys:
  - pixel_values: (B, 3, H, W).
  - input_ids: (B, max_text_len) int64.
  - labels: (B, max_text_len) int64 with -100 on pad.
  - image_positions: (B,) long tensor. Reserved slot for multi-image extensions; currently all zeros (single image per example placed at sequence position 0).
Padding is always to max_text_len, never to the batch max, so ranks always see identical tensor shapes under FSDP2.
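In a training loop this typically pairs with a DataLoader, as in the sketch below. It assumes VLMCollator is constructed with no arguments (the docs above don't show its constructor) and reuses the `ds` dataset from the earlier example.

```python
from torch.utils.data import DataLoader
from kempnerforge.data.vlm_dataset import VLMCollator

# Zero-argument construction of VLMCollator is an assumption;
# `ds` is the HuggingFaceVLMDataset built in the example above.
loader = DataLoader(ds, batch_size=8, collate_fn=VLMCollator())

batch = next(iter(loader))
batch["input_ids"].shape  # (8, max_text_len) on every rank, every step
```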