kempnerforge.data.vlm_dataset

VLM dataset and collator (Joint-Decoder).

HuggingFaceVLMDataset wraps a HuggingFace image-text dataset and yields samples satisfying the VLMSample contract:

  • pixel_values: (3, H, W) float tensor, resized to image_size and normalized with the provided mean/std.

  • input_ids: (T,) int64 tensor, right-padded to max_text_len.

  • labels: (T,) int64 tensor matching input_ids with -100 on padding positions and (optionally) on prompt positions when prompt_field is set.
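
A minimal sketch of how one sample satisfies this contract (assuming pad_id=0, max_text_len=64, and a 10-token caption; the values are illustrative, not taken from the module):

  import torch

  max_text_len, pad_id = 64, 0
  token_ids = list(range(1, 11))  # pretend the tokenizer produced 10 real tokens

  # input_ids: right-padded with pad_id to the fixed length.
  input_ids = torch.full((max_text_len,), pad_id, dtype=torch.int64)
  input_ids[: len(token_ids)] = torch.tensor(token_ids, dtype=torch.int64)

  # labels: mirror input_ids, then mask padding (and, when prompt_field is
  # set, the prompt positions) with -100 so they carry no loss.
  labels = input_ids.clone()
  labels[len(token_ids):] = -100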

VLMCollator stacks a list of samples into a batch. All batches are padded to the fixed max_text_len regardless of batch content, so different ranks see identical tensor shapes (no NCCL desync under FSDP2). The collator also emits image_positions, a (B,) tensor of zeros; this slot is reserved for a future multi-image extension and is unused by the Joint-Decoder wrapper today.

The HF datasets and transformers packages are imported lazily so this module is safe to import without them (e.g. in unit tests that don’t exercise the dataset path).
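
The guard presumably follows the standard deferred-import pattern; a sketch (the helper name _load_hf_dataset is hypothetical, not part of the module's API):

  def _load_hf_dataset(dataset_name, split, dataset_config=None):
      # Importing inside the function keeps `import kempnerforge.data.vlm_dataset`
      # working even when `datasets` is not installed.
      try:
          from datasets import load_dataset
      except ImportError as exc:
          raise ImportError(
              "HuggingFaceVLMDataset requires the `datasets` package "
              "(pip install datasets)."
          ) from exc
      return load_dataset(dataset_name, dataset_config, split=split)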

Classes

HuggingFaceVLMDataset

Map-style HF image-text dataset for Joint-Decoder training.

VLMCollator

Stack VLM samples into a fixed-length batch.

class kempnerforge.data.vlm_dataset.HuggingFaceVLMDataset[source]

Bases: Dataset

Map-style HF image-text dataset for Joint-Decoder training.

Parameters:
  • dataset_name – HF dataset name (e.g. "sayakpaul/coco-30-val-2014") or a local directory written by datasets.save_to_disk.

  • split – Dataset split.

  • image_field – Column name for the PIL image.

  • text_field – Column name for the caption / target text.

  • tokenizer_path – HF tokenizer id or local path.

  • max_text_len – Fixed-length pad target; passed to the collator.

  • prompt_field – Optional column name for a prompt that should NOT receive loss (e.g. the instruction in an instruction-tuned dataset). Prompt tokens get labels=-100.

  • image_size – Target square image size. Default 224.

  • image_mean / image_std – Normalization stats. Defaults match SigLIP’s (0.5, 0.5, 0.5).

  • dataset_config – HF dataset config name, if required.

__init__(dataset_name, split, image_field, text_field, tokenizer_path, max_text_len, prompt_field=None, image_size=224, image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5), dataset_config=None)[source]
Parameters:

See the class-level parameter list above.

Return type:

None
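
A hedged construction example (the dataset name repeats the illustrative one above; the column names, tokenizer, and max_text_len are assumptions to check against your dataset's schema):

  from kempnerforge.data.vlm_dataset import HuggingFaceVLMDataset

  dataset = HuggingFaceVLMDataset(
      dataset_name="sayakpaul/coco-30-val-2014",  # or a save_to_disk directory
      split="train",
      image_field="image",    # assumed column name
      text_field="caption",   # assumed column name
      tokenizer_path="gpt2",  # any HF tokenizer id or local path
      max_text_len=64,
  )
  sample = dataset[0]
  # sample["pixel_values"]: (3, 224, 224); sample["input_ids"], sample["labels"]: (64,)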

class kempnerforge.data.vlm_dataset.VLMCollator[source]

Bases: object

Stack VLM samples into a fixed-length batch.

Output keys:
  • pixel_values: (B, 3, H, W).

  • input_ids: (B, max_text_len) int64.

  • labels: (B, max_text_len) int64 with -100 on pad.

  • image_positions: (B,) long tensor. Reserved slot for multi-image extensions; currently all zeros (single image per example placed at sequence position 0).

Padding is always to max_text_len, never to the batch maximum, so all ranks see identical tensor shapes under FSDP2.

__init__(pad_id, max_text_len)[source]
Parameters:
  • pad_id (int)

  • max_text_len (int)

Return type:

None
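
Putting the two classes together, a sketch of wiring the collator into a DataLoader (dataset is the HuggingFaceVLMDataset instance from the example above; this assumes the collator instance is callable in the usual collate_fn sense and that pad_id matches the tokenizer's pad token):

  from torch.utils.data import DataLoader
  from kempnerforge.data.vlm_dataset import VLMCollator

  collator = VLMCollator(pad_id=0, max_text_len=64)  # pad_id is illustrative
  loader = DataLoader(dataset, batch_size=8, collate_fn=collator)

  batch = next(iter(loader))
  # Fixed shapes regardless of batch content:
  #   pixel_values:    (8, 3, 224, 224)
  #   input_ids:       (8, 64)
  #   labels:          (8, 64)
  #   image_positions: (8,)  -- all zeros today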