kempnerforge.data.dataset

Dataset implementations for KempnerForge.

Three dataset types:
  • MemoryMappedDataset: Pre-tokenized numpy files with zero-copy mmap access.

  • HuggingFaceDataset: HuggingFace datasets loaded eagerly, with on-the-fly tokenization and sequence packing.

  • StreamingHuggingFaceDataset: Streaming HuggingFace datasets for very large corpora that don’t fit in memory, with on-the-fly tokenization and sequence packing.

All implement a stateful interface (state_dict / load_state_dict) for resumption after checkpoint loads.
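For example, a minimal save/restore round trip (using a stand-in dataset that implements the same contract; the exact checkpoint keys vary per class, as documented below):

    import torch
    from torch.utils.data import Dataset

    class TinyStatefulDataset(Dataset):
        """Stand-in implementing the same state_dict/load_state_dict contract."""
        def __init__(self):
            self.epoch = 0
        def __len__(self):
            return 4
        def __getitem__(self, i):
            return torch.tensor([i])
        def state_dict(self):
            return {"epoch": self.epoch}
        def load_state_dict(self, state):
            self.epoch = state["epoch"]

    ds = TinyStatefulDataset()
    ds.epoch = 3
    # Persist the dataset position alongside model/optimizer state.
    torch.save({"dataset": ds.state_dict()}, "ckpt.pt")

    # On resume, restore position before rebuilding the DataLoader.
    resumed = TinyStatefulDataset()
    resumed.load_state_dict(torch.load("ckpt.pt")["dataset"])
    assert resumed.epoch == 3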

Classes

HuggingFaceDataset

HuggingFace dataset with on-the-fly tokenization and sequence packing.

MemoryMappedDataset

Pre-tokenized dataset backed by memory-mapped numpy files.

MixtureDataset

Concatenates multiple datasets for weighted mixing.

StreamingHuggingFaceDataset

Streaming HuggingFace dataset with on-the-fly tokenization and packing.

class kempnerforge.data.dataset.MemoryMappedDataset[source]

Bases: Dataset

Pre-tokenized dataset backed by memory-mapped numpy files.

Expects .npy files containing 1D arrays of uint16/uint32 token IDs that have been pre-packed into fixed-length sequences.

File layout: each file stores a flat array of tokens. The dataset splits them into non-overlapping chunks of seq_len tokens. Multiple files are concatenated logically.
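A sketch of the indexing this layout implies (file names are hypothetical, and the real class may differ in details such as remainder handling):

    import numpy as np

    seq_len = 8

    # Two hypothetical pre-tokenized shards, memory-mapped without copying.
    np.save("shard0.npy", np.arange(20, dtype=np.uint16))
    np.save("shard1.npy", np.arange(20, 37, dtype=np.uint16))
    shards = [np.load(f, mmap_mode="r") for f in ("shard0.npy", "shard1.npy")]

    # Each file contributes len(arr) // seq_len non-overlapping chunks;
    # trailing tokens that do not fill a chunk are dropped.
    chunks_per_shard = [len(a) // seq_len for a in shards]
    total = sum(chunks_per_shard)  # logical concatenation across files

    def get_chunk(idx):
        for arr, n in zip(shards, chunks_per_shard):
            if idx < n:
                start = idx * seq_len
                return arr[start:start + seq_len]  # zero-copy mmap view
            idx -= n
        raise IndexError(idx)

    print(total)         # 4 chunks: two from each shard
    print(get_chunk(2))  # first chunk of shard1: tokens 20..27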

Parameters:
  • data_dir – Directory containing .npy token files.

  • seq_len – Sequence length (number of tokens per sample).

  • file_pattern – Glob pattern for data files.

__init__(data_dir, seq_len, file_pattern='*.npy', pack_sequences=False, eos_token_id=None)[source]
Parameters:
  • data_dir (str)

  • seq_len (int)

  • file_pattern (str)

  • pack_sequences (bool)

  • eos_token_id (int | None)

Return type:

None
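Construction sketch (the directory path is hypothetical; the shards must already be pre-tokenized):

    from kempnerforge.data.dataset import MemoryMappedDataset

    ds = MemoryMappedDataset(
        data_dir="/data/tokens",   # directory of pre-tokenized .npy shards
        seq_len=2048,
        file_pattern="*.npy",
    )
    print(len(ds))   # number of seq_len-token samples across all shards
    sample = ds[0]   # first fixed-length chunk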

state_dict()[source]

Return checkpoint state. Keys: epoch, total_samples.

Return type:

dict

load_state_dict(state)[source]

Restore from checkpoint. Only epoch is restored; sample count is derived.

Parameters:

state (dict)

Return type:

None

close()[source]

Release the underlying mmaps. Call this explicitly when you are done with the dataset; do not rely on __del__.

Return type:

None
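A teardown sketch; releasing the mmaps in a finally block keeps cleanup deterministic:

    from kempnerforge.data.dataset import MemoryMappedDataset

    ds = MemoryMappedDataset(data_dir="/data/tokens", seq_len=2048)
    try:
        for i in range(len(ds)):
            _ = ds[i]   # consume samples
    finally:
        ds.close()      # release mmaps explicitly; __del__ may run late or never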

class kempnerforge.data.dataset.HuggingFaceDataset[source]

Bases: Dataset

HuggingFace dataset with on-the-fly tokenization and sequence packing.

Loads a HuggingFace dataset, tokenizes text on the fly, and packs multiple documents into fixed-length sequences (separated by EOS tokens).
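A minimal sketch of this packing scheme (buffer-based; the actual implementation may differ, e.g. in how it handles a final partial buffer):

    def pack_documents(token_lists, seq_len, eos_token_id):
        """Greedily pack tokenized documents into fixed-length sequences,
        appending an EOS token after each document."""
        buffer = []
        for tokens in token_lists:
            buffer.extend(tokens)
            buffer.append(eos_token_id)
            while len(buffer) >= seq_len:
                yield buffer[:seq_len]
                buffer = buffer[seq_len:]

    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(list(pack_documents(docs, seq_len=4, eos_token_id=0)))
    # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]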

Parameters:
  • dataset_name – HuggingFace dataset name (e.g., “allenai/c4”).

  • split – Dataset split (“train”, “validation”, etc.).

  • text_field – Name of the text column.

  • seq_len – Sequence length for packing.

  • tokenizer_path – Path or name of the HuggingFace tokenizer.

  • dataset_config – Optional config name (e.g., “wikitext-2-raw-v1”).

  • pack_sequences – Whether to pack multiple documents into each fixed-length sequence, separated by EOS tokens.

__init__(dataset_name, split, text_field, seq_len, tokenizer_path, dataset_config=None, pack_sequences=False)[source]
Parameters:
  • dataset_name (str)

  • split (str)

  • text_field (str)

  • seq_len (int)

  • tokenizer_path (str)

  • dataset_config (str | None)

  • pack_sequences (bool)

Return type:

None
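A construction sketch (dataset, config, and tokenizer names are illustrative; substitute your own):

    from torch.utils.data import DataLoader
    from kempnerforge.data.dataset import HuggingFaceDataset

    ds = HuggingFaceDataset(
        dataset_name="wikitext",
        dataset_config="wikitext-2-raw-v1",
        split="train",
        text_field="text",
        seq_len=1024,
        tokenizer_path="gpt2",
        pack_sequences=True,
    )
    loader = DataLoader(ds, batch_size=8, shuffle=True)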

state_dict()[source]

Return checkpoint state. Keys: epoch, sample_idx, total_sequences.

Return type:

dict

load_state_dict(state)[source]

Restore from checkpoint. Restores epoch and sample_idx.

Parameters:

state (dict)

Return type:

None

class kempnerforge.data.dataset.StreamingHuggingFaceDataset[source]

Bases: IterableDataset

Streaming HuggingFace dataset with on-the-fly tokenization and packing.

For very large datasets that don’t fit in memory. Streams documents, tokenizes on the fly, and packs into fixed-length sequences.

Handles distributed training by sharding the document stream across ranks (each rank processes every world_size-th document).
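The sharding rule reduces to a strided slice over the document stream, e.g.:

    from itertools import islice

    def shard_stream(documents, rank, world_size):
        """Yield every world_size-th document, starting at this rank's offset."""
        return islice(documents, rank, None, world_size)

    docs = [f"doc{i}" for i in range(8)]
    print(list(shard_stream(docs, rank=1, world_size=4)))  # ['doc1', 'doc5']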

Use directly with torch.utils.data.DataLoader (no sampler needed — IterableDataset handles its own distribution).
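A distributed usage sketch (rank and world_size would normally come from torch.distributed; literals are used here, and the c4 config name is an assumption):

    from torch.utils.data import DataLoader
    from kempnerforge.data.dataset import StreamingHuggingFaceDataset

    ds = StreamingHuggingFaceDataset(
        dataset_name="allenai/c4",
        dataset_config="en",      # assumed config name for c4
        split="train",
        text_field="text",
        seq_len=2048,
        tokenizer_path="gpt2",
        rank=0,                   # torch.distributed.get_rank() in real code
        world_size=1,             # torch.distributed.get_world_size()
    )
    # No DistributedSampler: the IterableDataset shards its own stream.
    loader = DataLoader(ds, batch_size=8)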

Parameters:
  • dataset_name – HuggingFace dataset name (e.g., “allenai/c4”).

  • split – Dataset split (“train”, “validation”, etc.).

  • text_field – Name of the text column.

  • seq_len – Sequence length for packing.

  • tokenizer_path – Path or name of the HuggingFace tokenizer.

  • dataset_config – Optional config name (e.g., “wikitext-2-raw-v1”).

  • rank – Current distributed rank (for document sharding).

  • world_size – Total number of ranks.

  • seed – Random seed for shuffling.

  • shuffle_buffer_size – Number of examples to buffer for shuffling.

  • pack_sequences – Whether to pack multiple documents into each fixed-length sequence, separated by EOS tokens.

__init__(dataset_name, split, text_field, seq_len, tokenizer_path, dataset_config=None, rank=0, world_size=1, seed=42, shuffle_buffer_size=10000, pack_sequences=False)[source]
Parameters:
  • dataset_name (str)

  • split (str)

  • text_field (str)

  • seq_len (int)

  • tokenizer_path (str)

  • dataset_config (str | None)

  • rank (int)

  • world_size (int)

  • seed (int)

  • shuffle_buffer_size (int)

  • pack_sequences (bool)

Return type:

None

state_dict()[source]

Return checkpoint state. Keys: epoch, rank_docs_consumed.

Return type:

dict

load_state_dict(state)[source]

Restore from checkpoint. Sets a skip count so the next iteration fast-forwards past documents this rank has already consumed.

Parameters:

state (dict)

Return type:

None
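Conceptually, fast-forwarding just discards the first rank_docs_consumed documents on the next pass (a hypothetical sketch):

    def fast_forward(stream, skip):
        """Consume and discard `skip` items so iteration resumes where
        the checkpoint left off."""
        it = iter(stream)
        for _ in range(skip):
            next(it, None)
        return it

    it = fast_forward(range(10), skip=4)
    print(next(it))  # 4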

class kempnerforge.data.dataset.MixtureDataset[source]

Bases: Dataset

Concatenates multiple datasets for weighted mixing.

Global index space maps to sub-datasets via cumulative offsets. Each sample includes dataset_idx (integer) so the training loop can compute per-dataset metrics.
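The offset lookup this implies is a bisect over the cumulative sizes, for example:

    import bisect

    cumulative = [0, 100, 250, 400]  # [0, len(ds0), len(ds0)+len(ds1), ...]

    def locate(global_idx):
        """Map a global index to (dataset_idx, local_idx)."""
        dataset_idx = bisect.bisect_right(cumulative, global_idx) - 1
        return dataset_idx, global_idx - cumulative[dataset_idx]

    print(locate(0))    # (0, 0)  -> first sample of ds0
    print(locate(150))  # (1, 50) -> 51st sample of ds1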

Parameters:
  • datasets – List of map-style datasets to mix.

  • names – Human-readable name per dataset (for metrics logging).

__init__(datasets, names)[source]
Parameters:
  • datasets (list[Dataset])

  • names (list[str])

Return type:

None

property cumulative_sizes: list[int]

Cumulative dataset sizes: [0, len(ds0), len(ds0)+len(ds1), ...].

property dataset_names: list[str]

Names of the sub-datasets, in order (used for metrics logging).

state_dict()[source]

Return per-sub-dataset checkpoint state.

Return type:

dict

load_state_dict(state)[source]

Restore per-sub-dataset state.

Parameters:

state (dict)

Return type:

None