kempnerforge.data.dataset

Dataset implementations for KempnerForge.

Three dataset types:
  • MemoryMappedDataset: Pre-tokenized numpy files with zero-copy mmap access.

  • HuggingFaceDataset: HuggingFace datasets loaded eagerly, with on-the-fly tokenization and sequence packing.

  • StreamingHuggingFaceDataset: Streaming HuggingFace datasets for very large corpora that don’t fit in memory, with on-the-fly tokenization and sequence packing.

All implement a stateful interface (state_dict / load_state_dict) for resumption after checkpoint loads.
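For example, a minimal save/restore round trip (using a stand-in dataset that implements the same contract; the exact checkpoint keys vary per class, as documented below):

    import torch
    from torch.utils.data import Dataset

    class TinyStatefulDataset(Dataset):
        """Stand-in implementing the same state_dict/load_state_dict contract."""
        def __init__(self):
            self.epoch = 0
        def __len__(self):
            return 4
        def __getitem__(self, i):
            return torch.tensor([i])
        def state_dict(self):
            return {"epoch": self.epoch}
        def load_state_dict(self, state):
            self.epoch = state["epoch"]

    ds = TinyStatefulDataset()
    ds.epoch = 3
    # Persist the dataset position alongside model/optimizer state.
    torch.save({"dataset": ds.state_dict()}, "ckpt.pt")

    # On resume, restore position before rebuilding the DataLoader.
    resumed = TinyStatefulDataset()
    resumed.load_state_dict(torch.load("ckpt.pt")["dataset"])
    assert resumed.epoch == 3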

Classes

HuggingFaceDataset

HuggingFace dataset with on-the-fly tokenization and sequence packing.

MemoryMappedDataset

Pre-tokenized dataset backed by memory-mapped numpy files.

MixtureDataset

Concatenates multiple datasets for weighted mixing.

StreamingHuggingFaceDataset

Streaming HuggingFace dataset with on-the-fly tokenization and packing.

class kempnerforge.data.dataset.MemoryMappedDataset[source]

Bases: Dataset

Pre-tokenized dataset backed by memory-mapped numpy files.

Expects .npy files containing 1D arrays of uint16/uint32 token IDs that have been pre-packed into fixed-length sequences.

File layout: each file stores a flat array of tokens. The dataset splits them into non-overlapping chunks of seq_len tokens. Multiple files are concatenated logically.
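A sketch of the indexing this layout implies (file names are hypothetical, and the real class may differ in details such as remainder handling):

    import numpy as np

    seq_len = 8

    # Two hypothetical pre-tokenized shards, memory-mapped without copying.
    np.save("shard0.npy", np.arange(20, dtype=np.uint16))
    np.save("shard1.npy", np.arange(20, 37, dtype=np.uint16))
    shards = [np.load(f, mmap_mode="r") for f in ("shard0.npy", "shard1.npy")]

    # Each file contributes len(arr) // seq_len non-overlapping chunks;
    # trailing tokens that do not fill a chunk are dropped.
    chunks_per_shard = [len(a) // seq_len for a in shards]
    total = sum(chunks_per_shard)  # logical concatenation across files

    def get_chunk(idx):
        for arr, n in zip(shards, chunks_per_shard):
            if idx < n:
                start = idx * seq_len
                return arr[start:start + seq_len]  # zero-copy mmap view
            idx -= n
        raise IndexError(idx)

    print(total)         # 4 chunks: two from each shard
    print(get_chunk(2))  # first chunk of shard1: tokens 20..27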

Parameters:
  • data_dir – Directory containing .npy token files.

  • seq_len – Sequence length (number of tokens per sample).

  • file_pattern – Glob pattern for data files.

__init__(data_dir, seq_len, file_pattern='*.npy', pack_sequences=False, eos_token_id=None)[source]
Parameters:
  • data_dir (str)

  • seq_len (int)

  • file_pattern (str)

  • pack_sequences (bool)

  • eos_token_id (int | None)

Return type:

None
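Construction sketch (the directory path is hypothetical; the shards must already be pre-tokenized):

    from kempnerforge.data.dataset import MemoryMappedDataset

    ds = MemoryMappedDataset(
        data_dir="/data/tokens",   # directory of pre-tokenized .npy shards
        seq_len=2048,
        file_pattern="*.npy",
    )
    print(len(ds))   # number of seq_len-token samples across all shards
    sample = ds[0]   # first fixed-length chunk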

state_dict()[source]

Return checkpoint state. Keys: epoch, total_samples.

Return type:

dict

load_state_dict(state)[source]

Restore from checkpoint. Only epoch is restored; sample count is derived.

Parameters:

state (dict)

Return type:

None

close()[source]

Release the underlying mmaps. Call this explicitly when you are done with the dataset; do not rely on __del__.

Return type:

None
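A teardown sketch; releasing the mmaps in a finally block keeps cleanup deterministic:

    from kempnerforge.data.dataset import MemoryMappedDataset

    ds = MemoryMappedDataset(data_dir="/data/tokens", seq_len=2048)
    try:
        for i in range(len(ds)):
            _ = ds[i]   # consume samples
    finally:
        ds.close()      # release mmaps explicitly; __del__ may run late or never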

class kempnerforge.data.dataset.HuggingFaceDataset[source]

Bases: Dataset

HuggingFace dataset with on-the-fly tokenization and sequence packing.

Loads a HuggingFace dataset, tokenizes text on the fly, and packs multiple documents into fixed-length sequences (separated by EOS tokens).
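A minimal sketch of this packing scheme (buffer-based; the actual implementation may differ, e.g. in how it handles a final partial buffer):

    def pack_documents(token_lists, seq_len, eos_token_id):
        """Greedily pack tokenized documents into fixed-length sequences,
        appending an EOS token after each document."""
        buffer = []
        for tokens in token_lists:
            buffer.extend(tokens)
            buffer.append(eos_token_id)
            while len(buffer) >= seq_len:
                yield buffer[:seq_len]
                buffer = buffer[seq_len:]

    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(list(pack_documents(docs, seq_len=4, eos_token_id=0)))
    # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]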

Parameters:
  • dataset_name – HuggingFace dataset name (e.g., “allenai/c4”).

  • split – Dataset split (“train”, “validation”, etc.).

  • text_field – Name of the text column.

  • seq_len – Sequence length for packing.

  • tokenizer_path – Path or name of the HuggingFace tokenizer.

  • dataset_config – Optional config name (e.g., “wikitext-2-raw-v1”).

  • pack_sequences – Whether to pack multiple documents into each fixed-length sequence, separated by EOS tokens.

__init__(dataset_name, split, text_field, seq_len, tokenizer_path, dataset_config=None, pack_sequences=False)[source]
Parameters:
  • dataset_name (str)

  • split (str)

  • text_field (str)

  • seq_len (int)

  • tokenizer_path (str)

  • dataset_config (str | None)

  • pack_sequences (bool)

Return type:

None
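A construction sketch (dataset, config, and tokenizer names are illustrative; substitute your own):

    from torch.utils.data import DataLoader
    from kempnerforge.data.dataset import HuggingFaceDataset

    ds = HuggingFaceDataset(
        dataset_name="wikitext",
        dataset_config="wikitext-2-raw-v1",
        split="train",
        text_field="text",
        seq_len=1024,
        tokenizer_path="gpt2",
        pack_sequences=True,
    )
    loader = DataLoader(ds, batch_size=8, shuffle=True)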

state_dict()[source]

Return checkpoint state. Keys: epoch, sample_idx, total_sequences.

Return type:

dict

load_state_dict(state)[source]

Restore from checkpoint. Restores epoch and sample_idx.

Parameters:

state (dict)

Return type:

None

class kempnerforge.data.dataset.StreamingHuggingFaceDataset[source]

Bases: IterableDataset

Streaming HuggingFace dataset with on-the-fly tokenization and packing.

For very large datasets that don’t fit in memory. Streams documents, tokenizes on the fly, and packs into fixed-length sequences.

Handles distributed training by sharding the document stream across ranks (each rank processes every world_size-th document).
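The sharding rule reduces to a strided slice over the document stream, e.g.:

    from itertools import islice

    def shard_stream(documents, rank, world_size):
        """Yield every world_size-th document, starting at this rank's offset."""
        return islice(documents, rank, None, world_size)

    docs = [f"doc{i}" for i in range(8)]
    print(list(shard_stream(docs, rank=1, world_size=4)))  # ['doc1', 'doc5']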

Use directly with torch.utils.data.DataLoader (no sampler needed — IterableDataset handles its own distribution).
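A distributed usage sketch (rank and world_size would normally come from torch.distributed; literals are used here, and the c4 config name is an assumption):

    from torch.utils.data import DataLoader
    from kempnerforge.data.dataset import StreamingHuggingFaceDataset

    ds = StreamingHuggingFaceDataset(
        dataset_name="allenai/c4",
        dataset_config="en",      # assumed config name for c4
        split="train",
        text_field="text",
        seq_len=2048,
        tokenizer_path="gpt2",
        rank=0,                   # torch.distributed.get_rank() in real code
        world_size=1,             # torch.distributed.get_world_size()
    )
    # No DistributedSampler: the IterableDataset shards its own stream.
    loader = DataLoader(ds, batch_size=8)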

Parameters:
  • dataset_name – HuggingFace dataset name (e.g., “allenai/c4”).

  • split – Dataset split (“train”, “validation”, etc.).

  • text_field – Name of the text column.

  • seq_len – Sequence length for packing.

  • tokenizer_path – Path or name of the HuggingFace tokenizer.

  • dataset_config – Optional config name (e.g., “wikitext-2-raw-v1”).

  • rank – Current distributed rank (for document sharding).

  • world_size – Total number of ranks.

  • seed – Random seed for shuffling.

  • shuffle_buffer_size – Number of examples to buffer for shuffling.

  • pack_sequences – Whether to pack multiple documents into each fixed-length sequence, separated by EOS tokens.

__init__(dataset_name, split, text_field, seq_len, tokenizer_path, dataset_config=None, rank=0, world_size=1, seed=42, shuffle_buffer_size=10000, pack_sequences=False)[source]
Parameters:
  • dataset_name (str)

  • split (str)

  • text_field (str)

  • seq_len (int)

  • tokenizer_path (str)

  • dataset_config (str | None)

  • rank (int)

  • world_size (int)

  • seed (int)

  • shuffle_buffer_size (int)

  • pack_sequences (bool)

Return type:

None

state_dict()[source]

Return checkpoint state. Keys: epoch, rank_docs_consumed.

Return type:

dict

load_state_dict(state)[source]

Restore from checkpoint. Sets a skip count so the next iteration fast-forwards past documents this rank has already consumed.

Parameters:

state (dict)

Return type:

None
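Conceptually, fast-forwarding just discards the first rank_docs_consumed documents on the next pass (a hypothetical sketch):

    def fast_forward(stream, skip):
        """Consume and discard `skip` items so iteration resumes where
        the checkpoint left off."""
        it = iter(stream)
        for _ in range(skip):
            next(it, None)
        return it

    it = fast_forward(range(10), skip=4)
    print(next(it))  # 4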

class kempnerforge.data.dataset.MixtureDataset[source]

Bases: Dataset

Concatenates multiple datasets for weighted mixing.

Global index space maps to sub-datasets via cumulative offsets. Each sample includes dataset_idx (integer) so the training loop can compute per-dataset metrics.
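The offset lookup this implies is a bisect over the cumulative sizes, for example:

    import bisect

    cumulative = [0, 100, 250, 400]  # [0, len(ds0), len(ds0)+len(ds1), ...]

    def locate(global_idx):
        """Map a global index to (dataset_idx, local_idx)."""
        dataset_idx = bisect.bisect_right(cumulative, global_idx) - 1
        return dataset_idx, global_idx - cumulative[dataset_idx]

    print(locate(0))    # (0, 0)  -> first sample of ds0
    print(locate(150))  # (1, 50) -> 51st sample of ds1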

Parameters:
  • datasets – List of map-style datasets to mix.

  • names – Human-readable name per dataset (for metrics logging).

__init__(datasets, names)[source]
Parameters:
  • datasets (list[Dataset])

  • names (list[str])

Return type:

None

property cumulative_sizes: list[int]

Cumulative dataset sizes: [0, len(ds0), len(ds0)+len(ds1), ...].

property dataset_names: list[str]

Names of the sub-datasets, in order (used for metrics logging).

state_dict()[source]

Return per-sub-dataset checkpoint state.

Return type:

dict

load_state_dict(state)[source]

Restore per-sub-dataset state.

Parameters:

state (dict)

Return type:

None