kempnerforge.data.dataset
Dataset implementations for KempnerForge.
Three dataset types:
MemoryMappedDataset: Pre-tokenized numpy files with zero-copy mmap access.
HuggingFaceDataset: HuggingFace datasets with eager loading and sequence packing.
StreamingHuggingFaceDataset: Streaming HuggingFace datasets for very large corpora that don’t fit in memory. On-the-fly tokenization with sequence packing.
All implement a stateful interface (state_dict / load_state_dict) for resumption after checkpoint loads.
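The stateful interface can be sketched as follows. This is an illustrative minimal example of the state_dict / load_state_dict pattern, not KempnerForge's actual implementation; the class and attribute names are hypothetical.

```python
# Minimal sketch of a resumable dataset: state_dict captures the iteration
# cursor so training can resume from a checkpoint. Names are illustrative.
class ResumableDataset:
    def __init__(self, num_samples):
        self.num_samples = num_samples
        self.position = 0  # hypothetical cursor: samples consumed so far

    def state_dict(self):
        # Capture everything needed to resume iteration.
        return {"position": self.position}

    def load_state_dict(self, state):
        # Restore the cursor after a checkpoint load.
        self.position = state["position"]

ds = ResumableDataset(100)
ds.position = 42                 # pretend 42 samples were consumed
ckpt = ds.state_dict()           # saved alongside the model checkpoint
fresh = ResumableDataset(100)
fresh.load_state_dict(ckpt)      # fresh.position is now 42
```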
Classes

HuggingFaceDataset: HuggingFace dataset with on-the-fly tokenization and sequence packing.
MemoryMappedDataset: Pre-tokenized dataset backed by memory-mapped numpy files.
MixtureDataset: Concatenates multiple datasets for weighted mixing.
StreamingHuggingFaceDataset: Streaming HuggingFace dataset with on-the-fly tokenization and packing.
class kempnerforge.data.dataset.MemoryMappedDataset
Bases: Dataset
Pre-tokenized dataset backed by memory-mapped numpy files.
Expects .npy files containing 1D arrays of uint16/uint32 token IDs that have been pre-packed into fixed-length sequences.
File layout: each file stores a flat array of tokens. The dataset splits them into non-overlapping chunks of seq_len tokens. Multiple files are concatenated logically.
Parameters:
data_dir – Directory containing .npy token files.
seq_len – Sequence length (number of tokens per sample).
file_pattern – Glob pattern for data files.
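The chunking scheme above can be sketched with plain numpy: load a flat token file with mmap_mode for zero-copy access, then slice it into non-overlapping seq_len windows. The file name and toy data are illustrative; this is not KempnerForge's exact code.

```python
import numpy as np

seq_len = 4
tokens = np.arange(10, dtype=np.uint16)    # stand-in for a pre-tokenized shard
np.save("shard_000.npy", tokens)           # hypothetical file name

arr = np.load("shard_000.npy", mmap_mode="r")   # zero-copy mmap access
num_samples = len(arr) // seq_len               # trailing remainder is dropped

def get_sample(i):
    # Non-overlapping window i of seq_len tokens; np.asarray copies the
    # slice out of the memory map into a regular array.
    return np.asarray(arr[i * seq_len:(i + 1) * seq_len])

print(num_samples)          # 2
print(get_sample(1))        # [4 5 6 7]
```

Because the array is memory-mapped, only the pages actually touched by a slice are read from disk, which is what makes large pre-tokenized corpora cheap to index.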
class kempnerforge.data.dataset.HuggingFaceDataset
Bases: Dataset
HuggingFace dataset with on-the-fly tokenization and sequence packing.
Loads a HuggingFace dataset, tokenizes text on the fly, and packs multiple documents into fixed-length sequences (separated by EOS tokens).
Parameters:
dataset_name – HuggingFace dataset name (e.g., “allenai/c4”).
dataset_config – Optional config name (e.g., “wikitext-2-raw-v1”).
split – Dataset split (“train”, “validation”, etc.).
text_field – Name of the text column.
seq_len – Sequence length for packing.
tokenizer_path – Path or name for HuggingFace tokenizer.
__init__(dataset_name, split, text_field, seq_len, tokenizer_path, dataset_config=None, pack_sequences=False)
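EOS-separated sequence packing, as described above, can be sketched in a few lines. The token IDs and EOS value here are made up for illustration; the real class obtains them from a HuggingFace tokenizer.

```python
EOS = 0        # hypothetical end-of-sequence token ID
seq_len = 5

# Pre-tokenized documents (toy data standing in for tokenizer output).
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Concatenate documents into one stream, separating them with EOS,
# then slice the stream into fixed-length sequences.
stream = []
for doc in docs:
    stream.extend(doc)
    stream.append(EOS)

packed = [stream[i:i + seq_len]
          for i in range(0, len(stream) - seq_len + 1, seq_len)]
print(packed)   # [[1, 2, 3, 0, 4], [5, 0, 6, 7, 8]]
```

Packing avoids padding waste: every position in every sequence carries a real token, and the trailing partial chunk is simply dropped.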
class kempnerforge.data.dataset.StreamingHuggingFaceDataset
Bases: IterableDataset
Streaming HuggingFace dataset with on-the-fly tokenization and packing.
For very large datasets that don’t fit in memory. Streams documents, tokenizes on the fly, and packs into fixed-length sequences.
Handles distributed training by sharding the document stream across ranks (each rank processes every world_size-th document).
Use directly with torch.utils.data.DataLoader (no sampler needed; IterableDataset handles its own distribution).
Parameters:
dataset_name – HuggingFace dataset name (e.g., “allenai/c4”).
split – Dataset split (“train”, “validation”, etc.).
text_field – Name of the text column.
seq_len – Sequence length for packing.
tokenizer_path – Path or name for HuggingFace tokenizer.
dataset_config – Optional config name (e.g., “wikitext-2-raw-v1”).
rank – Current distributed rank (for document sharding).
world_size – Total number of ranks.
seed – Random seed for shuffling.
shuffle_buffer_size – Number of examples to buffer for shuffling.
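The sharding rule above (each rank processes every world_size-th document) can be sketched with itertools.islice. The function name and toy documents are illustrative, not the class's actual internals.

```python
from itertools import islice

def shard(stream, rank, world_size):
    # Rank r takes documents r, r + world_size, r + 2 * world_size, ...
    # so the ranks partition the stream without coordination.
    return islice(stream, rank, None, world_size)

docs = [f"doc{i}" for i in range(8)]
print(list(shard(iter(docs), rank=0, world_size=4)))   # ['doc0', 'doc4']
print(list(shard(iter(docs), rank=1, world_size=4)))   # ['doc1', 'doc5']
```

This strided scheme needs no global index and works on an unbounded stream, which is why it suits streaming corpora that never fully materialize in memory.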
class kempnerforge.data.dataset.MixtureDataset
Bases: Dataset
Concatenates multiple datasets for weighted mixing.
Global index space maps to sub-datasets via cumulative offsets. Each sample includes dataset_idx (an integer) so the training loop can compute per-dataset metrics.
Parameters:
datasets – List of map-style datasets to mix.
names – Human-readable name per dataset (for metrics logging).
__init__(datasets, names)
Parameters:
datasets (list[torch.utils.data.Dataset])
Return type:
None
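The cumulative-offset mapping described above can be sketched with the bisect module: a global index is routed to a (dataset_idx, local_idx) pair. The lengths and helper name are illustrative, not MixtureDataset's actual code.

```python
from bisect import bisect_right
from itertools import accumulate

lengths = [3, 5, 2]                  # sizes of the mixed sub-datasets
offsets = list(accumulate(lengths))  # cumulative offsets: [3, 8, 10]

def locate(global_idx):
    # Find which sub-dataset owns this global index, then convert to
    # a local index within that sub-dataset.
    dataset_idx = bisect_right(offsets, global_idx)
    start = offsets[dataset_idx - 1] if dataset_idx else 0
    return dataset_idx, global_idx - start

print(locate(0))   # (0, 0)  -> first sample of dataset 0
print(locate(3))   # (1, 0)  -> first sample of dataset 1
print(locate(9))   # (2, 1)  -> second sample of dataset 2
```

Returning dataset_idx with each sample is what lets the training loop aggregate loss and token counts per source dataset, as the class description notes.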