Data¶
KempnerForge’s data pipeline: the dataset types, the sampler that partitions work across data-parallel ranks, the stateful dataloader that makes checkpoint resume possible, and the mixing / annealing hooks used for curriculum training.
Entry point: scripts/train.py picks one of four paths based on
config — pre-tokenized mmap, HuggingFace eager, HuggingFace streaming,
or multi-source mixture.
At a glance¶
Config path |
Dataset class |
Sampler |
Notes |
|---|---|---|---|
|
|
|
Pre-tokenized |
|
|
|
Eager download, tokenize once, pack into RAM. |
|
|
(none — |
On-the-fly tokenization from a streaming Hub dataset. |
|
|
|
Weighted mixing across multiple sources; phase transitions rebalance weights. |
All of the map-style datasets and samplers implement state_dict() /
load_state_dict(). StatefulDataLoader wraps them and tracks the
batch position inside an epoch so training can resume exactly after a
checkpoint — see Stateful dataloader.
Config¶
Everything here is driven by DataConfig (kempnerforge/config/data.py):
[data]
dataset_path = "" # pre-tokenized shard directory
file_pattern = "*.npy" # glob for shard files (also "*.bin")
tokenizer_path = "" # HF tokenizer name or local path
num_workers = 4
pin_memory = true
prefetch_factor = 2
pack_sequences = false # cross-doc label masking via EOS
# HuggingFace path
hf_dataset_name = "" # default: None (str | None)
hf_dataset_config = "" # default: None (str | None)
hf_dataset_split = "train"
hf_dataset_text_field = "text"
hf_streaming = false
# Multi-dataset mixing
datasets = [] # list of DatasetSource tables
mix_temperature = 1.0 # weight scaling (1.0 = as-is, >1 = more uniform)
# Phase scheduling (step-triggered weight + LR transitions)
phases = []
anneal_start_step = 0 # shortcut for a common 2-phase pattern
anneal_weights = {}
See Configuration § [data] — DataConfig
for the per-field reference.
Pages¶
See also¶
Checkpointing § Train state — dataloader state infrastructure exists but the shipped training loop doesn’t wire it into the save; resume restarts the epoch.
Training § Training loop — where the dataloader is consumed step by step.
Configuration §
[data]— DataConfig — every knob on this page.