Prepare tokenized data¶
KempnerForge reads two kinds of data: pre-tokenized binary shards on disk, or text streamed from HuggingFace. Pick the first for production runs (faster, deterministic, works offline); pick the second for small experiments and prototyping.
Path A: pre-tokenized shards¶
On-disk format¶
MemoryMappedDataset reads a directory of shard files, each a flat 1-D array of token IDs:
| File extension | Encoding | Dtype rule |
|---|---|---|
| .npy | NumPy mmap | dtype stored in the header |
| .bin | Raw packed integers | inferred from size (4-byte multiple → uint32, else uint16), or from metadata.yaml |
One file is one flat array. The dataset concatenates shards logically
and splits the concatenated stream into non-overlapping
seq_len-token samples. uint16 covers token IDs up to 65,535 (GPT-2 at
50,257 and Llama-2 / Mistral at 32,000 fit; Llama-3 at 128,256 and
Gemma at 256,000 need uint32).
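The slicing semantics can be sketched in a few lines of NumPy (an illustration only: the real dataset memory-maps shards rather than loading them eagerly, and dropping the trailing partial window is an assumption here):

import numpy as np

shard_paths = ["shard_000.bin", "shard_001.bin"]  # illustrative names
seq_len = 2048

# Logical concatenation of all shards into one token stream.
stream = np.concatenate([np.fromfile(p, dtype=np.uint16) for p in shard_paths])

# Non-overlapping seq_len-token windows; a trailing partial window is dropped.
n_samples = len(stream) // seq_len
samples = stream[: n_samples * seq_len].reshape(n_samples, seq_len)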
Tokenize with your tool of choice¶
KempnerForge does not ship a tokenizer and does not depend on
tatm. Any tokenizer that outputs .bin or .npy shards works.
The workflow documented in scripts/prepare_data.py is the one used
internally at the Kempner Institute:
# Step 1 — tokenize (tatm is one of many tools that writes this format)
tatm tokenize --tokenizer meta-llama/Llama-3.2-1B \
--output-dir /path/to/tokenized/dataset \
<dataset-spec>
# Step 2 — validate
uv run python scripts/prepare_data.py /path/to/tokenized/dataset
Rolling your own: write flat arrays of uint16 / uint32 into .bin or
.npy files, optionally drop a metadata.yaml alongside them, and
point training at the directory.
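A minimal roll-your-own writer, for illustration (the tokenizer, document list, and paths here are placeholders, not anything KempnerForge prescribes):

import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # vocab 50,257, so uint16 is enough

ids = []
for doc in ["first document ...", "second document ..."]:
    ids.extend(tok(doc)["input_ids"])

shard = np.asarray(ids, dtype=np.uint16)
shard.tofile("/path/to/tokenized/dataset/shard_000.bin")  # raw packed integers
# or: np.save("/path/to/tokenized/dataset/shard_000.npy", shard)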
Validate with prepare_data.py¶
uv run python scripts/prepare_data.py /path/to/tokenized/dataset
scripts/prepare_data.py walks the directory and reports the shard count,
size in GB, total token count, dtype, and (if metadata.yaml is present)
the tokenizer name and vocab size. If the dtype is uint16 / uint32, it
prints the exact train-time flags:
Compatible with KempnerForge MemoryMappedDataset.
Train with:
--data.dataset_path=/path/to/tokenized/dataset
--data.file_pattern='*.bin'
--model.vocab_size=128256
Wire those into your TOML:
[data]
dataset_path = "/path/to/tokenized/dataset"
file_pattern = "*.bin"
[model]
vocab_size = 128256 # must match what you tokenized with
Optional: metadata.yaml (tatm format)¶
The validator reads metadata.yaml if present. Schema used by
tatm:
tokenized_info:
  dtype: uint32                        # or uint16
  tokenizer: meta-llama/Llama-3.2-1B
  vocab_size: 128256
If you generate shards yourself, dropping this file in the directory
lets the validator print the right vocab_size flag and lets
.bin-format shards bypass the heuristic dtype inference. It’s
optional — without it the validator still runs.
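If your shards come out of a script, emitting the file is a one-liner (a sketch; fill in whatever you actually tokenized with):

import yaml

meta = {"tokenized_info": {"dtype": "uint16",
                           "tokenizer": "gpt2",
                           "vocab_size": 50257}}
with open("/path/to/tokenized/dataset/metadata.yaml", "w") as f:
    yaml.safe_dump(meta, f)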
Path B: HuggingFace streaming¶
For experiments on standard benchmark datasets, skip the pre-tokenize step and stream directly:
[data]
hf_dataset_name = "wikitext"
hf_dataset_config = "wikitext-103-raw-v1"
hf_dataset_split = "train"
hf_dataset_text_field = "text"
hf_streaming = true
tokenizer_path = "gpt2"
With hf_streaming = true, StreamingHuggingFaceDataset downloads and
tokenizes on the fly during iteration. Each rank processes every
world_size-th document, so no coordination is needed for distributed runs.
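The sharding rule is simple enough to state as code (a sketch of the semantics, not the StreamingHuggingFaceDataset internals):

def shard_documents(docs, rank, world_size):
    # Each rank keeps every world_size-th document, offset by its rank.
    for idx, doc in enumerate(docs):
        if idx % world_size == rank:
            yield doc

# Rank 1 of 4 sees documents 1, 5, 9, ...
assert list(shard_documents(range(12), rank=1, world_size=4)) == [1, 5, 9]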
The tokenizer path is loaded through
transformers.AutoTokenizer.from_pretrained, so it can be a
HuggingFace hub ID ("gpt2", "meta-llama/Llama-3.2-1B") or a
local directory.
Cache the tokenizer first¶
Compute nodes typically have no or limited internet access, so the HuggingFace hub is unreachable from them, and their bandwidth is much lower anyway (~1 Gbps vs. ~100 Gbps on login nodes). Cache the tokenizer once on the login node:
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('gpt2')"
The tokenizer lands in ~/.cache/huggingface/ and the compute-node run
picks it up transparently. See configs/train/hf_wikitext.toml for the
full HF-streaming reference config.
Non-streaming (eager) HF loading¶
hf_streaming = false switches to HuggingFaceDataset, which
downloads the full dataset and tokenizes in one pass before training
starts. Faster per-step once loaded, but costs memory and startup
time for large datasets. Use only for small benchmarks where the
whole dataset fits.
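For a back-of-envelope sense of "small": wikitext-103 is on the order of 100M tokens, which at roughly 2 bytes per token is only ~0.2 GB once tokenized, so eager loading is cheap; a 10B-token corpus at the same rate would need ~20 GB resident per rank before the first step. The exact figure depends on the dtype the loader materializes, so treat these as rough estimates.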
Which path to pick¶
| Constraint | Path |
|---|---|
| Deterministic, reproducible runs | Pre-tokenized (Path A) |
| Multi-TB datasets | Pre-tokenized (Path A) |
| Air-gapped compute nodes | Pre-tokenized (Path A) |
| Quick experiment on a benchmark dataset | HF streaming (Path B) |
| Dataset bigger than disk | HF streaming (Path B) |
| Same data across reruns, offline | Either (but pre-tokenize once) |
Pre-tokenizing is a one-time cost that buys strictly better runtime behavior (zero-copy mmap, exact resumption, deterministic sample order); HF streaming is optimized for exploration.
Mixed datasets¶
Both paths support weighted mixing via data.datasets:
[[data.datasets]]
name = "code"
path = "/data/tokenized/code"
weight = 0.3
[[data.datasets]]
name = "web"
hf_name = "HuggingFaceFW/fineweb"
hf_config = "sample-10BT"
weight = 0.7
See Data § Mixing and annealing for the mechanics and phase-transition recipes.
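As a first intuition for what the weights mean (a sketch only; the actual sampling and annealing mechanics live in Data § Mixing and annealing):

import random

weights = {"code": 0.3, "web": 0.7}
# Each sample draws its source with probability proportional to its weight.
picks = random.choices(list(weights), weights=list(weights.values()), k=10000)
# picks is ~30% "code" and ~70% "web" in expectation.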
Sequence packing¶
pack_sequences = true packs multiple shorter documents into one
seq_len window, with a doc_ids tensor that prevents cross-document
attention. Useful when document lengths are much smaller than
seq_len; otherwise the default (one document per sample, truncated
or padded) is fine.
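One way to picture what doc_ids buys (a toy illustration of a block-diagonal causal mask; variable names are hypothetical, not the KempnerForge internals):

import torch

# One packed window of seq_len = 8 holding three short documents.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
seq_len = doc_ids.numel()

# Token i may attend to token j only if j <= i (causal) and both tokens
# belong to the same document, so attention never crosses a boundary.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
same_doc = doc_ids.unsqueeze(1) == doc_ids.unsqueeze(0)
attn_mask = causal & same_doc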
Resumption¶
Both paths work with checkpoint auto-resume, but via different mechanisms:
- Pre-tokenized (Path A): wrapped in StatefulDataLoader, which saves and restores the sampler's position via DistributedSampler.set_skip(). A job that preempts at step 3,500 picks up at sample 3500 × batch_size × world_size without replay.
- HF streaming (Path B): wrapped in a plain DataLoader; StreamingHuggingFaceDataset tracks its own position and exposes load_state_dict / state_dict, which scripts/train.py drives directly. Same guarantee (no replay, no skip) via a different code path.
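With toy numbers, the Path A skip arithmetic works out as:

# Hypothetical run: preempted at optimizer step 3,500.
step, batch_size, world_size = 3500, 32, 8
resume_at = step * batch_size * world_size  # sample 896,000; nothing replayed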
See also¶
- Data § MemoryMappedDataset: the full MemoryMappedDataset API and the packing implementation.
- Data § StatefulDataLoader: StatefulDataLoader and resumption.
- Configuration § [data]: every DataConfig field with its default.
- End-to-end training run: uses Path B (HF wikitext) as the runnable example.
- configs/train/hf_wikitext.toml: reference HF-streaming config.