Available configs¶
The `configs/train/` directory ships 17 training presets. Most are named `<model>_<gpu-count>_<parallelism>.toml`, so the filename doubles as a recipe. The `configs/model/` directory ships 4 model-only presets used by `scripts/convert_checkpoint.py` for DCP ↔ HuggingFace conversion.
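To make the naming convention concrete, here is how a hypothetical filename (not one of the shipped presets) decomposes:

```toml
# configs/train/llama3_7b_16gpu_fsdp.toml   -- hypothetical filename, for illustration only
#   <model>       -> llama3_7b  (which architecture the preset trains)
#   <gpu-count>   -> 16gpu      (how many GPUs the recipe was sized for)
#   <parallelism> -> fsdp       (the parallel layout it sets up)
```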
Dense training configs¶
| File | Purpose | Model | GPUs | Parallelism | Notes |
|---|---|---|---|---|---|
| | Tiny model, fast smoke test | 256-dim, 4-layer | 1+ | FSDP (`dp_shard = -1`) | 100 steps, … |
| `hf_wikitext.toml` | End-to-end HuggingFace streaming | 512-dim, 8-layer | 1+ | FSDP | GPT-2 tokenizer, … |
| | General-purpose Llama-3 7B | 7B | any | FSDP (`dp_shard = -1`) | 100K steps, … |
| | Baseline multi-node 7B | 7B | 32 | pure FSDP | 2M tokens/step, simplest multi-node recipe |
| | 7B with intra-node TP | 7B | 12 | TP=4 × FSDP=3 | TP within node (NVLink), FSDP across (IB) |
| | Long preemptible 7B run | 7B | 16 | FSDP | 100K steps, 210B tokens, ckpt every 500 steps |
| | FP8 compute + FSDP2 float8 AG | 7B | 16 | FSDP (`dp_shard = -1`) | … |
| | Muon optimizer + z-loss + chunked CE | 7B | 16 | FSDP (`dp_shard = -1`) | Tests Muon, chunked cross-entropy, z-loss together |
| | Full-stack validation run | 13B | 24 | TP=4 × FSDP=6 | WandB, profiling, eval, HF tokenizer, 1000 steps |
| | 13B with pipeline parallel | 13B | 32 | TP=4 × PP=2 × FSDP=4 | 40 layers → 20 per PP stage, … |
| | Custom 29B sized for H200 140GB | 29B | 32 | TP=4 × PP=2 × FSDP=4 | … |
| | 70B without PP bubble | 70B | 32 | TP=4 × FSDP=8 | No PP — fits via FSDP sharding alone |
| | 70B when memory is tight | 70B | 32 | TP=4 × PP=4 × FSDP=2 | 80 layers → 20 per PP stage, less FSDP sharding |
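The multi-dimension layouts above multiply out to the GPU count (e.g. TP=4 × PP=2 × FSDP=4 covers 32 GPUs), and pipeline stages split the layer stack evenly (40 layers over PP=2 gives 20 per stage). A minimal sketch of such a layout, using the mesh-dimension names from the Conventions section below as illustrative TOML keys rather than the presets' actual schema:

```toml
# Hypothetical parallelism block for the 13B pipeline-parallel shape on 32 GPUs.
# Section and key names are illustrative; only the mesh math comes from the table above.
[parallelism]
tp       = 4   # tensor parallel within a node (NVLink)
pp       = 2   # pipeline parallel: 40 layers -> 20 per stage
dp_shard = 4   # FSDP across the rest: 4 * 2 * 4 = 32 GPUs
```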
MoE training configs¶
| File | Purpose | Model | GPUs | Parallelism | MoE |
|---|---|---|---|---|---|
| | Tiny MoE smoke test | 256-dim, 4-layer | 1+ | FSDP | 4 experts, top-2, … |
| | Saturate 2 nodes with MoE | ~4B total / 1.8B active | 8 | TP=4 × FSDP=2 | 8 experts, top-2, … |
| | 24-GPU MoE stress test | ~7B total / 1.8B active | 24 | TP=4 × FSDP=6 | 8 experts, top-2, … |
| | MoE + Expert Parallel | ~4B total / 1.8B active | 32 | TP=4 × EP=2 × FSDP=4 | 8 experts, top-2, all-to-all across IB |
See the MoE Expert Parallel benchmark for the numbers the last config produced.
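A rough sketch of the knobs those rows summarize; `compile_model` is the key named in the Conventions below, while the MoE routing and expert-parallel keys are illustrative names, not the presets' real schema:

```toml
# Hypothetical MoE excerpt for the 32-GPU expert-parallel preset.
compile_model = false   # set on every MoE config: routing's data-dependent shapes break torch.compile

[moe]                   # section and key names below are assumptions
num_experts = 8         # experts per MoE layer
top_k       = 2         # top-2 routing
ep          = 2         # expert parallel: token exchange via all-to-all across IB
```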
Model-only configs¶
These don't include training fields — they're loaded by `scripts/convert_checkpoint.py` to describe the architecture when round-tripping checkpoints to HuggingFace.
| File | Architecture |
|---|---|
| | |
| | |
| | |
| | |
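As a rough illustration of what "model-only" means, a minimal architecture-description sketch; every key name below is hypothetical rather than the schema these presets actually use:

```toml
# Hypothetical model-only preset: architecture fields only, no training fields.
[model]
dim        = 4096   # hidden size
n_layers   = 32     # transformer blocks
n_heads    = 32     # attention heads
vocab_size = 32000  # tokenizer vocabulary
```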
Conventions¶
- `dp_shard = -1` in a config means “fill the remaining mesh dimension with FSDP” — the loader resolves this to `world_size / (dp_replicate·tp·pp·cp·ep)`. Most single-dimension configs (7B, FP8, Muon, debug) use `-1` so the same config works on any GPU count (see the sketch after this list).
- `compile_model = false` is set on every MoE config. Routing produces data-dependent shapes that break `torch.compile`’s graph — `JobConfig.validate(world_size)` logs a warning (not an error) if you combine them.
- Paths in these configs (`dataset_path = "/path/to/..."`) are placeholders — replace them with a real tokenized shard directory before running. The `hf_wikitext.toml` config is the only one that runs end-to-end without path edits (it streams from the HF Hub).
- Short `max_steps` (20–100) in the multi-node configs is a benchmark-sizing default, not a training budget. Override with `--train.max_steps=…` for real runs.
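A minimal sketch tying these conventions together; `dp_shard`, `dataset_path`, and `max_steps` are the keys named above, while the section placement (`[parallelism]`, `[train]`) is an assumption for illustration:

```toml
# Sketch of the dp_shard = -1 and placeholder-path conventions.
[parallelism]                   # section name assumed
tp       = 4
dp_shard = -1                   # resolved to world_size / (dp_replicate * tp * pp * cp * ep)
                                # e.g. 32 GPUs with tp=4 -> dp_shard = 8

[train]                         # section name inferred from the --train.max_steps override
max_steps    = 100              # benchmark-sizing default; raise for real runs
dataset_path = "/path/to/..."   # placeholder: point at a real tokenized shard directory
```

Launch-time overrides follow the same section.key path, e.g. the `--train.max_steps=…` flag mentioned above.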
See also¶
- Parallelism recipes — same data, indexed by (model, GPU count) rather than by filename.
- Benchmarks — measured throughput for the configs that were benchmarked end-to-end.
- Config sections — the fields every TOML key maps to.