Config sections¶
JobConfig aggregates ten typed sub-configs. Each one lives in its own
module under
kempnerforge/config/
and declares its fields with dataclass defaults. TOML sections map
one-to-one onto these dataclass attributes:
[model] # → config.model (ModelConfig)
[train] # → config.train (TrainConfig)
[optimizer] # → config.optimizer
[scheduler] # → config.scheduler
[data] # → config.data
[eval] # → config.eval
[distributed] # → config.distributed
[checkpoint] # → config.checkpoint
[metrics] # → config.metrics
[profiling] # → config.profiling
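As a concrete illustration of that one-to-one mapping, the sketch below routes a `[train]` table into a stand-in dataclass. This is not KempnerForge code: `batch_size` and `seq_len` are hypothetical field names, not necessarily the ones TrainConfig actually declares.

```python
import tomllib
from dataclasses import dataclass, fields

# Hypothetical stand-in for one sub-config; the real TrainConfig (and its
# actual field names) lives in kempnerforge/config/training.py.
@dataclass
class TrainConfigSketch:
    batch_size: int = 8
    seq_len: int = 2048

toml_text = """
[train]
batch_size = 16
seq_len = 4096
"""

raw = tomllib.loads(toml_text)
# Each [section] table overrides the matching dataclass's defaults key by key.
known = {f.name for f in fields(TrainConfigSketch)}
train = TrainConfigSketch(**{k: v for k, v in raw["train"].items() if k in known})
print(train)  # TrainConfigSketch(batch_size=16, seq_len=4096)
```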
JobConfig¶
Owns the ten sub-configs and the cross-section validate method.
kempnerforge/config/job.py.
| Field | Type | Purpose |
|---|---|---|
| model | ModelConfig | architecture + MoE knobs |
| train | TrainConfig | loop-level hyperparameters |
| optimizer | OptimizerConfig | registry key + LR + betas + optimizer-specific knobs |
| scheduler | SchedulerConfig | LR schedule shape and warmup |
| data | DataConfig | dataset sources, mixing, annealing |
| eval | EvalConfig | in-loop eval cadence and data source |
| distributed | DistributedConfig | parallelism dims, NCCL timeout |
| checkpoint | CheckpointConfig | DCP save cadence, retention, resume path |
| metrics | MetricsConfig | logging cadence and backends |
| profiling | ProfilingConfig | torch.profiler window and trace dir |
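The table above implies a shape roughly like the sketch below. This is not the code in kempnerforge/config/job.py; the import paths simply follow the per-section module list on this page and should be treated as assumptions.

```python
# Sketch of the aggregate shape described above, not the actual JobConfig.
from dataclasses import dataclass, field

from kempnerforge.config.model import ModelConfig
from kempnerforge.config.training import TrainConfig
from kempnerforge.config.optimizer import OptimizerConfig
from kempnerforge.config.scheduler import SchedulerConfig
from kempnerforge.config.data import DataConfig
from kempnerforge.config.eval import EvalConfig
from kempnerforge.config.distributed import DistributedConfig
from kempnerforge.config.checkpoint import CheckpointConfig
from kempnerforge.config.metrics import MetricsConfig
from kempnerforge.config.profiling import ProfilingConfig

@dataclass
class JobConfigSketch:
    model: ModelConfig = field(default_factory=ModelConfig)
    train: TrainConfig = field(default_factory=TrainConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)
    scheduler: SchedulerConfig = field(default_factory=SchedulerConfig)
    data: DataConfig = field(default_factory=DataConfig)
    eval: EvalConfig = field(default_factory=EvalConfig)
    distributed: DistributedConfig = field(default_factory=DistributedConfig)
    checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)
    metrics: MetricsConfig = field(default_factory=MetricsConfig)
    profiling: ProfilingConfig = field(default_factory=ProfilingConfig)

    def validate(self, world_size: int) -> None:
        # Cross-section checks would go here; see the Validation rules page.
        self.distributed.validate_world_size(world_size)
```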
[model] — ModelConfig¶
Architecture hyperparameters and MoE knobs.
kempnerforge/config/model.py.
Dense¶
| Field | Type | Default | Purpose |
|---|---|---|---|
| | | | hidden size |
| | | | number of transformer blocks |
| | | | attention heads |
| | | | GQA: |
| | | | embedding table size |
| | | | scales Llama-style |
| | | | hard-override the computed FFN width |
| norm_type | | | registry key for norm builder |
| | | | norm epsilon |
| | | | MLP activation ( |
| | | | RoPE table length; |
| | | | RoPE frequency base |
| | | | share embedding and output-head weight |
| | | | RMSNorm over Q/K per head before RoPE |
| | | | weight-init std (GPT-2 / Llama convention) |
| | | | |
| | | | one of |
MoE (all defaults produce a dense model)¶
| Field | Type | Default | Purpose |
|---|---|---|---|
| | | | |
| | | | experts selected per token |
| | | | MoE every N layers (1=all, 2=alternating) |
| | | | |
| | | | shared experts that always process every token |
| | | | coefficient in training loss |
| | | | |
| | | | sequence-level balance loss (0 = off) |
| | | | per-expert gradient normalization |
| | | | |
| | | | pack expert weights into one tensor per projection |
Computed properties: is_moe, head_dim, computed_ffn_hidden_dim,
num_params_estimate.
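The exact derivations live in kempnerforge/config/model.py. As a rough sketch of the usual conventions (hypothetical field names, Llama-style FFN sizing assumed, num_params_estimate omitted):

```python
# Illustrative derivations only; not the real ModelConfig.
from dataclasses import dataclass

@dataclass
class ModelConfigSketch:
    dim: int = 1024
    n_layers: int = 16
    n_heads: int = 16
    ffn_dim_multiplier: float = 1.0
    ffn_hidden_dim: int | None = None   # hard override of the computed FFN width
    num_experts: int = 1                # 1 => dense

    @property
    def is_moe(self) -> bool:
        return self.num_experts > 1

    @property
    def head_dim(self) -> int:
        return self.dim // self.n_heads

    @property
    def computed_ffn_hidden_dim(self) -> int:
        if self.ffn_hidden_dim is not None:
            return self.ffn_hidden_dim
        # Llama convention: 2/3 * 4 * dim, scaled, rounded up to a multiple of 256.
        hidden = int(2 * 4 * self.dim / 3 * self.ffn_dim_multiplier)
        return 256 * ((hidden + 255) // 256)
```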
[train] — TrainConfig¶
Training-loop hyperparameters.
kempnerforge/config/training.py.
| Field | Type | Default | Purpose |
|---|---|---|---|
| | | | per-device micro-batch size |
| | | | tokens per sequence |
| | | | training-loop termination |
| | | | microbatches per optimizer step |
| | | | |
| | | | torch/numpy/python RNG seed |
| | | | wrap the model with |
| | | | master-weight dtype; |
| | | | AC policy |
| | | | |
| | | | logit-magnitude regularizer (PaLM uses |
| | | | chunk size for |
| | | | graceful shutdown timeout before forced exit |
| | | | NCCL liveness all-reduce every N steps ( |
Computed properties: param_dtype, is_fp8.
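The logit-magnitude regularizer row in the table above refers to the PaLM-style z-loss. A minimal sketch of what that term computes (the coefficient name and where it is applied in the real loss are assumptions):

```python
# PaLM-style z-loss: penalize the squared log-partition of the logits.
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                              z_loss_coeff: float = 1e-4) -> torch.Tensor:
    # logits: [batch, seq, vocab]; targets: [batch, seq]
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    log_z = torch.logsumexp(logits, dim=-1)   # log-partition per token
    return ce + z_loss_coeff * (log_z ** 2).mean()
```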
[optimizer] — OptimizerConfig¶
Optimizer settings. name picks the registry builder; the other fields
are shared (AdamW/Lion) or optimizer-specific.
kempnerforge/config/optimizer.py.
| Field | Type | Default | Purpose |
|---|---|---|---|
| name | | | one of |
| | | | peak learning rate |
| | | | L2 on 2-D params |
| | | | AdamW / Lion momenta |
| | | | numerical safety (AdamW) |
| | | | use fused AdamW when available |
| | | | Muon momentum coefficient |
| | | | Newton–Schulz iterations for Muon |
| | | | LR for 1-D params in Muon’s AdamW fallback; |
| | | | internal warmup for schedule-free |
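For the AdamW case, the shared fields map onto torch.optim.AdamW roughly as sketched below. The builder function itself is hypothetical; the real one is resolved from name via the registry.

```python
# Illustrative AdamW builder showing where the shared fields land.
import torch

def build_adamw(params, lr: float, weight_decay: float,
                betas: tuple[float, float], eps: float,
                fused: bool) -> torch.optim.Optimizer:
    return torch.optim.AdamW(params, lr=lr, betas=betas, eps=eps,
                             weight_decay=weight_decay, fused=fused)
```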
[scheduler] — SchedulerConfig¶
LR schedule shape and warmup.
kempnerforge/config/scheduler.py.
| Field | Type | Default | Purpose |
|---|---|---|---|
| | | | |
| | | | linear warmup length |
| | | | |
| | | | floor = |
| | | | WSD: steps at constant LR between warmup and decay |
| | | | WSD cooldown shape |
| | | | REX exponent: |
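To make the WSD knobs concrete, here is an illustrative warmup-stable-decay multiplier with a linear cooldown. Parameter names are hypothetical; the real schedule builders are registry entries.

```python
# Warmup-stable-decay multiplier applied on top of the peak LR (sketch only).
def wsd_lr_multiplier(step: int, warmup_steps: int, stable_steps: int,
                      total_steps: int, min_lr_ratio: float = 0.0) -> float:
    if step < warmup_steps:                    # linear warmup to 1.0
        return step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:     # constant plateau
        return 1.0
    # linear cooldown from 1.0 down to the floor over the remaining steps
    decay_steps = max(1, total_steps - warmup_steps - stable_steps)
    frac = (step - warmup_steps - stable_steps) / decay_steps
    return max(min_lr_ratio, 1.0 - (1.0 - min_lr_ratio) * frac)
```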
[data] — DataConfig¶
Single dataset, HuggingFace source, or mixture; optional phase schedule.
kempnerforge/config/data.py.
| Field | Type | Default | Purpose |
|---|---|---|---|
| | | | directory of pre-tokenized shards |
| | | | glob inside |
| | | | path or HF id for the tokenizer |
| | | | DataLoader workers |
| | | | DataLoader pin memory |
| | | | DataLoader prefetch factor |
| | | | HF dataset id (e.g. |
| | | | HF dataset config (e.g. |
| | | | HF split |
| | | | field to tokenize |
| | | | use |
| | | | document-aware packing with cross-doc isolation (feeds |
| | | | multi-dataset mixture (overrides |
| | | | weight scaling; |
| phases | | | multi-phase schedule with weight/LR transitions |
| anneal_start_step | | | syntactic sugar for a common 2-phase annealing pattern ( |
| | | | per-dataset weights applied at |
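Mixture sampling is weight-proportional in spirit. The sketch below shows the idea with hypothetical source names; it is not kempnerforge's dataloader.

```python
# Weight-proportional source selection (illustration only).
import random

def pick_source(names: list[str], weights: list[float], rng: random.Random) -> str:
    # Normalize relative weights and draw one source per sample.
    total = sum(weights)
    return rng.choices(names, weights=[w / total for w in weights], k=1)[0]

rng = random.Random(0)
counts = {"web": 0, "code": 0}
for _ in range(10_000):
    counts[pick_source(["web", "code"], [3.0, 1.0], rng)] += 1
print(counts)  # roughly 3:1
```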
DatasetSource¶
| Field | Type | Default | Purpose |
|---|---|---|---|
| path | | | pre-tokenized directory |
| | | | relative sampling weight (must be |
| | | | name for per-dataset metrics (auto-derived if empty) |
| hf_name | | | HF dataset id |
| | | | HF dataset config |
Either path or hf_name must be set per source.
TrainingPhase¶
| Field | Type | Default | Purpose |
|---|---|---|---|
| start_step | | | step at which the phase activates |
| | | | per-dataset weights for this phase |
| | | | multiplier applied to scheduler LR |
phases[*].start_step must be strictly increasing; phases and
anneal_start_step are mutually exclusive.
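A minimal sketch of those two constraints (not the real validate()):

```python
# Check the phase-schedule rules stated above.
def check_phase_schedule(phase_start_steps: list[int],
                         anneal_start_step: int | None) -> None:
    if phase_start_steps and anneal_start_step is not None:
        raise ValueError("phases and anneal_start_step are mutually exclusive")
    if any(b <= a for a, b in zip(phase_start_steps, phase_start_steps[1:])):
        raise ValueError("phases[*].start_step must be strictly increasing")
```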
[eval] — EvalConfig¶
In-loop evaluation. Disabled by default.
kempnerforge/config/eval.py.
| Field | Type | Default | Purpose |
|---|---|---|---|
| enabled | | | gate in-loop eval |
| | | | eval every N training steps |
| | | | eval batches per evaluation |
| dataset_path | | | pre-tokenized eval shards |
| | | | glob inside |
| hf_dataset_name | | | HF dataset id |
| | | | HF dataset config |
| | | | HF split |
If enabled=True, at least one of dataset_path / hf_dataset_name
must be set; validate() rejects the combination otherwise.
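A minimal sketch of that rule (not the real validate()):

```python
# Eval needs a data source when it is enabled.
def check_eval_source(enabled: bool, dataset_path: str | None,
                      hf_dataset_name: str | None) -> None:
    if enabled and not (dataset_path or hf_dataset_name):
        raise ValueError("[eval] is enabled but neither dataset_path nor hf_dataset_name is set")
```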
[distributed] — DistributedConfig¶
Parallelism dimensions and NCCL settings.
kempnerforge/config/distributed.py.
| Field | Type | Default | Purpose |
|---|---|---|---|
| dp_shard | | | FSDP shard degree; |
| dp_replicate | | | DDP-style replication over FSDP groups |
| tp | | | tensor parallel |
| pp | | | pipeline parallel |
| | | | pipeline schedule |
| cp | | | context parallel (stub; PyTorch 2.11 ring attention) |
| ep | | | expert parallel (MoE only) |
| | | | NCCL collective timeout |
| | | | |
The product dp_replicate × dp_shard × tp × pp × cp × ep must equal
world_size. Methods: validate_world_size(ws), resolve(ws).
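A minimal sketch of the world-size constraint (not the real validate_world_size()):

```python
# Product of parallelism dims must cover the whole job.
import math

def validate_world_size(dims: dict[str, int], world_size: int) -> None:
    # dims would hold dp_replicate, dp_shard, tp, pp, cp, ep after resolve().
    product = math.prod(dims.values())
    if product != world_size:
        raise ValueError(f"parallelism dims multiply to {product}, "
                         f"expected world_size={world_size}")
```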
[checkpoint] — CheckpointConfig¶
DCP-based checkpointing.
kempnerforge/config/checkpoint.py.
| Field | Type | Default | Purpose |
|---|---|---|---|
| | | | root directory for |
| | | | save every N steps |
| | | | DCP async-save mode |
| | | | retain the most recent N checkpoints |
| | | | explicit resume path (overrides |
| | | | dtype for HF exports via |
| | | | FQN prefixes to skip on load (e.g. to reinit a head) |
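As an illustration of the retention behaviour, a keep-last-N pruning pass might look like the sketch below. The step_&lt;N&gt;/ directory layout is an assumption, not the actual checkpoint layout.

```python
# Keep only the newest `keep_last` checkpoint directories (sketch only).
import shutil
from pathlib import Path

def prune_checkpoints(ckpt_root: Path, keep_last: int) -> None:
    step_dirs = sorted(ckpt_root.glob("step_*"),
                       key=lambda p: int(p.name.split("_")[-1]))
    stale = step_dirs[:-keep_last] if keep_last > 0 else []
    for old in stale:
        shutil.rmtree(old)  # drop everything but the newest keep_last saves
```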
[metrics] — MetricsConfig¶
Logging cadence and backend toggles.
kempnerforge/config/metrics.py.
| Field | Type | Default | Purpose |
|---|---|---|---|
| | | | log every N steps (stdout + enabled backends) |
| | | | turn on WandB backend |
| | | | turn on TensorBoard backend |
| | | | WandB project name |
| | | | |
| | | | restored from checkpoint on resume; empty = new run |
| | | | TB log directory |
[profiling] — ProfilingConfig¶
torch.profiler window.
kempnerforge/config/profiling.py.
| Field | Type | Default | Purpose |
|---|---|---|---|
| | | | run the profiler during the loop |
| | | | first step recorded |
| | | | last step recorded (must be |
| | | | output directory for Chrome/Perfetto traces |
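A hedged sketch of how such a window could be wired up with torch.profiler; the field names are hypothetical and the real setup may differ.

```python
# Profile steps [start_step, end_step] and dump traces to trace_dir (sketch).
import torch

def make_profiler(start_step: int, end_step: int,
                  trace_dir: str) -> torch.profiler.profile:
    return torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA],
        # wait skips the first `start_step` iterations, then records the window.
        schedule=torch.profiler.schedule(wait=start_step, warmup=0,
                                         active=end_step - start_step + 1,
                                         repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(trace_dir),
    )
# The training loop would enter `with make_profiler(...) as prof:` and call
# prof.step() once per iteration.
```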
Where to read next¶
CLI overrides — reshape any of these fields from the command line.
Validation rules — what __post_init__ and validate(world_size) enforce.
Registry — how the string keys above (moe_router, norm_type, optimizer.name, scheduler.name, loss_fn, model_type) resolve to builders.