KempnerForge

PyTorch-native framework for fault-tolerant distributed training of foundation models on AI clusters.

This site is the canonical documentation for KempnerForge. It is built from the docs/ folder and the package source, and deploys to GitHub Pages on every push to main.

New here?

Start with the getting-started pages, then move to the flagship how-to once you want a real run:

  • Install — prerequisites, uv sync, environment verification, SLURM-specific notes.

  • Quickstart — five-minute walkthrough: debug run → multi-GPU → custom data → optimizer swap → MoE → hooks.

  • Your First Training Run — what the debug run actually did, explained log line by log line.

  • End-to-end training run — flagship walkthrough: tokenize → write a config → launch 1 GPU → launch 4 GPUs → resume → generate. The integration test of the docs.

  • Notebooks — six interactive notebooks for model inspection, attention visualization, activation extraction, checkpoint analysis, optimizer comparison, MoE routing.
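Hooks show up in both the quickstart and the activation-extraction notebook. The underlying pattern is framework-agnostic; here is a plain-Python sketch of forward hooks (all names are illustrative, not KempnerForge's actual API):

```python
class Layer:
    """Toy layer supporting PyTorch-style forward hooks."""

    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self._hooks = []

    def register_forward_hook(self, hook):
        self._hooks.append(hook)

    def __call__(self, x):
        out = self.fn(x)
        for hook in self._hooks:
            hook(self, x, out)  # (module, input, output) signature, as in torch
        return out


# Capture activations without touching the layer's own code.
activations = {}
layer = Layer("mlp", lambda x: [v * 2 for v in x])
layer.register_forward_hook(lambda m, inp, out: activations.setdefault(m.name, out))

y = layer([1, 2, 3])  # activations["mlp"] now holds the layer's output
```

The same idea — register a callable, let the forward pass invoke it — is what lets a notebook extract intermediate tensors from a model it does not modify.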

Looking for something specific?

  • How-to Guides — end-to-end researcher workflows: prepare data, scale 1→32 GPUs, compare optimizers, set up MoE experiments, extract activations, debug regressions.

  • Architecture — model forward pass, parallelism application order, data flow through the training loop.

  • Configuration — the typed dataclass system, CLI overrides, registry for swappable components.
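The Configuration page describes a typed dataclass system with CLI overrides and a registry of swappable components. A minimal sketch of that pattern, with all names hypothetical rather than taken from KempnerForge:

```python
from dataclasses import dataclass, field


@dataclass
class OptimizerConfig:
    name: str = "adamw"  # looked up in a component registry
    lr: float = 3e-4


@dataclass
class TrainConfig:
    steps: int = 1000
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)


def apply_override(cfg, dotted_key: str, raw: str):
    """Apply one 'a.b.c=value' style CLI override to a nested dataclass."""
    *path, leaf = dotted_key.split(".")
    target = cfg
    for part in path:
        target = getattr(target, part)
    current = getattr(target, leaf)
    # Cast the raw string to the type of the existing field value.
    setattr(target, leaf, type(current)(raw))


# Registry pattern: string names map to swappable factories.
OPTIMIZER_REGISTRY = {
    "adamw": lambda c: f"AdamW(lr={c.lr})",
    "sgd": lambda c: f"SGD(lr={c.lr})",
}

cfg = TrainConfig()
apply_override(cfg, "optimizer.lr", "0.01")
apply_override(cfg, "steps", "50")
opt = OPTIMIZER_REGISTRY[cfg.optimizer.name](cfg.optimizer)
```

Typed fields make overrides fail loudly on bad keys or uncastable values, and the registry is what makes components swappable from a config string alone.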

Subsystem reference

  • Training — training loop, optimizers, schedulers, loss functions, gradient utilities, hooks, evaluation, generation.

  • Distributed — DeviceMesh, FSDP2, tensor / expert / pipeline parallelism, FP8.

  • Mixture of Experts — routers, capacity and dispatch, auxiliary losses and balancing, FP8 interaction.

  • Data — memory-mapped datasets, HuggingFace streaming, sampler, stateful dataloader, mixing and annealing.

  • Checkpointing — DCP model and train-state, auto-resume, resharding, HuggingFace conversion.

  • Metrics and profiling — metrics tracker, MFU, memory monitor, profiler, WandB / TensorBoard backends.

  • Resilience — SLURM preemption, NaN detection, NCCL liveness, GPU health, elastic training.
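As background for the Mixture of Experts pages, top-k routing with a capacity limit can be sketched in pure Python. This is a simplified illustration of the concept, not KempnerForge's implementation:

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route(router_logits, k=2, capacity=2):
    """Assign each token to its top-k experts, dropping overflow.

    router_logits: per-token list of per-expert scores.
    Returns {expert_id: [(token_id, gate_weight), ...]}.
    """
    n_experts = len(router_logits[0])
    assignments = {e: [] for e in range(n_experts)}
    for tok, logits in enumerate(router_logits):
        probs = softmax(logits)
        topk = sorted(range(n_experts), key=lambda e: probs[e], reverse=True)[:k]
        for e in topk:
            if len(assignments[e]) < capacity:  # capacity limit: overflow is dropped
                assignments[e].append((tok, probs[e]))
    return assignments
```

Real routers return the gate weights so expert outputs can be mixed, and the dropped-token count feeds the auxiliary balancing losses the subsystem pages describe.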

Reference tables and API documentation

  • Reference — available configs, parallelism recipes, benchmarks, environment variables.

  • API Reference — auto-generated from the package docstrings.

Contributing

Documentation PRs follow the same flow as code PRs. The editor loop, build commands, and style conventions live in Contributing § Writing Docs.