KempnerForge

PyTorch-native framework for fault-tolerant distributed training of foundation models on AI clusters.

This site is the canonical documentation for KempnerForge. It is built from the docs/ folder and the package source, and deploys to GitHub Pages on every push to main.

New here?

Start with the getting-started pages, then move to the flagship how-to once you want a real run:

  • Install — prerequisites, uv sync, environment verification, SLURM-specific notes.

  • Quickstart — five-minute walkthrough: debug run → multi-GPU → custom data → optimizer swap → MoE → hooks.

  • Your First Training Run — what the debug run actually did, explained log line by log line.

  • End-to-end training run — flagship walkthrough: tokenize → write a config → launch 1 GPU → launch 4 GPUs → resume → generate. The integration test of the docs.

  • Notebooks — six interactive notebooks for model inspection, attention visualization, activation extraction, checkpoint analysis, optimizer comparison, MoE routing.
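Hooks show up in both the quickstart and the activation-extraction notebook. The underlying pattern is framework-agnostic; here is a plain-Python sketch of forward hooks (all names are illustrative, not KempnerForge's actual API):

```python
class Layer:
    """Toy layer supporting PyTorch-style forward hooks."""

    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self._hooks = []

    def register_forward_hook(self, hook):
        self._hooks.append(hook)

    def __call__(self, x):
        out = self.fn(x)
        for hook in self._hooks:
            hook(self, x, out)  # (module, input, output) signature, as in torch
        return out


# Capture activations without touching the layer's own code.
activations = {}
layer = Layer("mlp", lambda x: [v * 2 for v in x])
layer.register_forward_hook(lambda m, inp, out: activations.setdefault(m.name, out))

y = layer([1, 2, 3])  # activations["mlp"] now holds the layer's output
```

The same idea — register a callable, let the forward pass invoke it — is what lets a notebook extract intermediate tensors from a model it does not modify.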

Looking for something specific?

  • How-to Guides — end-to-end researcher workflows: prepare data, scale 1→32 GPUs, compare optimizers, set up MoE experiments, extract activations, debug regressions.

  • Architecture — model forward pass, parallelism application order, data flow through the training loop.

  • Configuration — the typed dataclass system, CLI overrides, registry for swappable components.
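The Configuration page describes a typed dataclass system with CLI overrides and a registry of swappable components. A minimal sketch of that pattern, with all names hypothetical rather than taken from KempnerForge:

```python
from dataclasses import dataclass, field


@dataclass
class OptimizerConfig:
    name: str = "adamw"  # looked up in a component registry
    lr: float = 3e-4


@dataclass
class TrainConfig:
    steps: int = 1000
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)


def apply_override(cfg, dotted_key: str, raw: str):
    """Apply one 'a.b.c=value' style CLI override to a nested dataclass."""
    *path, leaf = dotted_key.split(".")
    target = cfg
    for part in path:
        target = getattr(target, part)
    current = getattr(target, leaf)
    # Cast the raw string to the type of the existing field value.
    setattr(target, leaf, type(current)(raw))


# Registry pattern: string names map to swappable factories.
OPTIMIZER_REGISTRY = {
    "adamw": lambda c: f"AdamW(lr={c.lr})",
    "sgd": lambda c: f"SGD(lr={c.lr})",
}

cfg = TrainConfig()
apply_override(cfg, "optimizer.lr", "0.01")
apply_override(cfg, "steps", "50")
opt = OPTIMIZER_REGISTRY[cfg.optimizer.name](cfg.optimizer)
```

Typed fields make overrides fail loudly on bad keys or uncastable values, and the registry is what makes components swappable from a config string alone.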

Subsystem reference

  • Training — training loop, optimizers, schedulers, loss functions, gradient utilities, hooks, evaluation, generation.

  • Distributed — DeviceMesh, FSDP2, tensor / expert / pipeline parallelism, FP8.

  • Mixture of Experts — routers, capacity and dispatch, auxiliary losses and balancing, FP8 interaction.

  • Data — memory-mapped datasets, HuggingFace streaming, sampler, stateful dataloader, mixing and annealing.

  • Checkpointing — DCP model and train-state, auto-resume, resharding, HuggingFace conversion.

  • Metrics and profiling — metrics tracker, MFU, memory monitor, profiler, WandB / TensorBoard backends.

  • Resilience — SLURM preemption, NaN detection, NCCL liveness, GPU health, elastic training.
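As background for the Mixture of Experts pages, top-k routing with a capacity limit can be sketched in pure Python. This is a simplified illustration of the concept, not KempnerForge's implementation:

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route(router_logits, k=2, capacity=2):
    """Assign each token to its top-k experts, dropping overflow.

    router_logits: per-token list of per-expert scores.
    Returns {expert_id: [(token_id, gate_weight), ...]}.
    """
    n_experts = len(router_logits[0])
    assignments = {e: [] for e in range(n_experts)}
    for tok, logits in enumerate(router_logits):
        probs = softmax(logits)
        topk = sorted(range(n_experts), key=lambda e: probs[e], reverse=True)[:k]
        for e in topk:
            if len(assignments[e]) < capacity:  # capacity limit: overflow is dropped
                assignments[e].append((tok, probs[e]))
    return assignments
```

Real routers return the gate weights so expert outputs can be mixed, and the dropped-token count feeds the auxiliary balancing losses the subsystem pages describe.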

Reference tables and API documentation

  • Reference — available configs, parallelism recipes, benchmarks, environment variables.

  • API Reference — auto-generated from the package docstrings.

Contributing

Documentation PRs follow the same flow as code PRs. The editor loop, build commands, and style conventions live in Contributing § Writing Docs.