KempnerForge

Getting Started

  • Getting Started
    • Install
    • Quickstart
    • Your First Training Run
    • Notebooks

Architecture and How-to

  • Architecture
    • Model
    • Parallelism Order
    • Data Flow
  • How-to Guides
    • Build a model
    • Prepare tokenized data
    • End-to-end training run
    • Scaling guide
    • SLURM distributed setup
    • Run evaluation
    • Generate from a checkpoint
    • Debug training regressions
    • Compare optimizers
    • Mix datasets and anneal data weights
    • Turn on FP8 training
    • MoE experiments
    • Extract activations for interpretability

Subsystems

  • Training
    • Training loop
    • Optimizers
    • Schedulers
    • Losses
    • Gradient utilities
    • Evaluation
    • Generation
    • Hooks
  • Distributed
    • Device mesh
    • FSDP2
    • Tensor parallelism
    • Expert parallelism
    • Pipeline parallelism
    • FP8
  • Mixture of Experts
    • Routers
    • Aux loss and balancing
    • Capacity and dispatch
    • MoE + FP8
  • Data
    • Memory-mapped dataset
    • HuggingFace datasets
    • Mixing and annealing
    • Stateful dataloader
    • Sampler
  • Checkpointing
    • DCP model + optimizer
    • Resharding
    • Train state
    • Auto-resume
    • HuggingFace conversion
  • Metrics and profiling
    • Metrics tracker
    • WandB backend
    • TensorBoard backend
    • MFU
    • Memory monitoring
    • Profiler
  • Resilience
    • SLURM preemption
    • NaN detection
    • GPU health
    • NCCL liveness
    • SLURM elastic helpers

Configuration and Reference

  • Configuration
    • Config sections
    • CLI overrides
    • Validation rules
    • Registry
  • Reference
    • Available configs
    • Parallelism recipes
    • Benchmarks
    • Environment variables

API

  • API Reference
    • kempnerforge.config
      • kempnerforge.config.checkpoint
      • kempnerforge.config.data
      • kempnerforge.config.distributed
      • kempnerforge.config.eval
      • kempnerforge.config.job
      • kempnerforge.config.loader
      • kempnerforge.config.metrics
      • kempnerforge.config.model
      • kempnerforge.config.optimizer
      • kempnerforge.config.profiling
      • kempnerforge.config.registry
      • kempnerforge.config.scheduler
      • kempnerforge.config.schema
      • kempnerforge.config.training
    • kempnerforge.model
      • kempnerforge.model.attention
      • kempnerforge.model.embedding
      • kempnerforge.model.generate
      • kempnerforge.model.hooks
      • kempnerforge.model.init
      • kempnerforge.model.mlp
      • kempnerforge.model.moe
      • kempnerforge.model.norm
      • kempnerforge.model.position
      • kempnerforge.model.router
      • kempnerforge.model.transformer
    • kempnerforge.distributed
      • kempnerforge.distributed.expert_parallel
      • kempnerforge.distributed.parallel
      • kempnerforge.distributed.pipeline_parallel
      • kempnerforge.distributed.setup
      • kempnerforge.distributed.tensor_parallel
      • kempnerforge.distributed.utils
    • kempnerforge.data
      • kempnerforge.data.dataloader
      • kempnerforge.data.dataset
      • kempnerforge.data.sampler
    • kempnerforge.training
      • kempnerforge.training.eval
      • kempnerforge.training.grad
      • kempnerforge.training.hooks
      • kempnerforge.training.loss
      • kempnerforge.training.optimizer
      • kempnerforge.training.scheduler
    • kempnerforge.checkpoint
      • kempnerforge.checkpoint.async_save
      • kempnerforge.checkpoint.manager
      • kempnerforge.checkpoint.state
    • kempnerforge.resilience
      • kempnerforge.resilience.elastic
      • kempnerforge.resilience.health
      • kempnerforge.resilience.signal_handler
    • kempnerforge.metrics
      • kempnerforge.metrics.logger
      • kempnerforge.metrics.memory
      • kempnerforge.metrics.mfu
      • kempnerforge.metrics.tracker
    • kempnerforge.profiling
      • kempnerforge.profiling.cuda_timer
      • kempnerforge.profiling.profiler

Agent Integration

  • KempnerForge plugin

Project

  • Contributing to KempnerForge
Copyright © 2026, Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University · v0.1.0
Made with Sphinx and @pradyunsg's Furo