Getting Started
Installation
Requirements
The following packages are required:
PyTorch: Deep learning framework
wandb: Weights & Biases experiment tracking
Hydra: Configuration management
NumPy: Numerical computing
tatm: Dataset loading support
You can install the core requirements using pip:
pip install torch
pip install wandb
pip install hydra-core
pip install numpy
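To confirm the core requirements installed correctly, you can run a quick check (a minimal sketch; it only looks up whether each package is importable — note that hydra-core is imported as hydra):

```python
import importlib.util

# Core packages listed above; "hydra" is the import name for hydra-core.
required = ["torch", "wandb", "hydra", "numpy"]
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("missing packages:", missing or "none")
```

An empty result means all four packages are resolvable from the current environment.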
For dataset loading, we currently use the tatm package, also developed at the Kempner Institute. Install it using:
pip install git+https://github.com/KempnerInstitute/tatm.git
More details about tatm can be found in its GitHub repository.
Note
In future versions, we will support default dataset loading without requiring external packages.
Installation Steps
TMRC has been tested with torch 2.6.0 and Python 3.12.
Install uv (skip if already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
Clone the repository:
git clone git@github.com:KempnerInstitute/tmrc.git
Create the environment and install the package (uv reads .python-version to select Python 3.12):
cd tmrc
uv sync
Activate the environment with:
source .venv/bin/activate
Alternatively, prefix any command with uv run to run it inside the environment without activating it (e.g., uv run python src/tmrc/core/training/train.py).
Note
uv brings its own Python, and PyTorch wheels bundle the CUDA runtime and cuDNN, so no module load is required on the Kempner AI cluster for the standard training path. Only load cuda/12.4.1-fasrc01 if you build custom CUDA extensions (nvcc) or compile something like flash-attention from source.
Running Experiments
Login to Weights & Biases to enable experiment tracking:
wandb login
Request compute resources. For example, on the Kempner AI cluster, to request an H100 GPU:
salloc --partition=kempner_h100 --account=<fairshare account> --ntasks=1 --nodes=1 --cpus-per-task=24 --mem=375G --gres=gpu:1 --time=00-07:00:00
If you are not using the Kempner AI cluster, you can run experiments on your local machine (if you have a GPU) or on cloud services like AWS, GCP, or Azure. TMRC should automatically find the available GPU.
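Device auto-detection along these lines can be sketched as follows (illustrative only, not TMRC's actual implementation):

```python
# Fall back to CPU when no CUDA device (or no torch install) is available.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"
print(f"training on: {device}")
```

On a machine with a visible GPU this selects "cuda"; everywhere else it degrades gracefully to "cpu".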
Activate the environment:
source .venv/bin/activate
Launch training:
python src/tmrc/core/training/train.py
Request compute resources. For example, on the Kempner AI cluster, to request two nodes with four H100 GPUs each:
salloc --partition=kempner_h100 --account=<fairshare account> --ntasks-per-node=4 --nodes=2 --cpus-per-task=24 --mem=375G --gres=gpu:4 --time=00-07:00:00
Activate the environment:
source .venv/bin/activate
Launch training:
srun python src/tmrc/core/training/train.py
On other Slurm-based clusters, TMRC should automatically detect the allocated nodes and GPUs.
Alternatively, for single-node multi-GPU training, TMRC supports torchrun. For example, to run an experiment on four GPUs available on the current node:
torchrun --nproc_per_node=4 src/tmrc/core/training/train.py
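torchrun launches one worker process per GPU and exports the standard environment variables (RANK, LOCAL_RANK, WORLD_SIZE) that distributed training code reads at startup. A minimal sketch of how a worker can inspect them (illustrative, not TMRC code):

```python
import os

# torchrun sets these per worker; the defaults cover a plain single-process run.
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
print(f"worker {rank}/{world_size} (local rank {local_rank})")
```

With --nproc_per_node=4, four copies of this run with ranks 0 through 3 and a world size of 4.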
Note
For distributed training, TMRC uses Distributed Data Parallelism (DDP) by default. For larger models, to use Fully Sharded Data Parallelism (FSDP), set distributed_strategy to fsdp in the training section of the config file, or see the next section for how to use a custom config file.
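As a sketch, the override might look like this in the config file (the key name distributed_strategy comes from the note above; the surrounding YAML structure is an assumption):

```yaml
training:
  distributed_strategy: fsdp  # default strategy is DDP
```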
Configuration
By default, the training script uses the configuration defined in configs/training/default_train_config.yaml.
To use a custom configuration file:
python src/tmrc/core/training/train.py --config-name YOUR_CONFIG
Note
The --config-name parameter should be specified without the .yaml extension.
Tip
Configuration files should be placed in the configs/training/ directory. For example, if your config is named my_experiment.yaml, use --config-name my_experiment.
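The rule above (pass the file name without its .yaml extension) can be illustrated with a short snippet (the path is the hypothetical example from the tip):

```python
from pathlib import Path

# Hydra's --config-name is the config file's stem, i.e. the name
# without the .yaml extension.
config_file = Path("configs/training/my_experiment.yaml")
print("--config-name", config_file.stem)
```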