Getting Started
Installation
Requirements
The following packages are required:
PyTorch: Deep learning framework
wandb: Weights & Biases experiment tracking
Hydra: Configuration management
NumPy: Numerical computing
tatm: Dataset loading support
You can install the core requirements using pip:
pip install torch
pip install wandb
pip install hydra-core
pip install numpy
For dataset loading, we currently use the tatm package, also developed at the Kempner Institute. Install it using:
pip install git+https://github.com/KempnerInstitute/tatm.git
More details can be found in the tatm documentation.
Note
In future versions, we will support default dataset loading without requiring external packages.
Installation Steps
If you are using the Kempner AI cluster, load required modules:
module load python/3.12.5-fasrc01
module load cuda/12.4.1-fasrc01
module load cudnn/9.1.1.17_cuda12-fasrc01
If you are not using the Kempner cluster, install torch and CUDA dependencies following the instructions on the PyTorch website. TMRC has been tested with torch 2.5.0+cu124.
Create a Conda environment (if you are using the Kempner AI cluster, you may use mamba instead of conda):
conda create -n tmrc_env python=3.10
conda activate tmrc_env
Clone the repository:
git clone git@github.com:KempnerInstitute/tmrc.git
Install the package:
cd tmrc
pip install poetry
poetry install
Running Experiments
Login to Weights & Biases to enable experiment tracking:
wandb login
Request compute resources. For example, on the Kempner AI cluster, to request an H100 GPU:
salloc --partition=kempner_h100 --account=<fairshare account> --ntasks=1 --nodes=1 --cpus-per-task=24 --mem=375G --gres=gpu:1 --time=00-07:00:00
If you are not using the Kempner AI cluster, you can run experiments on your local machine (if you have a GPU) or on cloud services like AWS, GCP, or Azure. TMRC should automatically find the available GPU.
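This auto-detection amounts to checking for a CUDA device at startup; a minimal sketch of the pattern (illustrative only, not TMRC's actual code) looks like this:

```python
import torch

# Prefer a CUDA GPU when one is visible, otherwise fall back to CPU.
def select_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = select_device()
print(f"Training on: {device}")
```

On a machine without a GPU this simply selects the CPU, so the same script runs unchanged in both environments.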
Activate the Conda environment:
conda activate tmrc_env
Launch training:
python src/tmrc/core/training/train.py
Request compute resources. For example, on the Kempner AI cluster, to request two nodes with four H100 GPUs each:
salloc --partition=kempner_h100 --account=<fairshare account> --ntasks-per-node=4 --nodes=2 --cpus-per-task=24 --mem=375G --gres=gpu:4 --time=00-07:00:00
Activate the Conda environment:
conda activate tmrc_env
Launch training:
srun python src/tmrc/core/training/train.py
If you are not using the Kempner AI cluster, TMRC should automatically find nodes and GPUs on a Slurm-based cluster.
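Detection on a Slurm-based cluster typically relies on environment variables that srun exports to each launched task; a minimal sketch of that pattern (illustrative, not TMRC's actual code):

```python
import os

# srun sets SLURM_NTASKS and SLURM_PROCID for every launched task;
# outside Slurm these are absent, so fall back to single-process values.
world_size = int(os.environ.get("SLURM_NTASKS", "1"))
rank = int(os.environ.get("SLURM_PROCID", "0"))
print(f"rank {rank} of {world_size}")
```

Run outside Slurm, this reports a single process (rank 0 of 1); under srun, each task sees its own rank.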
Alternatively, for single-node multi-GPU training, TMRC supports torchrun. For example, to run experiments on four GPUs available on the current node:
torchrun --nproc_per_node=4 src/tmrc/core/training/train.py
Note
For distributed training, TMRC uses Distributed Data Parallel (DDP) by default. For larger models, to use Fully Sharded Data Parallel (FSDP), set distributed_strategy to fsdp in the training section of the config file, or see the next section on how to use a custom config file.
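As a sketch, the relevant fragment of such a config might look like the following (the key name follows the description above; check default_train_config.yaml for the exact schema):

```yaml
training:
  distributed_strategy: fsdp
```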
Configuration
By default, the training script uses the configuration defined in configs/training/default_train_config.yaml.
To use a custom configuration file:
python src/tmrc/core/training/train.py --config-name YOUR_CONFIG
Note
The --config-name parameter should be specified without the .yaml extension.
Tip
Configuration files should be placed in the configs/training/ directory. For example, if your config is named my_experiment.yaml, use --config-name my_experiment.
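Because configuration is handled by Hydra, individual values can typically also be overridden on the command line without editing the file (standard Hydra override syntax; the training.distributed_strategy key shown is illustrative and depends on your config schema):

```shell
python src/tmrc/core/training/train.py --config-name my_experiment training.distributed_strategy=fsdp
```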