Getting Started
Installation
Requirements
The following packages are required:
PyTorch: Deep learning framework
wandb: Weights & Biases experiment tracking
Hydra: Configuration management
NumPy: Numerical computing
tatm: Dataset loading support
You can install the core requirements using pip:
pip install torch
pip install wandb
pip install hydra-core
pip install numpy
For dataset loading, we currently use the tatm package, also developed at the Kempner Institute. Install it using:
pip install git+https://github.com/KempnerInstitute/tatm.git
More details can be found in the tatm documentation.
Note
In future versions, we will support default dataset loading without requiring external packages.
Installation Steps
If you are using the Kempner AI cluster, load required modules:
module load python/3.12.5-fasrc01
module load cuda/12.4.1-fasrc01
module load cudnn/9.1.1.17_cuda12-fasrc01
If you are not using the Kempner cluster, install torch and CUDA dependencies following the instructions on the PyTorch website. TMRC has been tested with torch 2.5.0+cu124.
Create a Conda environment (if you are using the Kempner AI cluster, you may use mamba instead of conda):
conda create -n tmrc_env python=3.10
conda activate tmrc_env
Clone the repository:
git clone git@github.com:KempnerInstitute/tmrc.git
Install the package:
cd tmrc
pip install poetry
poetry install
Running Experiments
Login to Weights & Biases to enable experiment tracking:
wandb login
Request compute resources. For example, on the Kempner AI cluster, to request an H100 GPU:
salloc --partition=kempner_h100 --account=<fairshare account> --ntasks=1 --nodes=1 --cpus-per-task=24 --mem=375G --gres=gpu:1 --time=00-07:00:00
If you are not using the Kempner AI cluster, you can run experiments on your local machine (if you have a GPU) or on cloud services like AWS, GCP, or Azure. TMRC should automatically find the available GPU.
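This auto-detection amounts to checking for a CUDA device at startup; a minimal sketch of the pattern (illustrative only, not TMRC's actual code) looks like this:

```python
import torch

# Prefer a CUDA GPU when one is visible, otherwise fall back to CPU.
def select_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = select_device()
print(f"Training on: {device}")
```

On a machine without a GPU this simply selects the CPU, so the same script runs unchanged in both environments.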
Activate the Conda environment:
conda activate tmrc_env
Launch training:
python src/tmrc/core/training/train.py
Request compute resources. For example, on the Kempner AI cluster, to request two nodes with four H100 GPUs each:
salloc --partition=kempner_h100 --account=<fairshare account> --ntasks-per-node=4 --nodes=2 --cpus-per-task=24 --mem=375G --gres=gpu:4 --time=00-07:00:00
Activate the Conda environment:
conda activate tmrc_env
Launch training:
srun python src/tmrc/core/training/train.py
If you are not using the Kempner AI cluster, TMRC should automatically find nodes and GPUs on a Slurm-based cluster.
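Detection on a Slurm-based cluster typically relies on environment variables that srun exports to each launched task; a minimal sketch of that pattern (illustrative, not TMRC's actual code):

```python
import os

# srun sets SLURM_NTASKS and SLURM_PROCID for every launched task;
# outside Slurm these are absent, so fall back to single-process values.
world_size = int(os.environ.get("SLURM_NTASKS", "1"))
rank = int(os.environ.get("SLURM_PROCID", "0"))
print(f"rank {rank} of {world_size}")
```

Run outside Slurm, this reports a single process (rank 0 of 1); under srun, each task sees its own rank.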
Alternatively, for single-node multi-GPU training, TMRC supports torchrun. For example, to run experiments on four GPUs available on the current node:
torchrun --nproc_per_node=4 src/tmrc/core/training/train.py
Note
For distributed training, TMRC uses Distributed Data Parallel (DDP) by default. For larger models, to use Fully Sharded Data Parallel (FSDP), set distributed_strategy to fsdp in the training section of the config file, or see the next section on how to use a custom config file.
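As a sketch, the relevant fragment of such a config might look like the following (the key name follows the description above; check default_train_config.yaml for the exact schema):

```yaml
training:
  distributed_strategy: fsdp
```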
Configuration
By default, the training script uses the configuration defined in configs/training/default_train_config.yaml.
To use a custom configuration file:
python src/tmrc/core/training/train.py --config-name YOUR_CONFIG
Note
The --config-name parameter should be specified without the .yaml extension.
Tip
Configuration files should be placed in the configs/training/ directory. For example, if your config is named my_experiment.yaml, use --config-name my_experiment.
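Because configuration is handled by Hydra, individual values can typically also be overridden on the command line without editing the file (standard Hydra override syntax; the training.distributed_strategy key shown is illustrative and depends on your config schema):

```shell
python src/tmrc/core/training/train.py --config-name my_experiment training.distributed_strategy=fsdp
```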