Getting Started
===============

Installation
------------

Requirements
~~~~~~~~~~~~

The following packages are required:

* ``PyTorch``: Deep learning framework
* ``wandb``: Weights & Biases experiment tracking
* ``Hydra``: Configuration management
* ``NumPy``: Numerical computing
* ``tatm``: Dataset loading support

You can install the core requirements using pip::

    pip install torch
    pip install wandb
    pip install hydra-core
    pip install numpy

For dataset loading, we currently use the ``tatm`` package, also developed at the Kempner Institute. Install it using::

    pip install git+https://github.com/KempnerInstitute/tatm.git

More details about ``tatm`` can be found in the `tatm repository <https://github.com/KempnerInstitute/tatm>`_.

.. note::

   In future versions, we will support default dataset loading without requiring external packages.

Installation Steps
~~~~~~~~~~~~~~~~~~

1. If you are using the Kempner AI cluster, load the required modules::

      module load python/3.12.5-fasrc01
      module load cuda/12.4.1-fasrc01
      module load cudnn/9.1.1.17_cuda12-fasrc01

   If you are not using the Kempner AI cluster, install torch and CUDA dependencies following the instructions on the `PyTorch website <https://pytorch.org/get-started/locally/>`_. TMRC has been tested with torch ``2.5.0+cu124``.

2. Create a Conda environment (if you are using the Kempner AI cluster, you may use ``mamba`` instead of ``conda``)::

      conda create -n tmrc_env python=3.10
      conda activate tmrc_env

3. Clone the repository::

      git clone git@github.com:KempnerInstitute/tmrc.git

4. Install the package::

      cd tmrc
      pip install poetry
      poetry install

Running Experiments
~~~~~~~~~~~~~~~~~~~

1. Log in to Weights & Biases to enable experiment tracking::

      wandb login

.. tabs::

   .. tab:: Single-GPU Training

      2. Request compute resources. For example, on the Kempner AI cluster, to request an H100 GPU::

            salloc --partition=kempner_h100 --account=<account_name> --ntasks=1 --nodes=1 --cpus-per-task=24 --mem=375G --gres=gpu:1 --time=00-07:00:00

         If you are not using the Kempner AI cluster, you can run experiments on your local machine (if you have a GPU) or on cloud services such as AWS, GCP, or Azure. TMRC should automatically find the available GPU.

      3. Activate the Conda environment::

            conda activate tmrc_env

      4. Launch training::

            python src/tmrc/core/training/train.py

   .. tab:: Multi-Node Multi-GPU Training

      2. Request compute resources. For example, on the Kempner AI cluster, to request two nodes with four H100 GPUs each::

            salloc --partition=kempner_h100 --account=<account_name> --ntasks-per-node=4 --nodes=2 --cpus-per-task=24 --mem=375G --gres=gpu:4 --time=00-07:00:00

      3. Activate the Conda environment::

            conda activate tmrc_env

      4. Launch training::

            srun python src/tmrc/core/training/train.py

         If you are not using the Kempner AI cluster, TMRC should automatically find nodes and GPUs on any Slurm-based cluster. For single-node multi-GPU training, TMRC also supports ``torchrun``. For example, to run experiments on four GPUs available on the current node::

            torchrun --nproc_per_node=4 src/tmrc/core/training/train.py

.. note::

   For distributed training, TMRC uses Distributed Data Parallel (DDP) by default. For larger models, you can use Fully Sharded Data Parallel (FSDP) by setting ``distributed_strategy`` to ``fsdp`` in the ``training`` section of the config file; see the next section for how to use a custom config file.
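As a minimal sketch of that change (only ``distributed_strategy`` and its location under ``training`` are confirmed above; the surrounding layout is assumed), the relevant part of a config might look like::

    # hypothetical excerpt of a training config
    training:
      # DDP is the default strategy; fsdp enables Fully Sharded Data Parallel
      distributed_strategy: fsdp

Since TMRC uses Hydra for configuration, the same switch can likely be made from the command line without editing any file, e.g. ``python src/tmrc/core/training/train.py training.distributed_strategy=fsdp`` (assuming standard Hydra override syntax and that key path).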
Configuration
^^^^^^^^^^^^^

By default, the training script uses the configuration defined in ``configs/training/default_train_config.yaml``. To use a custom configuration file::

    python src/tmrc/core/training/train.py --config-name YOUR_CONFIG

.. note::

   The ``--config-name`` parameter should be specified without the ``.yaml`` extension.

.. tip::

   Configuration files should be placed in the ``configs/training/`` directory. For example, if your config is named ``my_experiment.yaml``, use ``--config-name my_experiment``.
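For illustration, a minimal custom config might compose the default config and override a few values. The keys below other than ``distributed_strategy`` are hypothetical placeholders, not the actual schema of ``default_train_config.yaml``::

    # configs/training/my_experiment.yaml -- hypothetical sketch
    defaults:
      - default_train_config  # start from the default config (standard Hydra composition)
      - _self_                 # apply the values in this file on top of the defaults

    training:
      distributed_strategy: fsdp  # assumed key, per the distributed-training note above

You would then launch with ``python src/tmrc/core/training/train.py --config-name my_experiment``.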