# Getting Started

## Installation

### Requirements

- Python 3.10 is the minimum supported version.
- `tatm` depends on PyTorch, which may in turn require CUDA for GPU support. We recommend installing PyTorch (and CUDA, if needed) before installing `tatm`. Instructions for installing PyTorch can be found [here](https://pytorch.org/get-started/locally/).

### Installing from GitHub

To install the latest stable version of `tatm` from GitHub, run the following command:

```bash
pip install git+ssh://git@github.com/KempnerInstitute/tatm.git@main
```

To install the latest development version of `tatm` from GitHub, run the following command:

```bash
pip install git+ssh://git@github.com/KempnerInstitute/tatm.git@dev
```

To install a specific past version of `tatm`, replace `main` or `dev` with the desired version tag (e.g. `v0.1.0`).

### Installing from PyPI

The package is not yet available on PyPI. Stay tuned for updates!

## Loading Tokenized Data with `tatm` for Use with PyTorch

The example code below shows how to create a PyTorch `DataLoader` from a tokenized dataset for use with a model.

```{note}
If your site is set up with a metadata backend, you can use semantic names for the dataset instead of the path to the tokenized data. See [](admin_docs/metadata_store_setup.md) for more information.
```

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

from tatm.data import get_dataset, torch_collate_fn

# Replace the placeholder path with the location of your tokenized dataset.
tatm_dataset = get_dataset("/path/to/tokenized/dataset", context_length=1024)

len(tatm_dataset)  # number of examples in the dataset
# 35651584
tatm_dataset.num_tokens()
# 36507222016
tatm_dataset.num_files()
# 34
tatm_dataset.vocab_size
# 32100

tatm_dataset[3]
# Note that the output will vary depending on the dataset and the tokenization
# process, as the order in which documents are tokenized may vary.
# TatmMemmapDatasetItem(
#     token_ids=array([    7,    16,     8, ..., 14780,     8,  2537], dtype=uint16),
#     document_ids=array([0, 0, 0, ..., 1, 1, 1], dtype=uint16)
# )

dataloader = DataLoader(tatm_dataset, batch_size=4, collate_fn=torch_collate_fn)
print(next(iter(dataloader)))
# {'token_ids': tensor([[    3,     2, 14309,  ...,  1644,  4179,    16],
#         [ 3731,  3229,     2,  ...,    15,     2,     3],
#         [    2, 14309,     2,  ...,   356,     5, 22218],
#         [    7,    16,     8,  ..., 14780,     8,  2537]], dtype=torch.uint16),
#  'document_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
#         [0, 0, 0,  ..., 0, 0, 0],
#         [0, 0, 0,  ..., 0, 0, 0],
#         [0, 0, 0,  ..., 1, 1, 1]], dtype=torch.uint16)}
```
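Note that the collated batches above are `torch.uint16` tensors, while most PyTorch modules that index by token (for example `torch.nn.Embedding`) expect `torch.long` indices, so you will typically cast inside your training loop. Below is a minimal sketch of consuming these batches; the toy model, optimizer, and next-token shift are illustrative assumptions (not part of `tatm`), and the cast to `torch.long` is the relevant detail:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from tatm.data import get_dataset, torch_collate_fn

# Placeholder path, as in the example above.
tatm_dataset = get_dataset("/path/to/tokenized/dataset", context_length=1024)
dataloader = DataLoader(tatm_dataset, batch_size=4, collate_fn=torch_collate_fn)

# Hypothetical toy model: embed tokens, then project back to the vocabulary.
vocab_size = tatm_dataset.vocab_size
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for batch in dataloader:
    # Cast from torch.uint16 to torch.long before using the tokens as indices.
    tokens = batch["token_ids"].to(torch.long)
    # Shift by one position for next-token prediction.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    break  # a single step is shown for illustration
```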
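If your site has a metadata backend configured (see the note above), the same workflow accepts a semantic dataset name in place of a filesystem path. A minimal sketch, assuming a dataset registered under the hypothetical name `dolma`:

```python
from torch.utils.data import DataLoader

from tatm.data import get_dataset, torch_collate_fn

# "dolma" is a hypothetical name; use whatever name your site's
# metadata backend has registered for the tokenized dataset.
tatm_dataset = get_dataset("dolma", context_length=1024)
dataloader = DataLoader(tatm_dataset, batch_size=4, collate_fn=torch_collate_fn)
```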