Getting Started

Installation

Requirements

  • Python 3.10 is the minimum supported version.

  • tatm depends on PyTorch, which may in turn depend on CUDA for GPU support. It is recommended to pre-install both PyTorch and CUDA before installing tatm, as shown in the example below. Instructions for installing PyTorch can be found on the PyTorch website.
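
A minimal example of pre-installing the default PyTorch wheel with pip is shown below; for a build matched to a specific CUDA version, follow the official PyTorch install instructions instead.

pip install torch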

Installing from GitHub

To install the latest stable version of tatm from GitHub, run the following command:

pip install git+ssh://git@github.com/KempnerInstitute/tatm.git@main

To install the latest development version of tatm from GitHub, run the following command:

pip install git+ssh://git@github.com/KempnerInstitute/tatm.git@dev

For a specific past version of tatm, replace main or dev with the desired version tag (e.g., v0.1.0).
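
For example, to install version v0.1.0, run the following command:

pip install git+ssh://git@github.com/KempnerInstitute/tatm.git@v0.1.0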

Installing from PyPI

The package is not yet available on PyPI. Stay tuned for updates!

Loading Tokenized Data with tatm for Use with PyTorch

In the example code below, we show how to create a PyTorch DataLoader from a tokenized dataset for use with a model.

Note

If your site is set up with a metadata backend, you can use semantic names for the dataset instead of the path to the tokenized data. See Metadata Store Setup for more information.
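
For example, with a metadata backend configured, the dataset could be referenced by a registered name rather than a path. The sketch below uses "my_corpus" as a hypothetical name purely for illustration.

from tatm.data import get_dataset

# Hypothetical semantic dataset name, resolved via the metadata backend.
tatm_dataset = get_dataset("my_corpus", context_length=1024)

The main example below uses an explicit filesystem path instead.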

import numpy as np
import torch
from torch.utils.data import DataLoader

from tatm.data import get_dataset, torch_collate_fn
tatm_dataset = get_dataset("<PATH TO TATM TOKENIZED DATA>", context_length=1024)
len(tatm_dataset) # number of examples in the dataset
# 35651584
tatm_dataset.num_tokens() # total number of tokens in the dataset
# 36507222016
tatm_dataset.num_files() # number of data files backing the dataset
# 34
tatm_dataset.vocab_size # size of the tokenizer vocabulary
# 32100
tatm_dataset[3] # retrieve a single example by index
# Note that the output will vary by dataset and tokenization run, since the order in which documents are tokenized may differ.
# TatmMemmapDatasetItem(
#    token_ids=array([    7,    16,     8, ..., 14780,     8,  2537], dtype=uint16), 
#    document_ids=array([0, 0, 0, ..., 1, 1, 1], dtype=uint16)
# )

dataloader = DataLoader(tatm_dataset, batch_size=4, collate_fn=torch_collate_fn)
print(next(iter(dataloader)))
# {'token_ids': tensor([[    3,     2, 14309,  ...,  1644,  4179,    16],
#         [ 3731,  3229,     2,  ...,    15,     2,     3],
#         [    2, 14309,     2,  ...,   356,     5, 22218],
#         [    7,    16,     8,  ..., 14780,     8,  2537]], dtype=torch.uint16), 
#    'document_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
#         [0, 0, 0,  ..., 0, 0, 0],
#         [0, 0, 0,  ..., 0, 0, 0],
#         [0, 0, 0,  ..., 1, 1, 1]], dtype=torch.uint16)}
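
The batches above are returned as uint16 tensors, so they typically need to be cast to torch.long before being used as indices into an embedding layer. The sketch below continues from the code above and shows that cast; the model forward pass is indicated only as a placeholder comment and is not part of tatm.

for batch in dataloader:
    # Cast from uint16 to int64 so the token ids can index an embedding table.
    token_ids = batch["token_ids"].long()
    document_ids = batch["document_ids"].long()
    # A model forward pass would go here, e.g. logits = model(token_ids)
    break  # inspect only the first batch in this example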