tatm.data Data Module API

The tatm.data module provides tools for accessing and manipulating data. It is designed to be modular and extensible, so that new datasets and data processing tools can be integrated easily.

class tatm.data.TatmData(metadata: TatmDataMetadata, *args, **kwargs)

Bases: ABC

Generic dataset class that provides an interface for accessing multiple types of datasets.

Parameters:

metadata – Metadata object.

abstract classmethod from_metadata(metadata: TatmDataMetadata) -> TatmData

Create a dataset object from metadata.

Parameters:

metadata – Metadata object.

Returns:

Dataset object.

Return type:

TatmData

get_source() -> str

Return a string representing the source of the data. By default, this is the name of the dataset.

Returns:

Source of the data.

Return type:

str

abstract initialize(*args, **kwargs)

Initialize the dataset.
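
The interface above can be sketched as follows. This is only an illustration under stated assumptions: the classes are local stand-ins that are not imported from tatm, and ToyMetadata, ToyData, and ListData are hypothetical names. It shows how a subclass might supply the two abstract members while inheriting the default get_source behaviour.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToyMetadata:
    """Hypothetical stand-in for TatmDataMetadata (only a name field)."""
    name: str


class ToyData(ABC):
    """Hypothetical stand-in mirroring the abstract interface described above."""

    def __init__(self, metadata: ToyMetadata):
        self.metadata = metadata

    @classmethod
    @abstractmethod
    def from_metadata(cls, metadata: ToyMetadata) -> "ToyData":
        """Build a dataset object from metadata."""

    @abstractmethod
    def initialize(self, *args, **kwargs) -> None:
        """Prepare the dataset for use."""

    def get_source(self) -> str:
        # Default behaviour described above: the dataset name.
        return self.metadata.name


class ListData(ToyData):
    """Concrete subclass backed by an in-memory list (illustrative only)."""

    @classmethod
    def from_metadata(cls, metadata: ToyMetadata) -> "ListData":
        obj = cls(metadata)
        obj.initialize()
        return obj

    def initialize(self, *args, **kwargs) -> None:
        # Assumption: initialization is where the data actually gets loaded.
        self.rows = ["first document", "second document"]


data = ListData.from_metadata(ToyMetadata(name="toy"))
print(data.get_source())
```

Because the base class declares abstract members, instantiating ToyData directly raises TypeError; only subclasses that implement both members can be created.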

class tatm.data.TatmDataMetadata(*, name: str, dataset_path: str, description: str, date_downloaded: str, download_source: str, data_content: DataContentType, content_field: str = 'text', corpuses: List[str] = <factory>, corpus_separation_strategy: CorpusSeparationStrategy | None = None, corpus_data_dir_parent: str | None = None, tokenized_info: TokenizedMetadataComponenet | None = None)

Bases: object

Generic Dataset Metadata Class holding information about a dataset.

Raises:

ValueError – Raises a ValueError if the data_content value is invalid.

as_json()

Return the metadata as a JSON string.

Returns:

Metadata as a JSON string.

Return type:

str
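
As a rough illustration of what serializing metadata to a JSON string can look like, the sketch below uses a local dataclass with a few of the fields listed in this section. ToyMetadata is a hypothetical stand-in, and the library's own implementation may differ, for example in how it encodes enum-valued fields.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ToyMetadata:
    """Hypothetical stand-in with a few of the fields listed in this section."""
    name: str
    dataset_path: str
    content_field: str = "text"
    corpuses: List[str] = field(default_factory=list)

    def as_json(self) -> str:
        # Convert the dataclass to a plain dict, then serialize it.
        return json.dumps(asdict(self))


meta = ToyMetadata(name="toy", dataset_path="/data/toy", corpuses=["en"])
encoded = meta.as_json()
print(encoded)
```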

as_yaml()

content_field: str = 'text'

Field in the dataset that contains the content.

corpus_data_dir_parent: str = None

Parent directory of the corpus data directories, for use with corpus_separation_strategy='data_dirs'. If None, each corpus name is assumed to map to a directory at the top level of the dataset path.

corpus_separation_strategy: CorpusSeparationStrategy = None

Strategy for separating corpuses in the dataset: data_dirs or configs. With data_dirs, a corpus is selected through the data_dir parameter of datasets.load_dataset; with configs, it is selected through the name parameter.
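
The mapping from strategy to load_dataset keyword can be sketched as below. corpus_kwargs is a hypothetical helper, not part of tatm, and the way the parent directory is joined to the corpus name is an assumption based on the description of corpus_data_dir_parent.

```python
import posixpath
from typing import Dict, Optional


def corpus_kwargs(
    strategy: str, corpus: str, data_dir_parent: Optional[str] = None
) -> Dict[str, str]:
    """Hypothetical helper: build the keyword argument that selects a corpus."""
    if strategy == "data_dirs":
        # Corpus selected through load_dataset's data_dir parameter.
        if data_dir_parent is None:
            # Assumption: the corpus name is a top-level directory.
            return {"data_dir": corpus}
        return {"data_dir": posixpath.join(data_dir_parent, corpus)}
    if strategy == "configs":
        # Corpus selected through load_dataset's name parameter.
        return {"name": corpus}
    raise ValueError(f"unknown corpus separation strategy: {strategy!r}")


print(corpus_kwargs("data_dirs", "en", data_dir_parent="data"))
print(corpus_kwargs("configs", "en"))
```

The resulting dictionary could then be unpacked into a call such as datasets.load_dataset(path, **kwargs), which is left out here so the sketch runs without extra dependencies.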

corpuses: List[str]

List of corpuses in the dataset.

data_content: DataContentType

Type of data in the dataset.

dataset_path: str

Path to the dataset.

date_downloaded: str

Date the dataset was downloaded.

description: str

Description of the dataset.

download_source: str

Source of the dataset.

classmethod from_directory(directory: str | Path)

Load metadata from a directory containing a metadata file named metadata.yaml or metadata.json.
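
A minimal sketch of the directory lookup, assuming the two file names are tried in a fixed order. find_metadata_file is a hypothetical helper, and the order in which the library actually checks the names is not stated in this section.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def find_metadata_file(directory) -> Optional[Path]:
    """Hypothetical helper: return metadata.yaml or metadata.json if present."""
    directory = Path(directory)
    # Assumption: YAML is checked before JSON.
    for name in ("metadata.yaml", "metadata.json"):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Write a small JSON metadata file, then locate and read it back.
    (Path(tmp) / "metadata.json").write_text(json.dumps({"name": "toy"}))
    found = find_metadata_file(tmp)
    found_name = found.name if found else None
    loaded = json.loads(found.read_text()) if found else None

print(found_name)
```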

classmethod from_file(path: str | Path)

Load metadata from a file, either JSON or YAML.

classmethod from_json(json_path)

Create a Metadata object from a JSON file.

classmethod from_metadata_store(name: str)

classmethod from_yaml(yaml_path)

name: str

Name of the dataset.

to_json(filename)

Write the metadata to a JSON file.

Parameters:

filename (str) – The path of the file to write the metadata to.

to_yaml(filename)

tokenized_info: TokenizedMetadataComponenet = None

Metadata for tokenized data.

class tatm.data.TatmMemmapDataset(file_prefix: str, context_length: int, dtype: str = 'uint16', chunked: bool = True, file_suffix: str = 'bin', eos_token: int = 1, token_output_format: TokenOutputFormat = TokenOutputFormat.TORCH, vocab_size: int | None = None, create_doc_ids: bool = True, create_doc_mask: bool = False)

Bases: TatmDataset

Class for presenting tatm tokenized datasets to modelling frameworks.

num_files()

Get the number of files in the dataset.

num_tokens()

Get the number of tokens in the dataset.
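
The defaults dtype='uint16' and file_suffix='bin' suggest that token files are flat arrays of fixed-width integers. Under that assumption, a token count follows from the file size, as sketched below. The file name, the token values, and the split into non-overlapping windows of context_length tokens are all hypothetical; this section does not say how the library forms its examples.

```python
import os
import tempfile
from array import array

# "H" is an unsigned short, 16 bits wide on common platforms.
tokens = array("H", [5, 6, 1, 7, 8, 9, 1, 4])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens_0.bin")  # hypothetical file name
    with open(path, "wb") as f:
        tokens.tofile(f)
    # Token count = file size in bytes divided by bytes per token.
    n_tokens = os.path.getsize(path) // tokens.itemsize

context_length = 4
# Assumption: examples are non-overlapping windows of context_length tokens.
n_windows = n_tokens // context_length
print(n_tokens, n_windows)
```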

class tatm.data.TatmTextData(metadata: TatmDataMetadata, *args, **kwargs)

Bases: TatmData

Text dataset class that provides an interface for accessing text datasets.

Parameters:

metadata – Metadata object.

classmethod from_metadata(metadata: TatmDataMetadata, corpus=None, split='train') -> TatmTextData

Create a TatmTextData object from metadata.

Parameters:

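  • metadata – Metadata object.

  • corpus – Corpus to load. Defaults to None.

  • split – Dataset split to load. Defaults to 'train'.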

Returns:

Text dataset object.

Return type:

TatmTextData

get_source() -> str

Returns the name of the dataset. If a corpus is specified, the name of the corpus is returned appended to the dataset name with a colon.

Returns:

Data source in the format “name:corpus”.

Return type:

str
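
The source format can be illustrated with a small stand-alone function. format_source is a hypothetical helper that reproduces the behaviour described above; it is not the library's own code.

```python
from typing import Optional


def format_source(name: str, corpus: Optional[str] = None) -> str:
    """Hypothetical helper reproducing the name:corpus format described above."""
    if corpus is None:
        # No corpus specified: the source is just the dataset name.
        return name
    return f"{name}:{corpus}"


print(format_source("toy"))
print(format_source("toy", "en"))
```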

initialize(corpus: str | None = None, split: str = 'train')

Initialize the dataset.

Parameters:

  • corpus – Corpus to load. Defaults to None.

  • split – Dataset split to load. Defaults to 'train'.

tatm.data.get_data(identifier: str | TatmDataMetadata) -> TatmData

Get a dataset object from an identifier.

Parameters:

identifier – Identifier for the dataset.

Returns:

Dataset object.

Return type:

TatmData

tatm.data.get_dataset(metadata: str | TatmDataMetadata, **kwargs) -> TatmDataset

Get the dataset object from the metadata.

Parameters:
  • metadata – The metadata object, or a path to a metadata file or a directory containing a metadata file.

  • **kwargs – Additional arguments to pass to the dataset constructor.

Returns:

The dataset object.

Return type:

TatmDataset

tatm.data.torch_collate_fn(batch: list[TatmDatasetItem]) -> dict[str, Tensor]

Collate function for torch DataLoader. Assumes that all items in the batch are of the same type.
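
The general shape of such a collate step can be sketched without torch: a list of per-item mappings becomes one mapping of batched values. In this sketch plain Python lists stand in for tensors, collate is a hypothetical function, and the field name token_ids is an assumption rather than a documented attribute of TatmDatasetItem.

```python
from typing import Dict, List


def collate(batch: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Hypothetical collate: turn a list of per-item dicts into one dict of batches."""
    if not batch:
        return {}
    keys = batch[0].keys()
    # Mirror the stated assumption that every item in the batch has the same type.
    if any(item.keys() != keys for item in batch):
        raise ValueError("all items in a batch must have the same fields")
    # Group each field's values across the batch, preserving item order.
    return {key: [item[key] for item in batch] for key in keys}


batched = collate([{"token_ids": [1, 2]}, {"token_ids": [3, 4]}])
print(batched)
```

With real tensors, the per-field lists would typically be stacked into a single tensor per field before being returned.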