tatm.data
Data Module API
The tatm.data module provides tools for accessing and manipulating data. It is designed to be modular and extensible, so that new datasets and data processing tools can be integrated easily.
- class tatm.data.TatmData(metadata: TatmDataMetadata, *args, **kwargs)
Bases:
ABC
Generic dataset class, provides interface to access multiple types of datasets.
- Parameters:
metadata – Metadata object.
- abstract classmethod from_metadata(metadata: TatmDataMetadata) TatmData
Create a dataset object from metadata.
- Parameters:
metadata – Metadata object.
- Returns:
Dataset object.
- Return type:
TatmData
- get_source() str
Return a string representing the source of the data. By default, this is the name of the dataset.
- Returns:
Source of the data.
- Return type:
str
- abstract initialize(*args, **kwargs)
Initialize the dataset.
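The interface above can be sketched with a small stand-in hierarchy. This is a minimal illustration and not the tatm source: SketchMetadata, SketchData, and InMemoryData are hypothetical names, and having from_metadata call initialize is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for TatmDataMetadata; only the name is modeled."""

    name: str


class SketchData(ABC):
    """Hypothetical stand-in mirroring the TatmData interface."""

    def __init__(self, metadata: SketchMetadata):
        self.metadata = metadata

    @classmethod
    @abstractmethod
    def from_metadata(cls, metadata: SketchMetadata) -> "SketchData":
        """Create a dataset object from metadata."""

    @abstractmethod
    def initialize(self, *args, **kwargs) -> None:
        """Prepare the underlying data for access."""

    def get_source(self) -> str:
        # By default, the source is the name of the dataset.
        return self.metadata.name


class InMemoryData(SketchData):
    """Toy subclass that implements both abstract methods."""

    @classmethod
    def from_metadata(cls, metadata: SketchMetadata) -> "InMemoryData":
        obj = cls(metadata)
        obj.initialize()  # assumption: construction also initializes
        return obj

    def initialize(self, *args, **kwargs) -> None:
        self.rows = ["first document", "second document"]


toy = InMemoryData.from_metadata(SketchMetadata(name="toy"))
```

Because both from_metadata and initialize are abstract, instantiating the base class directly raises TypeError; only subclasses that implement both can be constructed.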
- class tatm.data.TatmDataMetadata(*, name: str, dataset_path: str, description: str, date_downloaded: str, download_source: str, data_content: ~tatm.data.metadata.DataContentType, content_field: str = 'text', corpuses: ~typing.List[str] = <factory>, corpus_separation_strategy: ~tatm.data.metadata.CorpusSeparationStrategy | None = None, corpus_data_dir_parent: str | None = None, tokenized_info: ~tatm.data.metadata.TokenizedMetadataComponenet | None = None)
Bases:
object
Generic Dataset Metadata Class holding information about a dataset.
- Raises:
ValueError – Raises a ValueError if the data_content value is invalid.
- as_json()
Return the metadata as a JSON string.
- Returns:
Metadata as a JSON string.
- Return type:
str
- as_yaml()
- content_field: str = 'text'
Field in the dataset that contains the content.
- corpus_data_dir_parent: str = None
Parent directory of corpus data directories, to be used with corpus_separation_strategy='data_dirs'. If None, the corpus name is assumed to map to a directory in the top level of the dataset path.
- corpus_separation_strategy: CorpusSeparationStrategy = None
Strategy for separating corpuses in the dataset, either data_dirs or configs. With data_dirs, the corpus is selected through the data_dir parameter of the datasets.load_dataset function; with configs, it is selected through the name parameter.
- corpuses: List[str]
List of corpuses in the dataset.
- data_content: DataContentType
Type of data in the dataset.
- dataset_path: str
Path to the dataset.
- date_downloaded: str
Date the dataset was downloaded.
- description: str
Description of the dataset.
- download_source: str
Source of the dataset.
- classmethod from_directory(directory: str | Path)
Load metadata from a directory containing a metadata file named metadata.[yaml or json].
- classmethod from_file(path: str | Path)
Load metadata from a file, either JSON or YAML.
- classmethod from_json(json_path)
Create a Metadata object from a JSON file.
- classmethod from_metadata_store(name: str)
- classmethod from_yaml(yaml_path)
- name: str
Name of the dataset.
- to_json(filename)
Write the metadata to a JSON file.
- Parameters:
filename (str) – The path of the file to write the metadata to.
- to_yaml(filename)
- tokenized_info: TokenizedMetadataComponenet = None
Metadata for tokenized data.
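The relationship between corpus_separation_strategy, corpus_data_dir_parent, and the arguments of datasets.load_dataset can be sketched as follows. This is an illustration under stated assumptions, not the tatm implementation: load_dataset_kwargs is a hypothetical helper, the strategies are modeled as plain strings, and joining the parent directory and corpus name with a forward slash is assumed.

```python
import posixpath
from typing import Any, Dict, Optional


def load_dataset_kwargs(
    dataset_path: str,
    corpus: Optional[str] = None,
    strategy: Optional[str] = None,
    corpus_data_dir_parent: Optional[str] = None,
    split: str = "train",
) -> Dict[str, Any]:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"path": dataset_path, "split": split}
    if corpus is None:
        return kwargs
    if strategy == "data_dirs":
        # The corpus maps to a directory, optionally under a parent directory.
        if corpus_data_dir_parent is None:
            kwargs["data_dir"] = corpus
        else:
            kwargs["data_dir"] = posixpath.join(corpus_data_dir_parent, corpus)
    elif strategy == "configs":
        # The corpus maps to a dataset configuration name.
        kwargs["name"] = corpus
    else:
        raise ValueError(f"Unknown corpus separation strategy: {strategy!r}")
    return kwargs


by_config = load_dataset_kwargs("/data/example", "en", "configs")
by_dir = load_dataset_kwargs("/data/example", "en", "data_dirs", "corpora")
```

The resulting dictionary could then be unpacked into datasets.load_dataset(**kwargs); the sketch stops short of that call so it runs without the datasets package installed.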
- class tatm.data.TatmMemmapDataset(file_prefix: str, context_length: int, dtype: str = 'uint16', chunked: bool = True, file_suffix: str = 'bin', eos_token: int = 1, token_output_format: TokenOutputFormat = TokenOutputFormat.TORCH, vocab_size: int | None = None, create_doc_ids: bool = True, create_doc_mask: bool = False)
Bases:
TatmDataset
Class for presenting tatm tokenized datasets to modelling frameworks.
- num_files()
Get the number of files in the dataset.
- num_tokens()
Get the number of tokens in the dataset.
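To make the context_length and dtype parameters concrete, the sketch below reads fixed-size windows of uint16 tokens from a binary file through a memory map. It uses only the standard library and is not how TatmMemmapDataset is implemented: count_tokens, read_window, and the file name tokens_0.bin are hypothetical, and non-overlapping windows are an assumption.

```python
import mmap
import os
import tempfile
from array import array
from typing import List

# Size of one token in bytes; "H" is an unsigned 16-bit integer on common platforms.
TOKEN_BYTES = array("H").itemsize


def count_tokens(path: str) -> int:
    """Number of whole tokens stored in the file."""
    return os.path.getsize(path) // TOKEN_BYTES


def read_window(path: str, index: int, context_length: int) -> List[int]:
    """Return the index-th non-overlapping window of context_length tokens."""
    start = index * context_length * TOKEN_BYTES
    end = start + context_length * TOKEN_BYTES
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if index < 0 or end > len(mm):
                raise IndexError("window is out of range")
            window = array("H")
            window.frombytes(mm[start:end])  # copies only the requested slice
            return window.tolist()


# Demo: write ten tokens (0..9) and read the second window of four tokens.
with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, "tokens_0.bin")
    with open(token_path, "wb") as out:
        array("H", range(10)).tofile(out)
    total_tokens = count_tokens(token_path)
    second_window = read_window(token_path, 1, 4)
```

Memory mapping means only the bytes of the requested window are copied into Python objects, which is why this layout suits token files much larger than available memory.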
- class tatm.data.TatmTextData(metadata: TatmDataMetadata, *args, **kwargs)
Bases:
TatmData
Text dataset class, provides interface to access text datasets.
- Parameters:
metadata – Metadata object.
- classmethod from_metadata(metadata: TatmDataMetadata, corpus=None, split='train') TatmTextData
Create a TatmTextData object from metadata.
- Parameters:
metadata – Metadata object.
corpus – Corpus to load. Defaults to None.
split – Split to load. Defaults to 'train'.
- Returns:
Text dataset object.
- Return type:
TatmTextData
- get_source() str
Return the name of the dataset. If a corpus is specified, the corpus name is appended to the dataset name, separated by a colon.
- Returns:
Data source in the format “name:corpus”.
- Return type:
str
- initialize(corpus: str | None = None, split: str = 'train')
Initialize the dataset.
- Parameters:
corpus – Corpus to load. Defaults to None.
split – Split to load. Defaults to 'train'.
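The source string format described for get_source can be sketched in a few lines. format_source is a hypothetical helper written for illustration; it only mirrors the documented "name:corpus" convention and is not taken from the tatm source.

```python
from typing import Optional


def format_source(name: str, corpus: Optional[str] = None) -> str:
    """Return "name:corpus" when a corpus is given, otherwise just the name."""
    if corpus is None:
        return name
    return f"{name}:{corpus}"


with_corpus = format_source("example_dataset", "en")
without_corpus = format_source("example_dataset")
```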
- tatm.data.get_data(identifier: str | TatmDataMetadata) TatmData
Get a dataset object from an identifier.
- Parameters:
identifier – Identifier for the dataset.
- Returns:
Dataset object.
- Return type:
TatmData
- tatm.data.get_dataset(metadata: str | TatmDataMetadata, **kwargs) TatmDataset
Get the dataset object from the metadata.
- Parameters:
metadata – The metadata object, or a path to a metadata file or a directory containing a metadata file.
**kwargs – Additional arguments to pass to the dataset constructor.
- Returns:
The dataset object.
- Return type:
TatmDataset
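Since the metadata argument may be a path to a metadata file or to a directory containing one, a resolution step along the following lines is plausible. This is a sketch under assumptions: find_metadata_file is a hypothetical helper, the lookup order (YAML before JSON) is assumed, and parsing the file is left out.

```python
import tempfile
from pathlib import Path
from typing import Union

# File names follow the documented metadata.[yaml or json] convention.
METADATA_FILENAMES = ("metadata.yaml", "metadata.json")


def find_metadata_file(location: Union[str, Path]) -> Path:
    """Resolve a file or directory path to a metadata file path."""
    path = Path(location)
    if path.is_file():
        return path
    if path.is_dir():
        for filename in METADATA_FILENAMES:
            candidate = path / filename
            if candidate.is_file():
                return candidate
        raise FileNotFoundError(f"No metadata file found in {path}")
    raise FileNotFoundError(f"{path} is neither a file nor a directory")


# Demo: a directory containing only metadata.json resolves to that file.
with tempfile.TemporaryDirectory() as tmp:
    json_file = Path(tmp) / "metadata.json"
    json_file.write_text('{"name": "example"}')
    from_dir = find_metadata_file(tmp).name
    from_file = find_metadata_file(json_file).name
```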
- tatm.data.torch_collate_fn(batch: list[TatmDatasetItem]) dict[str, Tensor]
Collate function for torch DataLoader. Assumes that all items in the batch are of the same type.
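The general idea of collation can be illustrated without torch: items that share the same fields are regrouped so that each field maps to a list of per-item values. The sketch below uses plain dictionaries and lists rather than TatmDatasetItem and Tensor, and the field name token_ids is hypothetical; a torch-based version would typically stack each list into a tensor with torch.stack.

```python
from typing import Any, Dict, List


def collate_by_key(batch: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Regroup a list of same-shaped items into one list per field."""
    if not batch:
        raise ValueError("batch must contain at least one item")
    keys = batch[0].keys()
    for item in batch:
        # Mirrors the documented assumption that all items share one type.
        if item.keys() != keys:
            raise ValueError("all items in the batch must have the same fields")
    return {key: [item[key] for item in batch] for key in keys}


collated = collate_by_key(
    [
        {"token_ids": [1, 2, 3]},  # hypothetical field name
        {"token_ids": [4, 5, 6]},
    ]
)
```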