tatm CLI Reference
The tatm library includes a command-line interface (CLI) that provides a set of tools for interacting with the library. The CLI is designed to be easy to use and to provide a consistent interface for common tasks. It is built with the click library, which provides a simple way to define command-line interfaces in Python. Every command accepts a --help flag, which prints a description of the command and detailed information on its options. This document provides an overview of the commands available in the tatm CLI and how to use them.
Organization
The tatm CLI consists of top-level commands and, where needed, subcommands. This document follows the structure of the CLI.
Installation
Installing the tatm library automatically installs the CLI in your environment. See the installation documentation for more information on installing the tatm library.
Commands
tatm run
Usage: tatm run [OPTIONS] [WRAPPED_COMMAND]...
The `tatm run` command is used to wrap other tatm commands and run them in a
specified compute environment. It uses the configuration files and options
to determine how to run the command. `tatm run` will use a template submit
script along with your specified environment to run the command. It will
then submit the job to the compute environment. If you do not want to submit
the job, you can use the `--no-submit` flag. The generated submit script
will be placed in the current working directory. The submit script will be
named `tatm_{command}.submit` where `{command}` is the command you are
running unless you specify a different name with the `--submit-script`
option.
If the script is not submitted the command will print the submit command to
the console.
The WRAPPED_COMMAND argument is the command that will be run. Currently only
wrapping ray based commands on slurm is supported.
Options:
  --config, --conf TEXT           Path to configuration file or specific
                                  configuration settings. First, all config
                                  files are merged, then all config options
                                  are merged. Command line options override
                                  config file options. CLI config options
                                  should be in the format
                                  `field.subfield=value`. For list types, use
                                  dotlist notation (e.g.
                                  `field.subfield=[1,2,3]`). Note that this
                                  will override any list values in the config
                                  file. Also note that this may cause issues
                                  with your shell, so be sure to quote the
                                  entire argument.
  -N, --nodes INTEGER             Number of nodes to use for wrapped command.
  -c, --cpus-per-task, --cpus INTEGER
                                  Number of CPUs to use per task.
  --submit-script TEXT            Path to submit script to create.
  --time-limit TEXT               Time limit for the job.
  --memory, --mem TEXT            Memory to allocate for the job.
  --gpus-per-node TEXT            Number of GPUs to use per node.
  --constraints TEXT              Constraints for the job.
  -o, --log-file TEXT             Log file for the job.
  -e, --error-file TEXT           Error file for the job.
  --submit / --no-submit          Submit the job after creating the submit
                                  script. Set to False to only create the
                                  submit script.
  --help                          Show this message and exit.
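The shell warning in the --config help text comes from the square brackets in list values, which most shells interpret as a filename pattern. The following is a minimal sketch of that effect using only echo and a scratch directory. It does not call tatm, and field.subfield is just the placeholder name used in the help text.

```shell
# Work in a scratch directory so the demo leaves no files behind.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# A file whose name happens to match the bracket pattern.
touch 'field.subfield=1'

# Unquoted: the shell treats [1,2,3] as a glob and substitutes the file name.
unquoted=$(echo field.subfield=[1,2,3])

# Quoted: the argument is passed through unchanged.
quoted=$(echo 'field.subfield=[1,2,3]')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# Return to the previous directory and clean up.
cd - >/dev/null || exit 1
rm -rf "$scratch"
```

With tatm, the same rule means writing --conf 'field.subfield=[1,2,3]' rather than leaving the value unquoted.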
Example usage:
tatm run --conf slurm.partition=example --conf slurm.account=example -N 4 -c 40 tokenize --output-dir /$OUTPUT_DIR/test_tatm_out/tokenize -v /$DATADIR/redpajama-v1/
This runs the tokenize command with the specified configuration settings. It creates a 4 node, 160 CPU ray cluster, tokenizes the dataset located at /$DATADIR/redpajama-v1/, and writes the tokenized data to /$OUTPUT_DIR/test_tatm_out/tokenize using 158 workers (2 CPUs are reserved for the writer and reader processes). The submission script is created in the current working directory and named tatm_tokenize.submit. The executed command will be:
/usr/bin/sbatch --nodes 4 --cpus-per-task 40 --mem 40G --time 1-00:00:00 --partition example --account example --job-name tatm_tokenize --output tatm_tokenize.out $PWD/tatm_tokenize_job.submit
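The CPU and worker counts quoted in this example can be checked with shell arithmetic. This sketch only restates the numbers given above (4 nodes, 40 CPUs per task, 2 CPUs reserved). The variable names are illustrative and are not tatm settings.

```shell
nodes=4            # from -N 4
cpus_per_task=40   # from -c 40
reserved=2         # writer and reader processes, as stated in the example

# Total CPUs in the ray cluster, then the workers left for tokenization.
total_cpus=$((nodes * cpus_per_task))
workers=$((total_cpus - reserved))

echo "cluster CPUs: $total_cpus"
echo "tokenization workers: $workers"
```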
tatm data
The data command provides a set of subcommands for working with the data layer functionality of the library.
tatm data create-metadata
The create-metadata command kicks off an interactive process to create a metadata file for a dataset. This metadata file encodes information about the dataset, such as the location of the data files, the format of the data, and any other relevant information. The metadata file is used by the library to load and process the data.
Usage: tatm data create-metadata [OPTIONS]
Options:
  --help  Show this message and exit.
tatm tokenize
Usage: tatm tokenize [OPTIONS] [DATASETS]...
Tokenize a dataset using the tatm ray based tokenization engine. If running
in a cluster environment, it is recommended to use this command in
conjunction with the `run` command to submit the tokenization job to the
cluster.
This command will tokenize the input datasets using the specified tokenizer
and output the tokenized data to the specified output directory as a series
of binary files. The number of workers to use for tokenization can be
specified using the `--num-workers` option. If not specified, the number of
workers will be determined by the number of available CPUs to the ray
cluster.
Arguments: DATASETS: Paths to the datasets to tokenize. This command can
accept multiple datasets to tokenize. All datasets are expected to have a
tatm metadata file associated with them.
Options:
  --num-workers TEXT  Number of workers to use for tokenization
  --tokenizer TEXT    Tokenizer to use for tokenization
  --output-dir PATH   Output directory for tokenized data
  -v, --verbose       Enable verbose logging
  --file-prefix TEXT  Prefix for tokenized files
  --token-dtype TEXT  Numpy data type for tokenized files
  --help              Show this message and exit.
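For a local run without the run wrapper, a direct invocation might look like the sketch below. It uses only options listed in the help text above. The two paths and the worker count of 8 are hypothetical placeholders, and the command is guarded so that nothing runs unless tatm is on your PATH.

```shell
# Hypothetical locations; replace them with real paths.
dataset_dir="/path/to/dataset"       # expected to contain a tatm metadata file
output_dir="/path/to/tokenized_out"

# Run only when the tatm CLI is available on PATH.
if command -v tatm >/dev/null 2>&1; then
    tatm tokenize --num-workers 8 --output-dir "$output_dir" --file-prefix tokenized -v "$dataset_dir"
else
    echo "tatm not found on PATH; install the tatm library first"
fi
```

Omitting --num-workers leaves the worker count to be determined from the CPUs available to the ray cluster, as described above.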