Shared Data/Model Repository

5.3. Shared Data/Model Repository#

We host several popular ML datasets and models on the Kempner HPC cluster. This approach reduces the need for multiple transfers of the same data/model by researchers and provides a central, read-only repository for all Kempner Institute users to access for their ML workflows. Only the admin team has write access, but users can submit requests for popular data/models. After a careful review, we may place a copy in the shared data/model repository. The current path on the cluster is:

DATA_PATH=/n/holylfs06/LABS/kempner_shared/Lab/data
MODEL_PATH=/n/holylfs06/LABS/kempner_shared/Lab/models

Note

We will develop a web interface later for data and model discovery.

5.3.1. The current list of ML models#

CodeLlama
- Path: $MODEL_PATH/models--codellama--CodeLlama-7b-hf (see on HuggingFace)
  - Size: 16 G

EleutherAI
- Path: $MODEL_PATH/models--EleutherAI--pythia-160m-deduped (see on HuggingFace
  - Size: 435 M
- Path: $MODEL_PATH/models--EleutherAI--pythia-70m-deduped (see on HuggingFace
  - Size: 195 M

OpenAI
- Path: $MODEL_PATH/models--gpt2 (see on HuggingFace)
  - Size: 4.5 M

Google
- Path: $MODEL_PATH/models--t5-base (see on HuggingFace)
  - Size: 3.4 M

5.3.2. The current list of ML datasets#

c4_original
- Path: $DATA_PATH/c4_original
  - Subfolders:
    - preprocessed (434 M)
    - raw (157 G)
  - Description: The original version of the “Colossal Clean Crawled Corpus” (C4) dataset, designed for training natural language processing models.

dolma
- Path: $DATA_PATH/dolma
  - Subfolders:
    - preprocessed (6.8 T)
    - raw (5.9 T)
  - Description: Dolma is a dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.

imagenet_winter21_whole
- Path: $DATA_PATH/imagenet_winter21_whole
  - Subfolders:
    - winter21_whole.tar.gz (1.3 T)
  - Description: An updated version of the ImageNet dataset, containing a wide variety of annotated images for visual object recognition, collected during the winter of 2021.

Shared Data/Model Repository

Contents

5.3. Shared Data/Model Repository#

5.3.1. The current list of ML models#

5.3.2. The current list of ML datasets#