Home

Awesome

DataComp-LM (DCLM)

Table of Contents

Introduction

DataComp-LM (DCLM) is a comprehensive framework designed for building and training large language models (LLMs) with diverse datasets. It offers a standardized corpus of over 300T unfiltered tokens from CommonCrawl, effective pretraining recipes based on the open_lm framework, and an extensive suite of over 50 evaluations. This repository provides tools and guidelines for processing raw data, tokenizing, shuffling, training models, and evaluating their performance.

DCLM enables researchers to experiment with various dataset construction strategies across different compute scales, from 411M to 7B parameter models. Our baseline experiments show significant improvements in model performance through optimized dataset design.

Already, DCLM has enabled the creation of several high-quality datasets that perform well across scales and outperform all open datasets. Accuracy vs compute tradeoff

<p align="center"> <em><b>Developing datasets for better models that are cheaper to train.</b> Using DataComp-LM, we develop a high-quality dataset, DCLM-BASELINE, which we use to train models with strong compute performance tradeoffs. We compare on both a Core set of tasks (left) and on MMLU 5-shot (right). DCLM-BASELINE (orange) shows favorable performance relative to both closed-source models (crosses) and other open-source datasets and models (circles).</em> </p>

Submission workflow:

For more details, please refer to our paper.

Leaderboard

The DCLM leaderboard showcases the performance of models trained on various scales and datasets. The leaderboard is updated regularly with the latest submissions from the community.

Below are comparisions of our model with others in the 7B regime.

ModelParamsTokensOpen dataset?COREMMLUEXTENDED
Open weights, closed datasets
Llama27B2T49.245.834.1
DeepSeek7B2T50.748.535.3
Mistral-0.37B?57.062.745.1
QWEN-27B?57.571.950.5
Llama38B15T57.666.246.3
Gemma8B6T57.864.344.6
Phi-37B?61.069.957.9
Open weights, open datasets
Falcon7B1T44.127.425.1
OLMo-1.77B2.1T47.054.034.2
MAP-Neo7B4.5T50.257.140.4
Models we trained
FineWeb edu7B0.14T38.726.322.1
FineWeb edu7B0.28T41.937.324.5
DCLM-BASELINE7B0.14T44.138.325.0
DCLM-BASELINE7B0.28T48.950.831.8
DCLM-BASELINE7B2.6T57.163.745.4

Getting Started

To get started with DCLM, follow these steps:

  1. Clone the repository:

    git clone https://github.com/mlfoundations/DCLM.git
    cd DCLM
    
  2. Install dependencies:

    pip install -r requirements.txt
    

    Before installing the dependencies, make sure cmake, build-essential, and g++ are installed, e.g., by installing:

    apt install cmake build-essential
    apt install g++-9
    update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
    

    To download additional models and data needed for baseline reproduction, run:

    python setup.py install
    
  3. Set up your environment: DCLM uses AWS for storage and possible as a compute backend, and ray for distributed processing. Ensure you have the necessary environment variables and configurations for AWS and Ray clusters.

    We recommend the use of Python 3.10 with DCLM.

Workflow Overview and Exp Data

At a high-level, the DCLM workflow consists of five main steps: (1) raw source selection; (2) data processing; (3) tokenization and shuffling; (4) model training; (5) model evaluation. To be eligible for our leaderboards, you are required to use our scripts for steps (3)-(5). For data processing, you are welcome to use and/or build upon our code, but are also free to perfrom this step with your own implementation.

To help keep track of and run experiments, this repository uses the exp_data folder to log a record of the inputs and outputs of these steps (e.g., datasets created, models trained, and evaluations finished). Each record is called a "reference JSON" and can be thought of as an ID card for a particular input or output asset. Overall, there are four categories of reference JSONs:

For steps after raw source selection, most of our scripts require you to specify the correct input reference JSON(s) and a JSON for the output is generated automatically via Python-based code. By default, generated reference JSONs will also contain relevant pointers to the inputs that resulted in the referenced asset (e.g. a model JSON will contain a pointer to the tokenized dataset that was used to train it). Specifically reference JSONs are used as follows:

  1. Data processing and Tokenize-Shuffle both take in a relative path to an untokenized dataset JSON to detect where the input dataset is stored. Using our ray-based implementations will generate untokenized and tokenized dataset JSONs respectively.
  2. Training takes in a relative path to a tokenized dataset JSON to detect the location of the dataset and its corresponding manifest file. A model JSON will be generated.
  3. Evaluation can take the uuid of a model to automatically populate most of the evaluation script's arguments. An evaluation JSON will be generated.

If needed (e.g., when setting up a new raw source or when using your own code for data processing), you can manually create a JSON by copying the format of one of the existing files and modifying the relevant fields. When doing so, it is key to generate a new uuid (i.e., python -c "import uuid; print(uuid.uuid4().__str__())") to replace the existing one because that is the unique identifier the code uses to identify each JSON. Paths in the JSON may be local as long as they are accessible when the json is used.

(1) Selecting Raw Sources

Filtering Track

We provide multiple standardized input pools that serve as starting points for participants to apply their curation strategies. These datasets vary along two key dimensions:

To decide upon compute scale, participants should consider their available resources and refer to the training costs in the Model Training section. As for pre-processing, our main set of competition pools are subsets of DCLM-Pool, consisting of unfiltered Common Crawl data that has only undergone text extraction (via resiliparse). Meanwhile, we also provide standardized subsets of DCLM-RefinedWeb, which applies all processing steps from DCLM-Baseline to DCLM-Pool with the exception of our OH2.5 + ELI5 fastText filter. This set of pools can be used by participants who wish to more conveniently test a curation strategy as a drop-in replacement for our fastText classifier (without needing to repeat the earlier parts of our DCLM-Baseline pipeline).

Note that as of now, the DCLM-RefinedWeb pools do not match up exactly to being procesed versions of the DCLM-Pool competition pools. Hence, we do not combine their resulting submissions into the same leaderboard. For more details about downloading a specific input pool, please see the Downloading Artifacts section.

After downloading the appropriate starting pool, you will need to create a reference JSON for it if you wish to use/build upon our processing code. Most importantly, this JSON should contain the URL to the where your copy of the input pool exists.

Mixing Track

For the mixing track, participants are welcome to create and register a new source (e.g., Wikipedia, GitHub, etc.). In this case, they should follow the below steps:

(2) Processing the Data

Given a desired raw dataset, this is the key step in which you as a participant apply your own custom data curation strategies. To get started, you can (but are NOT required to) use and build upon our data processing scripts which include various filtering and deduplication operations implemented in our codebase. Specifically, these operations fall into two groups.

Ray-based filtering and cleaning

Our codebase incudes a ray-based data processing engine that is best suited for highly-parallelized application of "local operations" that independently operate on each page (e.g., filtering). The entrypoint for this engine is ray_processing/process.py and using it involves the following steps.

  1. Define a set of processing steps: Create a pipeline config YAML file specifying the operations. See our reproduction of C4 for example. Further details on defining a pipeline can be found here.

  2. Set up a Ray cluster: The data processing script relies on Ray for distributed processing of data. This cluster can be either launched on a single node (for small scale data processing) or using AWS EC2 instances. There is also work to deploy Ray on slurm setups, though this effort is still a work-in-progres.

    To launch a local cluster, use the following command:

    ray start --head --port 6379
    

    To launch a cluster using AWS EC2 instances, use the following:

    ray up <your_cluster_config>
    

    where <your_cluster_config> is a cluster configuration script that depends on your specific use case. We invite the reader to go over the Ray documentation for details on how to create this config file.

    Important: When using EC2 instances, make sure to tear down your cluster with ray down <your_cluster_config> after your job finishes, so as not to incur unnecessary costs!

    A sample config file can be seen here (make sure to adapt to your needs):

    cluster_name: test-processing
    max_workers: 2
    upscaling_speed: 1.0
    available_node_types:
        ray.head.default:
            resources: {}
            node_config:
                ImageId: ami-0c5cce1d70efb41f5
                InstanceType: i4i.4xlarge
                IamInstanceProfile:
                    # Replace 000000000000 with your IAM account 12-digit ID
                    Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler-v1
        ray.worker.default:
            min_workers: 2
            max_workers: 2
            node_config:
                ImageId: ami-0c5cce1d70efb41f5
                InstanceType: i4i.4xlarge
                IamInstanceProfile:
                    # Replace 000000000000 with your IAM account 12-digit ID
                    Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler-v1
    
    # Cloud-provider specific configuration.
    provider:
        type: aws
        region: us-west-2
        cache_stopped_nodes: False
    
    setup_commands:
        - sudo mkfs -t xfs /dev/nvme1n1
        - sudo mount /dev/nvme1n1 /tmp
        - sudo chown -R $USER /tmp
        - sudo chmod -R 777 /tmp
        - wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -O miniconda.sh
        - bash ~/miniconda.sh -f -b -p /tmp/miniconda3/
        - echo 'export PATH="/tmp/miniconda3/bin/:$PATH"' >> ~/.bashrc
        # Include your AWS CREDS here
        - echo 'export AWS_ACCESS_KEY_ID=' >> ~/.bashrc
        - echo 'export AWS_SECRET_ACCESS_KEY=' >> ~/.bashrc
        - pip install --upgrade pip setuptools wheel
        - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
        - pip install boto3==1.26.90
        - pip install s3fs==2022.11.0
        - pip install psutil
        - pip install pysimdjson
        - pip install pyarrow
        - git clone https://github.com/mlfoundations/dclm.git
        - pip install -r dclm/requirements.txt
        - cd dclm && python3 setup.py install
    
  3. Run the processing script: To run the processing script, in the case of a local cluster, simply run the following command:

    python3 ray_processing/process.py --source_ref_paths <source_json> --readable_name <name> --output_dir <s3_output_dir> --config_path <config_yaml> --source_name <source_name>
    

    When using EC2 instances, you need to connect to the cluster and then launch the command

    # In your local terminal
    ray attach <your_cluster_config>
    
    # Inside the cluster EC2 instance
    cd dclm
    export PYTHONPATH=$(pwd)
    python3 ray_processing/process.py --source_ref_paths <source_json> --readable_name <name> --output_dir <s3_output_dir> --config_path <config_yaml> --source_name <source_name>
    
  4. Monitor and tear down: You can track the progress of data processing via the global_stats.jsonl file in the output directory. If run successfully, a new untokenized dataset json should have been created in exp_data/datasets/untokenized/ that you should commit and push if you wish to save it. After the job finishes, you can tear down your cluster via ray stop (in the local cluster case) or ray down <your_cluster_config> (in the AWS EC2 case). THIS IS VERY IMPORTANT TO NOT INCUR ADDITIONAL COSTS WHEN USING EC2!

Deduplication

Our most efficient (and recommended) deduplication implementations were written in Rust instead of Python. To deduplicate the raw text as we have done in DCLM-Baseline, we use the rust tools provided in the dedup subdirectory, which icludes several programs for inter-document fuzzy deduplication, i.e., identifying near-duplicates across documents in the corpus. In DCLM-Baseline, we specifically use BFF (dedup/bff). Specific instructions to run deduplication are contained in the README in each of the directories containing the rust tools.

Unfortunately, as of now, these rust implementations are not integrable with the ray-based pipelines described above (i.e., cannot be added to the YAML configs). In contrast, we also did create tooling built with ray to identify exact content and URL duplicates (ray_processing/dedup_jsonl.py). This can be integrated into ray-based workflows above but we do not use this form of deduplication in DCLM-Baseline.

(3) Tokenization and Shuffling

Once you have finished all of your data curation, you can now prepare your dataset for evaluation by first tokenizing and shuffling it in preparation for training. We support both a rust-based and ray-based tokenize shuffle, though we recommend the rust based approach for most workflows because it is single machine and is more efficient. The ray-based approach is still useful for datasets that are too large to be processed on a single machine.

Rust Based

System requirements:

While the code supports reading from and writing to AWS, it is sometimes unreliable and we strongly recommend copying your data from S3 to local using AWS CLI or s5cmd and using local paths.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs > rustup.sh
bash rustup.sh -y
source ~/.bashrc

# from DCLM folder
cd rust_processing/tokshuf-rs
cargo build --release

# if needed, copy dataset from s3 to local

cargo run --release -- \
--input path/to/raw/dataset \
--local-cell-dir tmp/path/to/storage/for/local/cells \
--output path/to/output/location \
--tokenizer "EleutherAI/gpt-neox-20b" \ #other supported option is "meta-llama/Meta-Llama-3-8B"
--seqlen 2049 \
--wds-chunk-size 8192 \
--num-local-cells 512 # 512 is a good compromise, but might need to raise this much higher for really large datasets

Example command:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs > rustup.sh
bash rustup.sh -y
source ~/.bashrc

# from DCLM folder
cd rust_processing/tokshuf-rs
cargo build --release

aws s3 cp --recursive s3://commoncrawl/contrib/datacomp/DCLM-baseline/global-shard_03_of_10/local-shard_1_of_10/ dclm_local
mkdir tokshuf_tmp
mkdir dclm_tokshuf

cargo run --release -- \
--input dclm_local \
--local-cell-dir tokshuf_tmp \
--output dclm_tokshuf \
--tokenizer "EleutherAI/gpt-neox-20b" \
--seqlen 2049 \
--wds-chunk-size 8192 \
--num-local-cells 512

Since this is rust based code, this will not automatically generate a dataset json in exp_data. You can copy an existing file like exp_data/datasets/tokenized/rw_v2_w_substr_cc_v3_f0.15_resiliparse_try3_100_nodes.json to exp_data/datasets/tokenized/dclm_rust_tok_shuf_test.json, and make sure to change the relevant fields (most important are uuid, name, dataset_url, manifest_url).

Ray Based

After processing the raw text, you should convert it into tokenized datasets and perform shuffling for training:

  1. Set up a Ray cluster: Set up a Ray cluster in the same way as the processing step.

  2. Run the tokenize and shuffle script:

    python ray_processing/tokenize_shuffle.py --source_ref_paths <source_jsons> --readable_name <name> --output <s3_output_dir> --content_key text --do_sample --default_dataset_yaml <mixing_yaml>
    
  3. Tear down: Tear down the Ray cluster as in the processing step.

The tokenize_shuffle.py script creates a dataset in webdataset format, along with a manifest.jsonl file. This file is required by the training script, and it contains information on the number of sequences inside each shard of the dataset. If needed, this manifest file can also be created manually, via the following command:

python -m open_lm.utils.make_wds_manifest --data-dir <tokenized_data_dir>

(4) Model Training

To train a model using the tokenized dataset and one of our fixed training recipes:

  1. Run the training script:

    torchrun --nproc-per-node <num_gpus> -m training.train -- --scale <scale> --data-config <tokenized_json> --logs <log_dir> --attn-name torch_attn [--remote-sync <s3_bucket>] [--report-to-wandb] [--num-checkpoints checkpoints] [--multiple-data-passes] [--acc 4] [--torchcompile]
    

    Argument explanations:

    • scale can be found in training/configs (do not include path and .json)
    • data-config is dataset in exp_data: exp_data/datasets/tokenized (include path and .json)
    • logs is where you want local logs to be written
    • attn-name specifies the attention implementation (torch_attn recommended)
    • remote-sync is where the checkpoints are written to on s3
    • report-to-wandb logs to wandb
    • num-checkpoints sets number of checkpoints to save, best effort and may not be actually this number
    • multiple-data-passes enables training on multiple epochs for the dataset; you may need to adjust num-checkpoints if the dataset is too small and does not contain enough files to properly split across gpus/workers/checkpoints
    • acc sets gradient accumulation, which is automatically specified by the scale; however you may want to manually adjust based on resources available
    • torchcompile uses torchcompile to speed up training, and is available for scales that include "fast"

    Example command:

     torchrun --nproc-per-node 8 -m training.train -- --scale 1b_1x_fast --data-config exp_data/datasets/tokenized/rw_v2_w_substr_cc_v3_f0.15_resiliparse_try3_100_nodes.json --logs rw_training_local_logs --attn-name torch_attn --torchcompile
    

    Note that this example will not work until you change the dataset_url and manifest_url in exp_data/datasets/tokenized/rw_v2_w_substr_cc_v3_f0.15_resiliparse_try3_100_nodes.json.

You can expect the following training times per track:

ScaleModel parametersTrain tokensTrain FLOPsTrain H100 hoursPool size
400M-1x412M8.2B2.0e1926137B
1B-1x1.4B28B2.4e202401.64T
1B-5x1.4B138B1.2e2112008.20T
7B-1x6.9B138B5.7e2137007.85T
7B-2x6.9B276B1.1e22730015.7T
  1. Monitor and manage your training jobs: Use slurm sbatch scripts or Sagemaker for running experiments on various compute infrastructures.

(5) Model Evaluation

Evaluation yamls are found in the "eval" folder and they determine which tasks to evaluate. For example, "eval/mmlu_and_lowvar.yaml" contains the tasks to be able to compute the core metric and MMLU, while "eval/heavy.yaml" runs additional tasks so that aggregate can also be computed.

Evaluate trained models using the following methods:

  1. Preferred Method: This method uses the UUID from the json that is automatically created from training which can be found in exp_data/models.

    python tools/eval_expdb.py --num_gpus <num_gpus> --no_skip --output_dir <output_dir> --eval_yaml "eval/<eval_yaml>" -f "uuid=<model_uuid>" --skip_perplexity
    

    Example command: If the model we trained saved its json at exp_data/models/rw_original-open_lm_1b-5.0.json, the UUID is c014f9d6-51d0-429d-9a3e-fe82e53c37fd. Note that this will not actually run because the checkpoint_url and params_url are not publicly accessible; but the automatically generated json for your model will have the correct and accessible paths.

     python tools/eval_expdb.py --num_gpus 8 --no_skip --output_dir exp_data/evals/ --eval_yaml "eval/mmlu_and_lowvar.yaml" -f "uuid=c014f9d6-51d0-429d-9a3e-fe82e53c37fd" --skip_perplexity
    
  2. Direct Evaluation: This method requires manually specifying the checkpoint path and model params file (both are outputs from training), as well as specifying the model config (training/open_lm_configs).

    torchrun --nproc_per_node <num_gpus> eval/eval_openlm_ckpt.py --checkpoint <checkpoint> --eval-yaml <eval_yaml> --config <model_params_file> --model <open_lm_config> --output-file <output_file_path>
    

We also support evaluation on existing huggingface models:

    torchrun --nproc_per_node 8 eval/eval_openlm_ckpt.py --hf-model allenai/OLMo-1B-0724-hf --tokenizer allenai/OLMo-1B-0724-hf --eval-yaml "eval/mmlu_and_lowvar.yaml" --output-file exp_data/evals/olmo_eval_mmlu_and_lowvar.json --donot-compute-perplexity

Submission

When you finish training and evaluating your model, a model eval JSON file has been generated and is at exp_data/evals.

You can now open a pull request to the main repository to share your results with the team and submit it to the leaderboard.

Downloading Artifacts

Datasets

We provide multiple datasets, both as starting points for each of the competition scales, as well as the results of our processing pipeline.

As an important note, downloads from the Common Crawl bucket do require having a valid set of AWS account credentials (i.e., AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, actually copying the data should not incur costs to your account.

Pretrained Models

We provide links to models pretrained using our dataset via the DCLM collection on Huggingface, found here. These models can be downloaded and evaluated using the OpenLM library.

Example Commands to Train 1B model with DCLM baseline

Here we provide a quick set of commands for setting up, tokenize shuffle, training, and evaluation. For filtering, please see the relevant sections above. For specific details about the below commands, also refer to the relevant sections above.

Setup (with Conda)

conda create -n dclm python=3.10
conda activate dclm

git clone https://github.com/mlfoundations/DCLM.git
cd DCLM
export PYTHONPATH="$(pwd):$PYTHONPATH"

pip install -r requirements.txt

Tokenize Shuffle

Here we include downloading DCLM data using AWS CLI. After running the below command, typically you would need to create a dataset json in exp_data/datasets/tokenized, but we provide one for you in this tutorial (exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json).

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs > rustup.sh
bash rustup.sh -y
source ~/.bashrc
pushd rust_processing/tokshuf-rs
cargo build --release

aws s3 cp --recursive s3://commoncrawl/contrib/datacomp/DCLM-baseline/global-shard_03_of_10/local-shard_1_of_10/ dclm_local
mkdir tokshuf_tmp
mkdir dclm_tokshuf

cargo run --release -- \
--input dclm_local \
--local-cell-dir tokshuf_tmp \
--output dclm_tokshuf \
--tokenizer "EleutherAI/gpt-neox-20b" \
--seqlen 2049 \
--wds-chunk-size 8192 \
--num-local-cells 512

popd

Training

data-config comes from the json created (manually for rust code, automatically for ray) after tokenize shuffle.

torchrun --nproc-per-node 8 -m training.train -- --scale 1b_1x_fast --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json --logs dclm_rs_tokshuf_training_local_logs --attn-name torch_attn --torchcompile

Evaluation

Run the below command with UUID substituted with the UUID of the model you want to evaluate, which can be found in the json in exp_data/models that was output from the previous training command.

python tools/eval_expdb.py --num_gpus 8 --no_skip --output_dir exp_data/evals/ --eval_yaml "eval/mmlu_and_lowvar.yaml" -f "uuid=UUID" --skip_perplexity

How to Contribute

We welcome contributions to improve the DCLM framework. Please follow our contributing guide for submitting pull requests and reporting issues.

How to Cite Us

If you use our dataset or models in your research, please cite us as follows:

@article{li2024datacomplm,
      title={DataComp-LM: In search of the next generation of training sets for language models},
      author={Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Gadre and Hritik Bansal and Etash Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton and Marianna Nezhurina and Amro Abbas and Cheng-Yu Hsieh and Dhruba Ghosh and Josh Gardner and Maciej Kilian and Hanlin Zhang and Rulin Shao and Sarah Pratt and Sunny Sanyal and Gabriel Ilharco and Giannis Daras and Kalyani Marathe and Aaron Gokaslan and Jieyu Zhang and Khyathi Chandu and Thao Nguyen and Igor Vasiljevic and Sham Kakade and Shuran Song and Sujay Sanghavi and Fartash Faghri and Sewoong Oh and Luke Zettlemoyer and Kyle Lo and Alaaeldin El-Nouby and Hadi Pouransari and Alexander Toshev and Stephanie Wang and Dirk Groeneveld and Luca Soldaini and Pang Wei Koh and Jenia Jitsev and Thomas Kollar and Alexandros G. Dimakis and Yair Carmon and Achal Dave and Ludwig Schmidt and Vaishaal Shankar},
      year={2024},
      journal={arXiv preprint arXiv:2406.11794}
}

When using the DCLM evaluation suite, please make sure to cite all the original evaluation papers. evaluation_bibtex.

When using DCLM for training, please make sure to cite the main training framework dependencies as well. training_bibtex.

License

This project is licensed under the MIT License. See the license file for details.