# Multidomain Language Modeling Data Utilities
This repository contains data utilities for "DEMix Layers: Disentangling Domains for Modular Language Modeling" (Gururangan et al., 2021).
This code is generic; it can be used to build any multi-domain, metadata-tagged dataset in a format compatible with Fairseq for language modeling. We also provide download links to the data necessary to reproduce results in the DEMix paper.
## General Overview
In the DEMix paper, we assume a sharded dataset structure across domains: the dataset is split among many folders, and each folder contains many files, each holding a single document. We found this format particularly amenable to efficient PyTorch data loading, and it follows the OpenWebText dataset format.
The processing steps below generally build the following files:

- A `shards/` folder, which contains a sharded version of the dataset for efficient PyTorch data loading.
- A `data-bin/` folder, which contains data binaries for training and evaluating language models in Fairseq.
- A `metadata/` folder, which contains `filenames.txt`, an index of the paths to all files in your dataset, and `metadata.jsonl`, a JSON-lines file that contains per-document metadata. The former is used for faster data loading, and the latter for finer-grained filtering of documents based on certain metadata.
In this tutorial, we use the example datasets in the `example_domains/` directory to build these necessary folders and files. You can use the same process on data of any size, provided that the original input data is in `.jsonl` format.
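For concreteness, the input is a JSON-lines file with one document per line. A minimal sketch of writing and reading such a file (the `label` field is a hypothetical piece of per-document metadata; only the text field name needs to match what you pass via `--text-field` below):

```python
import json

# Each line of the input file is one JSON object. The "text" field name
# matches the --text-field flag used in the commands below; "label" is a
# made-up example of per-document metadata.
docs = [
    {"text": "A great movie. Would watch again.", "label": "pos"},
    {"text": "Two hours I will never get back.", "label": "neg"},
]

with open("imdb_sample.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Reading it back: one json.loads per line.
with open("imdb_sample.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```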
## Installation

```bash
conda env create --name demix -f environment.yml
conda activate demix
```
First, set your `DATA_DIR` to the root directory where you will be housing the domain directories:

```bash
export DATA_DIR=$(pwd)/example_domains
```
## Download data

You can download example domains for this tutorial here:

```bash
bash scripts/download_example_domains.sh
```
We already include the legal contracts and ACL papers domains in the `example_domains/` directory.
Check this file for more information on how to download the data used in the DEMix paper.
## Preprocess data

Next, we want to preprocess all the datasets into Fairseq data-bins. We've made this easy with a script:

```bash
bash scripts/preprocess_example_domains.sh
```
Otherwise, you can follow along below to understand each preprocessing step.
We will first preprocess the `imdb` domain:

```bash
export DOMAIN=imdb
```
### Shard Data

```bash
python -m domain_loader.shard_dataset \
    --domain $DOMAIN \
    --input-file example_domains/$DOMAIN/$DOMAIN.jsonl \
    --batch-size 512 \
    --text-field text
```
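Conceptually, sharding turns the single `.jsonl` file into many small files grouped into folders, one document per file. A rough sketch of that layout (illustrative only; `domain_loader.shard_dataset` may use different naming and batching):

```python
import json
import tempfile
from pathlib import Path

def shard_jsonl(input_file, output_dir, text_field="text", batch_size=2):
    # One file per document, batch_size documents per shard folder.
    out = Path(output_dir)
    with open(input_file) as f:
        for i, line in enumerate(f):
            doc = json.loads(line)
            shard = out / f"shard_{i // batch_size}"
            shard.mkdir(parents=True, exist_ok=True)
            (shard / f"{i}.txt").write_text(doc[text_field])

# Toy input: five one-line documents.
tmp = Path(tempfile.mkdtemp())
src = tmp / "toy.jsonl"
src.write_text("\n".join(json.dumps({"text": f"doc {i}"}) for i in range(5)))

shard_jsonl(src, tmp / "shards", batch_size=2)
files = sorted(p.name for p in (tmp / "shards").rglob("*.txt"))
```

With `batch_size=2`, five documents land in three shard folders.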
### Build `metadata/filenames.txt`

To make data loading faster, we first gather a list of filenames in a separate file, `${DOMAIN}/metadata/filenames.txt`. To build this file, use `domain_loader/scan_filenames.py`:

```bash
python -m domain_loader.scan_filenames --domain $DOMAIN
```
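The index itself is just a flat list of document paths, so later passes can stream the list instead of re-walking the shard directories. A sketch of the idea (not the actual `domain_loader.scan_filenames` implementation):

```python
import tempfile
from pathlib import Path

def scan_filenames(shard_root, index_file):
    # One path per line; later steps read this list instead of
    # traversing the shard tree again.
    paths = sorted(str(p) for p in Path(shard_root).rglob("*.txt"))
    Path(index_file).write_text("\n".join(paths) + "\n")
    return paths

# Toy shard tree with two documents.
root = Path(tempfile.mkdtemp())
(root / "shard_0").mkdir()
(root / "shard_0" / "0.txt").write_text("first doc")
(root / "shard_0" / "1.txt").write_text("second doc")

index = root / "filenames.txt"
paths = scan_filenames(root, index)
```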
### Split data into train, dev, and test files

First, count the total whitespace tokens in a domain:

```bash
python -m domain_loader.count_words --domain $DOMAIN
```

Then use these word counts to set the total number of tokens for the train, dev, and test splits by editing `domain_loader/constants.py`.
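A whitespace token count here is just `len(text.split())` summed over documents, and the total lets you budget split sizes. A sketch (the budget numbers are made up, and the real `constants.py` format may differ):

```python
def count_whitespace_tokens(texts):
    # "Token" here means whitespace-separated word.
    return sum(len(t.split()) for t in texts)

docs = ["one two three", "four five", "six"]
total = count_whitespace_tokens(docs)

# With the total in hand, you might reserve fixed budgets for dev and
# test and give the remainder to train (hypothetical numbers):
DEV_TOKENS, TEST_TOKENS = 2, 2
train_tokens = total - DEV_TOKENS - TEST_TOKENS
```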
Then make the data splits:

```bash
python -m domain_loader.make_splits \
    --domain $DOMAIN \
    --num-workers 0 \
    --batch-size 1 \
    --output-dir $DATA_DIR/$DOMAIN/splits
```
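One simple way to realize those token budgets is to fill dev and then test greedily, file by file, and put everything left into train. This is a sketch of the idea, not necessarily what `domain_loader.make_splits` does:

```python
def make_splits(files_with_counts, dev_tokens, test_tokens):
    # Greedily fill dev, then test, up to their token budgets;
    # the remainder becomes train.
    train, dev, test = [], [], []
    dev_left, test_left = dev_tokens, test_tokens
    for name, n_tokens in files_with_counts:
        if dev_left > 0:
            dev.append(name)
            dev_left -= n_tokens
        elif test_left > 0:
            test.append(name)
            test_left -= n_tokens
        else:
            train.append(name)
    return train, dev, test

# Four files of 5 tokens each, with 5-token dev and test budgets.
files = [("0.txt", 5), ("1.txt", 5), ("2.txt", 5), ("3.txt", 5)]
train, dev, test = make_splits(files, dev_tokens=5, test_tokens=5)
```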
### Build fairseq data-bin

Download the GPT-2 vocabulary:

```bash
mkdir ${DATA_DIR}/gpt2_bpe
curl -Lo ${DATA_DIR}/gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
curl -Lo ${DATA_DIR}/gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
curl -Lo ${DATA_DIR}/gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
```
Then pretokenize and binarize the splits:

```bash
bash scripts/pretokenize.sh ${DATA_DIR}/$DOMAIN/splits
bash scripts/preprocess.sh ${DATA_DIR}/$DOMAIN/splits $DOMAIN ${DATA_DIR}/data-bin/
```

These scripts will output data-bin files in `${DATA_DIR}/data-bin/`, which you can use to train Fairseq LMs.
## Building multi-domain datasets

Building a multi-domain dataset follows the same procedure as above; you simply add multiple domains to the same data-bin folder (i.e., `${DATA_DIR}/data-bin/`).
You can apply the same process to all the other domains in the `example_domains/` folder, e.g.:
```bash
export DOMAIN=ag_news

python -m domain_loader.shard_dataset \
    --domain $DOMAIN \
    --input-file example_domains/$DOMAIN/$DOMAIN.jsonl \
    --batch-size 512 \
    --text-field text

python -m domain_loader.scan_filenames --domain $DOMAIN

python -m domain_loader.count_words --domain $DOMAIN

## set token counts for "ag_news" in domain_loader/constants.py

python -m domain_loader.make_splits \
    --domain $DOMAIN \
    --num-workers 0 \
    --batch-size 1 \
    --output-dir $DATA_DIR/$DOMAIN/splits

bash scripts/pretokenize.sh ${DATA_DIR}/$DOMAIN/splits
bash scripts/preprocess.sh ${DATA_DIR}/$DOMAIN/splits $DOMAIN ${DATA_DIR}/data-bin/
```
Check out `scripts/preprocess_example_domains.sh` for other examples.
## Train a multi-domain LM

Check out the DEMix repo to see how to train an LM on these data-bins.