# BabyLM Data Preprocessing
This repo contains instructions for reproducing the preprocessing pipeline for the BabyLM challenge.
NOTE: It is not necessary to reproduce these steps to use the BabyLM data. The complete preprocessed dataset can be downloaded directly from the BabyLM website!
## Initialize stuff

Start out in the `data_preprocessing` directory, which contains this `README.md`.
```bash
conda create --name babylm_preprocessing
conda activate babylm_preprocessing
PROJECT_DIR=<YOUR_PATH_TO>/babylm_data_preprocessing
mkdir ${PROJECT_DIR}/tmp
mkdir ${PROJECT_DIR}/preprocessed_data
```
## Direct downloads of preprocessed data
Some of the data is already available in a nicely preprocessed form. Let's download those sources first.
These first three sources were preprocessed by others:
### CHILDES
```bash
cd ${PROJECT_DIR}/preprocessed_data
curl https://raw.githubusercontent.com/phueb/BabyBERTa/master/data/corpora/aochildes.txt > aochildes.txt
```
### Switchboard
```bash
cd ${PROJECT_DIR}/preprocessed_data
curl https://raw.githubusercontent.com/NathanDuran/Switchboard-Corpus/master/swda_data/full_set.txt > switchboard.txt
```
### Children's Book Test
```bash
cd ${PROJECT_DIR}/tmp
curl http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz > CBTest.tgz
tar -xvzf CBTest.tgz
mv CBTest/data/cbt_* ${PROJECT_DIR}/preprocessed_data/
```
The next source is also available in preprocessed form, but the original download link doesn't work with `curl`. So I've uploaded these files to Google Drive, where they can be downloaded easily from the command line with `gdown` (installable via `pip install gdown`):
### Children stories
The original link, which doesn't work with `curl`: https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus/download?datasetVersionNumber=1
```bash
cd ${PROJECT_DIR}/preprocessed_data
gdown 1nbUCWCAvtqI1-WQxzmyqQmddgsZtzdpR
unzip children_stories.txt.zip
rm children_stories.txt.zip
```
The next two files were preprocessed by Haau-Sing Li for a previous project and shared directly with Alex Warstadt. We do not have easy access to the preprocessing pipeline, so we share the preprocessed files directly on Google Drive:
### OpenSubtitles
Preprocessing for OpenSubtitles included removing duplicate or near-duplicate documents.
```bash
cd ${PROJECT_DIR}/tmp
gdown 1vW0o7K6Gj_IYTzriWEjmCnrylCWb8DbY
unzip open_subtitles.txt.zip
mv open_subtitles.txt ../preprocessed_data
```
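`open_subtitles.txt` is shared already deduplicated, so there is nothing extra to run here. Purely for reference, here is a minimal sketch of the idea behind exact-duplicate removal (the original pipeline, which we don't have, also handled near-duplicates, which would need something like MinHash or fuzzy matching on top of this). The file names and the one-document-per-line assumption are illustrative, not a description of the actual pipeline.

```python
# Illustrative sketch only: drop exact duplicate documents from a corpus file.
# Assumes one document per line. This is NOT the script used for the release,
# which also removed near-duplicate documents.
import hashlib

def dedupe(in_path, out_path):
    seen = set()
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # Normalize whitespace and case so trivial variants hash identically.
            key = hashlib.md5(" ".join(line.lower().split()).encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                fout.write(line)

if __name__ == "__main__":
    dedupe("open_subtitles.txt", "open_subtitles.dedup.txt")
```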
### Wikipedia
```bash
cd ${PROJECT_DIR}/preprocessed_data
gdown 19GipY95MW3LrfO_kArmIC0KYy7mfCb1l
unzip wikipedia.txt.zip
rm wikipedia.txt.zip
```
## Datasets that require substantial preprocessing
### QED
The original link, which doesn't work with `curl`: https://opus.nlpl.eu/download.php?f=QED/v2.0a/xml/en.zip
```bash
cd ${PROJECT_DIR}/tmp
gdown 1R2xWtNeVX48RiFA7vErL1pNtws3XEsYP
unzip qed.zip
cd ${PROJECT_DIR}
python preprocess_qed.py tmp/en tmp/qed
cat tmp/qed/* >> preprocessed_data/qed.txt
```
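`preprocess_qed.py` does the real work here; since its internals are not shown in this README, the following is only a rough sketch of the kind of extraction involved, assuming OPUS-style XML in which each sentence is an `<s>` element. It is not the release script.

```python
# Illustrative sketch only: extract plain text from OPUS-style XML files.
# Assumes sentences are <s> elements; see preprocess_qed.py for the real logic.
import sys
import pathlib
import xml.etree.ElementTree as ET

def extract_sentences(xml_path):
    sentences = []
    for _, elem in ET.iterparse(str(xml_path)):
        if elem.tag == "s":
            text = " ".join("".join(elem.itertext()).split())
            if text:
                sentences.append(text)
            elem.clear()  # keep memory use bounded on large files
    return sentences

if __name__ == "__main__":
    in_dir, out_dir = pathlib.Path(sys.argv[1]), pathlib.Path(sys.argv[2])
    out_dir.mkdir(parents=True, exist_ok=True)
    for xml_path in sorted(in_dir.rglob("*.xml")):
        sentences = extract_sentences(xml_path)
        if sentences:
            out_file = out_dir / (xml_path.stem + ".txt")
            out_file.write_text("\n".join(sentences) + "\n", encoding="utf-8")
```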
### Simple Wiki
```bash
cd ${PROJECT_DIR}/tmp
curl https://dumps.wikimedia.org/simplewiki/20221201/simplewiki-20221201-pages-articles.xml.bz2 > wiki.bz2
bzip2 -d wiki.bz2
# requires the wikiextractor package (pip install wikiextractor)
python -m wikiextractor.WikiExtractor wiki
cd $PROJECT_DIR
python preprocess_simple_wiki.py
```
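`preprocess_simple_wiki.py` is likewise not reproduced here. As a rough idea of what this step involves, the sketch below strips wikiextractor's `<doc>` wrappers and concatenates the extracted articles into one file; the `tmp/text` input directory and the output file name are assumptions, not necessarily what the script actually does.

```python
# Illustrative sketch only: flatten wikiextractor output into one text file.
# Paths and the output name are assumptions; see preprocess_simple_wiki.py
# for the script actually used for the release.
import pathlib

EXTRACTED_DIR = pathlib.Path("tmp/text")                      # wikiextractor's default output dir (assumed)
OUT_FILE = pathlib.Path("preprocessed_data/simple_wiki.txt")  # assumed output name

with OUT_FILE.open("w", encoding="utf-8") as out:
    for part in sorted(EXTRACTED_DIR.rglob("wiki_*")):
        for line in part.read_text(encoding="utf-8").splitlines():
            # wikiextractor wraps each article in <doc ...> ... </doc>
            if line.startswith("<doc") or line.startswith("</doc"):
                continue
            if line.strip():
                out.write(line + "\n")
```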
### Spoken BNC
Initialize stuff & download the data:
```bash
BNC_TMP=${PROJECT_DIR}/tmp/bnc_spoken
mkdir $BNC_TMP
cd $BNC_TMP
curl https://llds.ling-phil.ox.ac.uk/llds/xmlui/bitstream/handle/20.500.14106/2554/2554.zip > bnc.zip
unzip -q bnc.zip
rm bnc.zip
```
Now it's time to select only the `.xml` files that came from the spoken domain:
```bash
# Keep only the files whose second line starts with <stext>, i.e. the spoken-domain texts.
(
for z in download/Texts/*;
    do for y in $z/*;
        do for x in $y/*;
            do sed '2q;d' $x | grep "^<stext" -q && cp $x ${BNC_TMP};
        done;
    done;
done
)
rm -rf download
```
Finally, run this nice Python script to extract the text from the `.xml` files:
```bash
cd ${PROJECT_DIR}
python preprocess_bnc.py tmp/bnc_spoken/ bnc_spoken.txt
rm -rf tmp/bnc_spoken
```
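`preprocess_bnc.py` is where the extraction actually happens. As a rough illustration only, here is a sketch that pulls utterance text out of BNC-style XML, assuming utterances are `<u>` elements; it is not the release script.

```python
# Illustrative sketch only: write one utterance per line from BNC-style XML.
# Assumes utterances are <u> elements; see preprocess_bnc.py for the real logic.
import sys
import pathlib
import xml.etree.ElementTree as ET

def utterances(xml_path):
    tree = ET.parse(str(xml_path))
    for u in tree.iter("u"):
        text = " ".join("".join(u.itertext()).split())
        if text:
            yield text

if __name__ == "__main__":
    in_dir, out_path = pathlib.Path(sys.argv[1]), sys.argv[2]
    with open(out_path, "w", encoding="utf-8") as out:
        for xml_path in sorted(in_dir.glob("*.xml")):
            for utt in utterances(xml_path):
                out.write(utt + "\n")
```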
### Gutenberg
Here we just follow the README in the `gutenberg` repo. Any issues should be directed to the authors of `gutenberg`.
```bash
cd ${PROJECT_DIR}
git clone https://github.com/pgcorpus/gutenberg.git
cd gutenberg

# To install any missing dependencies, just run
pip install -r requirements.txt

# Getting & processing the data
python get_data.py
# This will download a copy of all UTF-8 books in PG and will create a csv file with metadata (e.g. author, title, year, ...).
# Note that if you already have some of the data, the program will only download the books you are missing (it uses rsync for this), so it is easy to keep the dataset up-to-date by re-running get_data.py.
python process_data.py
```
Now we grab the English portion and put it all together in one file:
```bash
cd ${PROJECT_DIR}
python get_gutenberg_modern_en.py
cat tmp/gutenberg_modern_en/* >> preprocessed_data/gutenberg.txt
```
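`get_gutenberg_modern_en.py` is not reproduced in this README. The sketch below shows the general shape of such a filter, assuming the metadata CSV written by `process_data.py` has `language` and `authoryearofbirth` columns and that processed texts live under `gutenberg/data/text/`; the paths, column names, and year cut-off are illustrative assumptions, not the criteria used for the release.

```python
# Illustrative sketch only: copy English books by relatively modern authors
# into tmp/gutenberg_modern_en. Paths, column names, and the year cut-off are
# assumptions; see get_gutenberg_modern_en.py for the real selection criteria.
import csv
import pathlib
import shutil

METADATA = pathlib.Path("gutenberg/metadata/metadata.csv")  # assumed location
TEXT_DIR = pathlib.Path("gutenberg/data/text")              # assumed location
OUT_DIR = pathlib.Path("tmp/gutenberg_modern_en")
OUT_DIR.mkdir(parents=True, exist_ok=True)

with METADATA.open(encoding="utf-8") as f:
    for row in csv.DictReader(f):
        try:
            born = float(row.get("authoryearofbirth") or "nan")
        except ValueError:
            continue
        if "en" in (row.get("language") or "") and born >= 1800:  # illustrative cut-off
            src = TEXT_DIR / f"{row.get('id')}_text.txt"          # assumed naming scheme
            if src.exists():
                shutil.copy(src, OUT_DIR / src.name)
```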
## Sampling and splitting data
```bash
cd ${PROJECT_DIR}
. sample_chunks_and_split.sh
. sample_chunks_and_split_small.sh
```
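The two scripts above define the actual sampling and splitting, and they are not reproduced in this README. Purely as an illustration of the general idea (sampling contiguous chunks from each corpus and dividing them into splits so that local context stays intact), here is a sketch with made-up chunk sizes, split names, and ratios.

```python
# Illustrative sketch only: sample contiguous chunks from each corpus and
# divide them into splits. Chunk size, split names, and ratios are made up;
# see sample_chunks_and_split.sh for what the release actually does.
import pathlib
import random

CHUNK_LINES = 2000                                # illustrative chunk size
SPLITS = {"train": 0.8, "dev": 0.1, "test": 0.1}  # illustrative ratios
random.seed(11)

for corpus in sorted(pathlib.Path("preprocessed_data").glob("*.txt")):
    lines = corpus.read_text(encoding="utf-8").splitlines()
    chunks = [lines[i:i + CHUNK_LINES] for i in range(0, len(lines), CHUNK_LINES)]
    random.shuffle(chunks)  # sample whole chunks rather than individual lines
    start = 0
    for split, frac in SPLITS.items():
        n = max(1, round(len(chunks) * frac))
        out = pathlib.Path(split) / corpus.name
        out.parent.mkdir(exist_ok=True)
        text = "\n".join(line for chunk in chunks[start:start + n] for line in chunk)
        out.write_text(text + "\n", encoding="utf-8")
        start += n
```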