STIL - Simultaneous Slot Filling, Translation, Intent Classification, and Language Identification: Initial Results using mBART on MultiATIS++
A paper presented at the Asia-Pacific Chapter of the Association for Computational Linguistics (AACL) 2020
Jack FitzGerald
Copyright Amazon.com Inc. or its affiliates
This repository contains some of the code used in the paper named above. Its sole purpose is to allow other researchers to reproduce the study and results presented in the paper; Jack and Amazon are unlikely to improve this repo over time.
Setup
I used AWS SageMaker for this work: an m5.xlarge instance for data preparation and analysis, and a p3.16xlarge instance for training.
Download the pretrained model
This study uses the pretrained mBART CC25 model, which can be downloaded from the fairseq repository.
Create the datasets
The dataset is based on the MultiATIS++ and MultiATIS datasets.
MultiATIS++
The 2020 paper by Xu et al. entitled "End-to-End Slot Alignment and Recognition for Cross-Lingual NLU" describes the dataset in more detail. As of writing, the dataset was still under review by LDC. Please contact saabm@amazon.com to obtain a copy.
Once you have the data, place it in a folder called `MultiATISpp-RAW/`.
At the time of my research, there were a small number of English alignment problems with the Japanese, Hindi, and Turkish data. For this reason, Japanese was excluded, and the Hindi and Turkish data from MultiATIS were used instead. To ensure a fair comparison with the work by Xu et al., the validation set must be extracted from MultiATIS++ and those examples removed from the MultiATIS training set. Create another folder called `hi_tr_devsets/` containing those two dev files.
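The filtering step can be sketched roughly as follows. This is a minimal illustration, not the actual preprocessing script; it assumes overlapping examples are matched by utterance text, which may differ from what the repo's scripts actually do:

```python
def remove_dev_overlap(train_rows, dev_rows, key_index=1):
    """Drop from the MultiATIS training rows any example whose utterance
    (assumed to live in column `key_index`) also appears in the dev set
    extracted from MultiATIS++."""
    dev_utterances = {row[key_index] for row in dev_rows}
    return [row for row in train_rows if row[key_index] not in dev_utterances]
```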
The TSVs should have the following columns, and headers should be included in the files.
id
utterance
slot_labels
intent
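Because the header row is present, a file in this format can be read directly into dicts keyed by those column names. The helper below is my own sketch, not part of the repo:

```python
import csv

def read_multiatispp_tsv(path):
    """Read a MultiATIS++ TSV (header row included) and return a list of
    dicts keyed by the columns: id, utterance, slot_labels, intent."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```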
MultiATIS
MultiATIS is available from LDC. Please place the TSVs for Hindi and Turkish in a folder called `MultiATIS-RAW/`. The TSVs should have the following columns, and headers should be excluded.
English utterance
English annotations
machine translation back to English
intent
non-English utterance
non-English annotations
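Since these files have no header row, reading them requires supplying column names explicitly. The names below are my own labels for the six columns, not official ones:

```python
import csv

# Hypothetical labels for the six headerless MultiATIS columns, in order.
MULTIATIS_COLUMNS = [
    "english_utterance", "english_annotations",
    "back_translation", "intent",
    "utterance", "annotations",
]

def read_multiatis_tsv(path):
    """Read a headerless MultiATIS TSV, assigning the column names above."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, fieldnames=MULTIATIS_COLUMNS, delimiter="\t"))
```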
Directory tree for raw data
MultiATISpp-RAW/
|- dev_DE.tsv
|- dev_EN.tsv
|- dev_ES.tsv
|- dev_FR.tsv
|- dev_ZH.tsv
|- test_DE.tsv
|- test_EN.tsv
|- test_ES.tsv
|- test_FR.tsv
|- test_ZH.tsv
|- train_DE.tsv
|- train_EN.tsv
|- train_ES.tsv
|- train_FR.tsv
|- train_ZH.tsv
MultiATIS-RAW/
|- Hindi-test.tsv
|- Hindi-train_1600.tsv
|- Turkish-test.tsv
|- Turkish-train_638.tsv
hi_tr_devsets/
|- dev_HI.tsv
|- dev_TR.tsv
Create the STIL dataset
To create the STIL dataset, run the `preprocess_atis_stil.py` script on the data described above, e.g.:
python path/to/preprocess_atis_stil.py MultiATISpp-RAW/ MultiATIS-RAW/ hi_tr_devsets/ MultiATISpp-FLAT/
Create the traditional NLU dataset
To create the traditional NLU dataset (no translation of the slots), run the `preprocess_atis_traditional.py` script on the data described above, e.g.:
python path/to/preprocess_atis_traditional.py MultiATISpp-RAW/ MultiATIS-RAW/ hi_tr_devsets/ MultiATISpp-FLAT/
Tokenize the dataset
The mBART model uses sentencepiece tokenization. More information can be found in the sentencepiece repo. The following commands can be used to build sentencepiece.
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v
pip install sentencepiece
Once sentencepiece has been built, tokenize the datasets:
SPM=path/to/sentencepiece/build/src/spm_encode
MODEL=path/to/mbart.cc25/sentence.bpe.model
DATA_PATH=path/to/MultiATISpp-FLAT
for SPLIT in train dev test; do for INOUT in input output; do $SPM --model=$MODEL < ${DATA_PATH}/${SPLIT}.${INOUT} > ${DATA_PATH}/${SPLIT}.spm.${INOUT}; done; done
Binarize the data
The model requires binarized data.
The `gcc` on my SageMaker instances was too old; upgrade it first if needed:
conda install -c psi4 gcc-5
Install fairseq as editable:
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable .
Run binarization:
FAIRSEQ_PATH=path/to/fairseq
DATA_PATH=path/to/tokenized_data
DICT_PATH=path/to/mbart.cc25
python ${FAIRSEQ_PATH}/preprocess.py --source-lang input --target-lang output --trainpref ${DATA_PATH}/train.spm --validpref ${DATA_PATH}/dev.spm --testpref ${DATA_PATH}/test.spm --srcdict $DICT_PATH/dict.txt --tgtdict $DICT_PATH/dict.txt --workers 8 --destdir MultiATISpp-BIN
Train the mBART model
I used a p3.16xlarge instance for training, which has 8 NVIDIA V100 GPUs. With `--max-sentences 2` and `--update-freq 2` across 8 GPUs, the effective batch size is 2 × 2 × 8 = 32.
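The batch-size arithmetic (sentences per GPU per step, times gradient-accumulation steps, times number of GPUs) can be written out as a trivial check:

```python
def effective_batch_size(max_sentences, update_freq, num_gpus):
    """Effective batch size for multi-GPU fairseq training:
    per-GPU sentences x gradient-accumulation steps x GPU count."""
    return max_sentences * update_freq * num_gpus

print(effective_batch_size(2, 2, 8))  # 32
```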
PRETRAINED_BART=path/to/mbart.cc25
DATA_PATH=path/to/data-bin
FAIRSEQ_PATH=path/to/fairseq
CHECKPOINT_PATH=path/to/checkpoints
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
python ${FAIRSEQ_PATH}/train.py ${DATA_PATH} --num-workers 32 --encoder-normalize-before --decoder-normalize-before --arch mbart_large --task translation_from_pretrained_bart --source-lang input --target-lang output --criterion label_smoothed_cross_entropy --label-smoothing 0.2 --dataset-impl mmap --optimizer adam --adam-eps 1e-08 --adam-betas '(0.9, 0.999)' --lr-scheduler polynomial_decay --lr 3e-05 --min-lr -1 --warmup-updates 936 --total-num-update 20000 --dropout 0.2 --attention-dropout 0.1 --weight-decay 0.01 --max-sentences 2 --update-freq 2 --save-interval 1 --max-epoch 40 --save-dir ${CHECKPOINT_PATH} --validate-interval 1 --seed 222 --log-format json --log-interval 60 --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler --restore-file ${PRETRAINED_BART}/model.pt --langs $langs --layernorm-embedding --ddp-backend no_c10d --memory-efficient-fp16 |& tee train_history.log
Watch GPU utilization:
nvidia-smi -l
Parse training and validation losses as tables from the log file:
python path/to/parse_fairseq_train_logs.py train_history.log prefix_for_output_file
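For reference, log parsing of this kind might look like the sketch below. This is not the repo's `parse_fairseq_train_logs.py`; the JSON-object-in-line format comes from `--log-format json`, but key names such as `loss` are assumptions that vary across fairseq versions:

```python
import json
import re

def parse_losses(log_path):
    """Extract (epoch, loss) pairs from a fairseq log written with
    --log-format json, where progress lines contain a JSON object."""
    rows = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            match = re.search(r"\{.*\}", line)
            if not match:
                continue
            try:
                record = json.loads(match.group(0))
            except json.JSONDecodeError:
                continue
            if "loss" in record:
                rows.append((record.get("epoch"), float(record["loss"])))
    return rows
```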
Run inference on the test data
This command will use 8 shards of data. Be sure to pick the right model checkpoint based on validation curves, etc.
for SHARD_ID in {0..7}; do (CUDA_VISIBLE_DEVICES=$SHARD_ID python $FAIRSEQ_PATH/generate.py data-bin/ --path $CHECKPOINT_PATH/checkpoint19.pt --task translation_from_pretrained_bart --gen-subset test -t output -s input --sacrebleu --remove-bpe 'sentencepiece' --langs $langs --memory-efficient-fp16 --max-sentences 64 --num-workers 4 --num-shards 8 --shard-id $SHARD_ID |& tee hyps_test19_${SHARD_ID}.log &); done
Combine the data from the 8 shards into one file:
for file in hyps_test19_*; do cat $file >> hyps_test_epoch19.log; done
rm hyps_test19_*
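fairseq's `generate.py` prints hypothesis lines prefixed with `H-<id>`, so the hypotheses can be recovered from the combined log in original test-set order regardless of how the shards interleaved. The helper below is my own sketch, not part of the repo:

```python
def collect_hypotheses(log_path):
    """Collect hypotheses from a (concatenated) fairseq generate log.
    Hypothesis lines look like 'H-<id>\t<score>\t<text>'; sorting by id
    restores the original test-set order across shards."""
    hyps = {}
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("H-"):
                tag, _score, text = line.rstrip("\n").split("\t", 2)
                hyps[int(tag[2:])] = text
    return [hyps[i] for i in sorted(hyps)]
```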
License
See the file entitled LICENSE
Note: This work depends on fairseq and sentencepiece, which were licensed under the MIT License and the Apache 2.0 license, respectively, at the time this work was conducted.
Citation
@inproceedings{fitzgerald2020mbartmultiatis,
title = {STIL - Simultaneous Slot Filling, Translation, Intent Classification, and Language Identification: Initial Results using mBART on MultiATIS++},
author = {Jack G. M. FitzGerald},
booktitle = {Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
year = {2020},
url = {https://arxiv.org/abs/2010.00760}
}