Awesome

Structure-Level Knowledge Distillation for Multilingual NLP

The code is mainly for our ACL 2020 paper: Structure-Level Knowledge Distillation For Multilingual Sequence Labeling A framework for training unified multilingual models with knowledge distillation, the code is mainly based on flair version 0.4.3 with a lot of modifications. In this repo, we include the following attributes:

Task	Monolingual	Multilingual	Finetuning	Knowledge Distillation	Notes
Sequence Labeling	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:	Structure-level knowledge distillation (Wang et al., 2020)
Dependency Parsing	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:	:x:	State-of-the-Art Parser for Enhanced Universal Dependencies in IWPT 2020 shared task (Wang et al., 2020) and State-of-the-Art Parser for Semantic Dependency Parsing (Wang et al., 2019)

Training Sequence Labelers

Requirements and Installation

The project is based on PyTorch 1.1+ and Python 3.6+.

pip install -r requirements.txt

Teacher Models

Let's train multilingual CoNLL named entity recognition (NER) model as an example. First we need to prepare the teacher models by downloading the pretrained teacher models on google drive and put these models in resources/taggers.

An alternative way is training the teacher models by yourself:

python train_with_teacher.py --config config/multi_bert_origflair_300epoch_2000batch_0.1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0.yaml
python train_with_teacher.py --config config/multi_bert_origflair_300epoch_2000batch_0.1lr_256hidden_en_monolingual_crf_sentloss_10patience_baseline_nodev_ner0.yaml
python train_with_teacher.py --config config/multi_bert_origflair_300epoch_2000batch_0.1lr_256hidden_es_monolingual_crf_sentloss_10patience_baseline_nodev_ner1.yaml
python train_with_teacher.py --config config/multi_bert_origflair_300epoch_2000batch_0.1lr_256hidden_nl_monolingual_crf_sentloss_10patience_baseline_nodev_ner1.yaml

Training the Multilingual Model without M-BERT finetuning

Knowledge Distillation

After all teacher models are ready, we can train the unified multilingual model with

Posterior distillation To reproduce the accuracy in our paper, run:

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_posterior_2.25temperature_old_relearn_nodev_fast_new_ner0.yaml

We also find that larger temperature can lead to better results:

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_posterior_4temperature_old_relearn_nodev_fast_new_ner0.yaml

Top-K distillation

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_1best_old_relearn_nodev_fast_new_ner0.yaml

Top-WK distillation

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_crfatt_old_relearn_nodev_fast_new_ner0.yaml

Posterior+Top-WK distillation

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_crfatt_posterior_4temperature_both_old_relearn_nodev_fast_new_ner1.yaml

Training the Multilingual Model with M-BERT finetuning

Finetuning M-BERT without the CRF layer

Following the example of transformers, we use a learning rate of 5e-5 for M-BERT finetuning, run:

python train_with_teacher.py --config config/multi_bert_10epoch_2000batch_0.00005lr_multilingual_nocrf_sentloss_baseline_fast_finetune_relearn_nodev_ner0.yaml

Finetuning M-BERT with the CRF layer

The key for finetuning M-BERT with the CRF layer is setting a larger learning rate for the transition table while the M-BERT layer with a small learning rate (0.5 here), to train the model, run:

python train_with_teacher.py --config config/multi_bert_10epoch_2000batch_0.00005lr_10000lrrate_5decay_800hidden_multilingual_crf_sentloss_baseline_fast_finetune_relearn_nodev_ner0.yaml

Posterior distillation To distill the posterior distribution with finetuning M-BERT model, run:

python train_with_teacher.py --config config/multi_bert_10epoch_10anneal_2000batch_0.00005lr_10000lrrate_5decay_800hidden_multilingual_crf_sentloss_distill_posterior_4temperature_fast_finetune_relearn_nodev_ner1.yaml

Performance

Performance on CoNLL-02/03 NER with finetuning M-BERT are (average over 3 runs):

Finetune	CRF	Knowledge Distillation	English	Dutch	Spanish	German	Average
:heavy_check_mark:	:x:	:x:	91.09	90.34	87.88	82.59	87.97
:heavy_check_mark:	:heavy_check_mark:	:x:	91.47	90.97	88.15	82.80	88.35
:heavy_check_mark:	:heavy_check_mark:	Posterior	91.63	91.38	88.78	83.21	88.75

Training Dependency Parsers

The dependency parsering module is based on the code of parser, our parser is also able to parse the semantic dependency parsing (Oepen et al., 2014) with second-order mean-field variational inference (Wang et al., 2019).

Multilingual Syntactic Dependency Parsing

For multilingal syntactic dependency parsing, we run on Universal Dependencies as an example:

python train_with_teacher.py --config config/multi_bert_1000epoch_0.5inter_3000batch_0.002lr_400hidden_multilingual_nocrf_fast_nodev_dependency0.yaml

Training the model with BERT finetuning:

python train_with_teacher.py --config config/multi_bert_10epoch_0.5inter_3000batch_0.00005lr_20lrrate_multilingual_nocrf_fast_warmup_freezing_beta_weightdecay_finetune_nodev_dependency15.yaml

Note: The performance of Monolingual models have not been evaluated yet, if you want to train a monolingual model, please try the configuration of parser.

Enhanced Universal Dependency (EUD) Parsing

To reproduce our results on EUD Parsing, we provide the conversion scripts for the official dataset. And we also provide our processed training/development/test set for the task. To train the model (here we take the Tamil dataset as an example), run (please refer to config for config files of other languages):

python train_with_teacher.py --config config/xlmr_word_origflair_1000epoch_0.1inter_2000batch_0.002lr_400hidden_ta_monolingual_nocrf_fast_2nd_unrel_250upsample_nodev_enhancedud27.yaml

As we described in the paper, we use the labeled F1 scores (originated from semantic dependency parsing) rather than ELAS for EUD training, therefore if you want to evaluate the ELAS score, first parse the graphs:

python train_with_teacher.py --config config/xlmr_word_origflair_1000epoch_0.1inter_2000batch_0.002lr_400hidden_ta_monolingual_nocrf_fast_2nd_unrel_250upsample_nodev_enhancedud27.yaml --parse --target_dir iwpt2020_test/ta --keep_order --batch_size 1000

Then evaluate the result by the official script: (Note that the official evaluation script does not check the connectivity, if you go strict process of official submission, please fix other validation issues manually. But for the ELAS, the connectivity does not affect the result a lot.)

Semantic Dependency Parsing (SDP)

The code for EUD parsing is also applicable for SDP parsing. We provide a PyTorch version of our second-order SDP parser (For the TensorFlow Version) here. However, we have not evaluate the performance on SDP datasets yet. You may need to modifiy some code and hyper-parameters to run on SDP datasets.

Others

Write Your Own Config File

We provide a detailed description of our config file in config.

GPU Memory

We have update the code for better GPU utilization, therefore training a multilingual sequence labeling with knowledge distillation only needs 8~9 GB for the GPU Memory now rather than 14~15 GB reported in the paper.

Faster Speed

We modified the code of flair for a signficantly faster training speed. For example, we update the CharacterEmbeddings class in embeddings.py to FastCharacterEmbeddings for significantly faster character embedding speed and the WordEmbeddings is updated to FastWordEmbeddings so that the word embeddings can be updated during training. For training sequence labelers, our code is more than 1.5 times faster than the origin version with word and character embeddings.

Citing Us

For Sequence Labelers

Please cite the following paper when training the multilingual sequence labeling models:

@inproceedings{wang-etal-2020-structure,
    title = "Structure-Level Knowledge Distillation For Multilingual Sequence Labeling",
    author = "Wang, Xinyu  and
      Jiang, Yong  and
      Bach, Nguyen  and
      Wang, Tao  and
      Huang, Fei  and
      Tu, Kewei",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.304",
    pages = "3317--3330",
    abstract = "Multilingual sequence labeling is a task of predicting label sequences using a single unified model for multiple languages. Compared with relying on multiple monolingual models, using a multilingual model has the benefit of a smaller model size, easier in online serving, and generalizability to low-resource languages. However, current multilingual models still underperform individual monolingual models significantly due to model capacity limitations. In this paper, we propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models (teachers) to the unified multilingual model (student). We propose two novel KD methods based on structure-level information: (1) approximately minimizes the distance between the student{'}s and the teachers{'} structure-level probability distributions, (2) aggregates the structure-level knowledge to local distributions and minimizes the distance between two local probability distributions. Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.",
}

For Dependency Parsers

If you feel the second-order semantic dependency parser helpful, please cite:

@inproceedings{wang-etal-2019-second,
    title = "Second-Order Semantic Dependency Parsing with End-to-End Neural Networks",
    author = "Wang, Xinyu  and
      Huang, Jingxian  and
      Tu, Kewei",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1454",
    pages = "4609--4618",}

@inproceedings{Wan:Liu:Jia:19,
  author = {Wang, Xinyu and Liu, Yixian and Jia, Zixia
            and Jiang, Chengyue and Tu, Kewei},
  title = {{ShanghaiTech} at {MRP}~2019:
           {S}equence-to-Graph Transduction with Second-Order Edge Inference
           for Cross-Framework Meaning Representation Parsing},
  booktitle = CONLL:19:U,
  address = L:CONLL:19,
  pages = {\pages{--}{55}{65}},
  year = 2019
}

If run experiments on Enhanced Universal Dependencies, please cite:

@inproceedings{wang-etal-2020-enhanced,
    title = "Enhanced {U}niversal {D}ependency Parsing with Second-Order Inference and Mixture of Training Data",
    author = "Wang, Xinyu  and
      Jiang, Yong  and
      Tu, Kewei",
    booktitle = "Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.iwpt-1.22",
    pages = "215--220",
    abstract = "This paper presents the system used in our submission to the \textit{IWPT 2020 Shared Task}. Our system is a graph-based parser with second-order inference. For the low-resource Tamil corpora, we specially mixed the training data of Tamil with other languages and significantly improved the performance of Tamil. Due to our misunderstanding of the submission requirements, we submitted graphs that are not connected, which makes our system only rank \textbf{6th} over 10 teams. However, after we fixed this problem, our system is 0.6 ELAS higher than the team that ranked \textbf{1st} in the official results.",
}

Contact

Please email your questions or comments to Xinyu Wang.