# Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction
## Installation

The following commands install all necessary packages:

```
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
The project was tested using Python 3.7.
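To keep dependencies isolated, a minimal setup sketch (assuming `python3.7` is on your PATH; the environment name is arbitrary):

```bash
# Create and activate a Python 3.7 virtual environment
python3.7 -m venv gec-env
source gec-env/bin/activate

# Install the dependencies and the spaCy English model
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```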
## Datasets

All public GEC datasets can be downloaded here.<br>
Knowledge-distilled datasets can be downloaded here.<br>
Synthetic datasets created with PIE can be generated or downloaded here.<br>
To train the model, the data has to be preprocessed and converted to a special format with the following command:

```
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
```
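For example, a hypothetical invocation (the file names below are placeholders, not files shipped with this repo):

```bash
# SOURCE holds the original sentences, TARGET the corrected ones, one per line
python utils/preprocess_data.py \
    -s data/train.src \
    -t data/train.tgt \
    -o data/train_preprocessed.txt
```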
## Pretrained models

All available pretrained models can be downloaded here.<br>
| Pretrained encoder | Confidence bias | Min error prob | BEA-2019 (test) |
| --- | --- | --- | --- |
| RoBERTa [[link]](https://drive.google.com/drive/folders/1Si2hwmskb7QxqSFtPBsivl_FujkR3p6l?usp=sharing) | 0.1 | 0.65 | 73.1 |
| Large RoBERTa voc 10k + DeBERTa voc 10k + XLNet voc 5k [[link]](https://drive.google.com/drive/folders/1SzkzVdjP30eWpHUvP5-BXMWu3szsf9Rt?usp=sharing) | 0.3 | 0.55 | 76.05 |

## Train model
To train the model, simply run:

```
python train.py --train_set TRAIN_SET --dev_set DEV_SET \
    --model_dir MODEL_DIR
```
There are many parameters you can specify; among them:

- `cold_steps_count`: the number of epochs during which only the last linear layer is trained
- `transformer_model` {bert, roberta, deberta, xlnet, bert-large, roberta-large, deberta-large, xlnet-large}: the model encoder
- `tn_prob`: the probability of sampling sentences with no errors; helps to balance precision/recall
In our experiments, we used a 98/2 train/dev split at each training stage.
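Putting this together, a sketch of a training run (assuming the parameters listed above are passed as flags of the same name, in GECToR-style CLI fashion; all paths and values are illustrative):

```bash
# Fine-tune a large RoBERTa encoder on preprocessed data
python train.py \
    --train_set data/train_preprocessed.txt \
    --dev_set data/dev_preprocessed.txt \
    --model_dir models/roberta-large-gec \
    --transformer_model roberta-large \
    --cold_steps_count 2 \
    --tn_prob 0
```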
## Model inference

To run your model on the input file, use the following command:

```
python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
    --vocab_path VOCAB_PATH --input_file INPUT_FILE \
    --output_file OUTPUT_FILE
```
Among the parameters:

- `min_error_probability`: the minimum error probability (as in the paper)
- `additional_confidence`: the confidence bias (as in the paper)
- `special_tokens_fix`: needed to reproduce some reported results of pretrained models
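For instance, to reproduce the single RoBERTa row from the table above, a sketch (assuming these parameters map to flags of the same name; the checkpoint and file paths are placeholders):

```bash
# Single RoBERTa checkpoint with the inference tweaks from the table
python predict.py \
    --model_path models/roberta_1_gector.th \
    --vocab_path data/output_vocabulary \
    --input_file data/input.txt \
    --output_file data/predictions.txt \
    --additional_confidence 0.1 \
    --min_error_probability 0.65 \
    --special_tokens_fix 1
```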
## Ensembling by averaging of output tag probabilities

For evaluating an ensemble, you need to name your models like "xlnet_1_SOMETHING.th" and "roberta_1_SOMETHING.th" and pass them all to the `model_path` parameter. You also need to set the `is_ensemble` parameter.
```
python predict.py --model_path MODEL_PATH MODEL_PATH [MODEL_PATH ...] \
    --vocab_path VOCAB_PATH --input_file INPUT_FILE \
    --output_file OUTPUT_FILE \
    --is_ensemble 1
```
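A concrete sketch with two hypothetical checkpoints that follow the naming convention above (all file names are placeholders):

```bash
# Average output tag probabilities of an XLNet and a RoBERTa checkpoint
python predict.py \
    --model_path models/xlnet_1_best.th models/roberta_1_best.th \
    --vocab_path data/output_vocabulary \
    --input_file data/input.txt \
    --output_file data/predictions_avg.txt \
    --is_ensemble 1
```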
## Ensembling by majority votes on output edit spans

For this ensemble, you first need to generate output files with single models and then combine these files with the following script:
```
python ensemble.py --source_file SOURCE_FILE \
    --target_files TARGET_FILE TARGET_FILE [TARGET_FILE ...] \
    --output_file OUTPUT_FILE
```
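For example, combining the outputs of three single-model runs (all file names are placeholders):

```bash
# Majority vote over edit spans extracted from three prediction files
python ensemble.py \
    --source_file data/input.txt \
    --target_files data/pred_roberta.txt data/pred_xlnet.txt data/pred_deberta.txt \
    --output_file data/predictions_voted.txt
```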
## Evaluation

For evaluation, we use [ERRANT](https://github.com/chrisjbryant/errant).
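A typical ERRANT scoring workflow, as a sketch (assuming ERRANT is installed, e.g. via `pip install errant`; file names are placeholders):

```bash
# Convert source/hypothesis and source/reference pairs into M2 edit files
errant_parallel -orig data/input.txt -cor data/predictions.txt -out hyp.m2
errant_parallel -orig data/input.txt -cor data/references.txt -out ref.m2

# Score hypothesis edits against reference edits (reports P, R, F0.5)
errant_compare -hyp hyp.m2 -ref ref.m2
```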
## Citation

If you find this work useful for your research, please cite our paper:
```bibtex
@inproceedings{tarnavskyi-etal-2022-improved-gector,
    title = "Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction",
    author = "Tarnavskyi, Maksym and Chernodub, Artem and Omelianchuk, Kostiantyn",
    booktitle = "Accepted for publication at the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    url = "https://arxiv.org/pdf/2203.13064.pdf",
    abstract = "In this paper, we investigate improvements to the GEC sequence tagging architecture with a focus on ensembling of recent cutting-edge Transformer-based encoders in Large configurations. We encourage ensembling models by majority votes on span-level edits because this approach is tolerant to the model architecture and vocabulary size. Our best ensemble achieves a new SOTA result with an F0.5 score of 76.05 on BEA-2019 (test), even without pretraining on synthetic datasets. In addition, we perform knowledge distillation with a trained ensemble to generate new synthetic training datasets, {Troy-Blogs} and {Troy-1BW}. Our best single sequence tagging model that is pretrained on the generated Troy- datasets in combination with the publicly available synthetic PIE dataset achieves a near-SOTA result with an F0.5 score of 73.21 on BEA-2019 (test). The code, datasets, and trained models are publicly available.",
}
```