Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models
This is the official code for the paper "Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models".
We propose a method that continually identifies a model's weak spots in order to generate more valuable training instances, and we apply a task-specific pre-training strategy to enhance the model. Experimental results show that this adversarial training method, combined with the pre-training strategy, improves both the generalization and the robustness of multiple CSC models across three different datasets, achieving state-of-the-art performance on the CSC task.
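To make the idea of "identifying weak spots" concrete, here is a toy sketch in Python. It is not the algorithm from the paper or from this repository: it only assumes that some CSC model has already produced one confidence score per character, and that a confusion set of easily confused characters is available. The function name, the scores, and the confusion set below are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def make_adversarial_example(
    sentence: str,
    confidences: Sequence[float],
    confusion_set: Dict[str, List[str]],
) -> Tuple[str, int]:
    """Replace the character the model is least sure about with a confusable one.

    Returns the perturbed sentence and the edited position (-1 if nothing changed).
    """
    if len(sentence) != len(confidences):
        raise ValueError("need exactly one confidence score per character")
    # Visit positions from least to most confident: the weak spots come first.
    for pos in sorted(range(len(sentence)), key=lambda i: confidences[i]):
        candidates = confusion_set.get(sentence[pos], [])
        if candidates:
            # Toy choice: take the first confusable character for this position.
            return sentence[:pos] + candidates[0] + sentence[pos + 1:], pos
    return sentence, -1


# Hypothetical inputs: in practice the scores would come from a CSC model.
sentence = "我喜欢音乐"
confidences = [0.99, 0.95, 0.97, 0.40, 0.90]
confusion_set = {"音": ["因", "阴"], "乐": ["了"]}
print(make_adversarial_example(sentence, confidences, confusion_set))
```

The perturbed sentence, paired with the original as its correction, would then serve as an extra training instance. A real implementation would likely choose the replacement by querying the model rather than taking the first candidate.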
Requirements
For BERT and Soft-Masked BERT:
- python==3.7
- pytorch==1.4.0
- transformers==3.4.0
For SpellGCN, we borrow some code from SpellGCN, so our requirements are the same as theirs:
- Tensorflow==1.13.1
- python==2.7
- "BERT-Base, Chinese" from google-research
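As a convenience, the version pins for the BERT and Soft-Masked BERT environment can be checked with a short script. This is only a sketch and not part of the repository: the helper name `find_mismatches` is hypothetical, and the pip package name `torch` is assumed for the PyTorch requirement. It targets the Python 3 environment only.

```python
from typing import Callable, Dict, List

try:  # Python 3.8+
    from importlib.metadata import version as installed_version
except ImportError:  # Python 3.7: fall back to setuptools
    import pkg_resources

    def installed_version(name: str) -> str:
        return pkg_resources.get_distribution(name).version


# Pins taken from the requirements list above (pip package names assumed).
BERT_PINS = {"torch": "1.4.0", "transformers": "3.4.0"}


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], str] = installed_version,
) -> List[str]:
    """Return one message per package that is missing or has the wrong version."""
    problems = []
    for name, wanted in pins.items():
        try:
            found = lookup(name)
        except Exception:
            problems.append(f"{name}: not installed, want {wanted}")
            continue
        # Some torch builds carry a local suffix such as "1.4.0+cu100"; ignore it.
        if found.split("+")[0] != wanted:
            problems.append(f"{name}: found {found}, want {wanted}")
    return problems


if __name__ == "__main__":
    for message in find_mismatches(BERT_PINS):
        print(message)
```

Run it inside the activated environment; no output means both pins match.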
How to run?
1. Prepare the datasets:
- For pre-training:
- For training:
  - Download the additional 270K data samples from here.
  - Extract the training samples from the file "train.sgml".
- Note: The data samples mentioned above are not included in this repository because we do not have permission to redistribute them.
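The extraction step for "train.sgml" could look roughly like the Python sketch below. The tag layout (`SENTENCE`, `TEXT`, `MISTAKE`, `LOCATION`, `WRONG`, `CORRECTION`) and the 1-based `LOCATION` index are assumptions about the file format, not something this repository documents, so check them against your copy of the data. The function name `extract_pairs` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed SGML layout: one <SENTENCE> block per sample, containing a <TEXT>
# element and zero or more <MISTAKE> elements.
SENTENCE_RE = re.compile(r"<SENTENCE>(.*?)</SENTENCE>", re.S)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.S)
MISTAKE_RE = re.compile(
    r"<LOCATION>\s*(\d+)\s*</LOCATION>\s*"
    r"<WRONG>(.*?)</WRONG>\s*"
    r"<CORRECTION>(.*?)</CORRECTION>",
    re.S,
)


def extract_pairs(sgml: str) -> List[Tuple[str, str]]:
    """Return (sentence with errors, corrected sentence) pairs."""
    pairs = []
    for block in SENTENCE_RE.findall(sgml):
        text_match = TEXT_RE.search(block)
        if text_match is None:
            continue  # skip blocks without a <TEXT> element
        wrong = text_match.group(1).strip()
        chars = list(wrong)
        for location, _wrong_char, correction in MISTAKE_RE.findall(block):
            index = int(location) - 1  # assumption: LOCATION is 1-based
            if 0 <= index < len(chars):
                chars[index] = correction.strip()
        pairs.append((wrong, "".join(chars)))
    return pairs


# Made-up sample in the assumed format.
sample = """
<SENTENCE>
<TEXT>我喜欢因乐</TEXT>
<MISTAKE>
<LOCATION>4</LOCATION>
<WRONG>因</WRONG>
<CORRECTION>音</CORRECTION>
</MISTAKE>
</SENTENCE>
"""
print(extract_pairs(sample))
```

For the real file, read it with `open("train.sgml", encoding="utf-8").read()` and pass the contents to `extract_pairs`; the output format expected by `run.sh` is not covered here.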
2. Run the models:
- For BERT and Soft-Masked BERT:
  - Set up a virtual environment for BERT and Soft-Masked BERT (python==3.7, torch==1.4.0, transformers==3.4.0) using Anaconda:
      conda create -n bert python=3.7.9
      conda activate bert
      pip install torch==1.4.0
      pip install transformers==3.4.0
  - Go to the "scripts" directory and set your own parameters (such as the paths to the initial model and the data):
      cd scripts
      vim run.sh
  - Run the script:
      bash run.sh
- For SpellGCN:
  - Set up a virtual environment for SpellGCN (python==2.7, Tensorflow==1.13.1) using Anaconda:
      conda create -n spellgcn python=2.7.1
      source activate spellgcn
      pip install tensorflow==1.13.1
  - Go to the "scripts" directory and set your own parameters (such as the paths to BERT and the initial model):
      cd scripts
      vim run.sh
  - Run the script:
      bash run.sh
3. Alternatively, download the models you need and initialize your models from them.
- Baidu Wangpan:
  - Link: https://pan.baidu.com/s/1O9mLjWSiXzxcPBy0fU-_BQ (extraction code: y25e)
Contact
chongli17@fudan.edu.cn and cenyuanzhang17@fudan.edu.cn
How to cite our paper?
@inproceedings{li-etal-2021-2Ways,
author = {Chong Li and
Cenyuan Zhang and
Xiaoqing Zheng and
Xuanjing Huang},
title="Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models",
booktitle="Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing",
publisher = "Association for Computational Linguistics",
year="2021"
}