ReaLiSe

ReaLiSe is a multi-modal Chinese spell checking model.

This is the official code for the paper Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking.

The paper has been accepted to ACL Findings 2021.

<img src="assets/model.jpg" width="65%">

Environment

Data

Raw Data

SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation

Data Processing

The code and cleaned data are in the data_process directory.

You can also download the processed data directly from this link and put it in the data directory. The data directory should then look like this:

data
|- trainall.times2.pkl
|- test.sighan15.pkl
|- test.sighan15.lbl.tsv
|- test.sighan14.pkl
|- test.sighan14.lbl.tsv
|- test.sighan13.pkl
|- test.sighan13.lbl.tsv
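
Before training, it can help to confirm that the data directory matches the listing above. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the `data` path is assumed to be relative to the directory you run it from.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_DATA_FILES = [
    "trainall.times2.pkl",
    "test.sighan15.pkl",
    "test.sighan15.lbl.tsv",
    "test.sighan14.pkl",
    "test.sighan14.lbl.tsv",
    "test.sighan13.pkl",
    "test.sighan13.lbl.tsv",
]


def missing_files(directory, expected=EXPECTED_DATA_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumption: the script is run from the repository root.
    missing = missing_files("data")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```

An empty result means every file in the listing was found; any names returned still need to be downloaded or generated with the code in data_process.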

Pretrain

You can also download the pretrained and merged BERT, Phonetic Encoder, and Graphic Encoder directly from this link and put them in the pretrained directory:

pretrained
|- pytorch_model.bin
|- vocab.txt
|- config.json
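
A quick consistency check on the pretrained directory might look like the sketch below. It rests on two assumptions that the text above does not confirm: that config.json carries a BERT-style `vocab_size` field, and that vocab.txt holds one token per line. The function name `check_pretrained_dir` is hypothetical.

```python
import json
from pathlib import Path


def check_pretrained_dir(directory):
    """Return a list of human-readable problems found in `directory`."""
    root = Path(directory)
    problems = []
    # File names copied from the directory listing above.
    for name in ("pytorch_model.bin", "vocab.txt", "config.json"):
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    if problems:
        return problems

    # Assumption: config.json has a BERT-style "vocab_size" field.
    config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    vocab_size = config.get("vocab_size")

    # Assumption: vocab.txt lists one token per line.
    with open(root / "vocab.txt", encoding="utf-8") as f:
        n_tokens = sum(1 for _ in f)

    if vocab_size is not None and vocab_size != n_tokens:
        problems.append(
            f"config.json vocab_size={vocab_size} but vocab.txt has {n_tokens} lines"
        )
    return problems
```

Calling `check_pretrained_dir("pretrained")` returns an empty list when all three files exist and the two vocabulary sizes agree. If config.json has no `vocab_size` field, the size comparison is skipped.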

Train

After preparing the data and the pretrained model, you can train ReaLiSe by executing the train.sh script. Before running it, set PRETRAINED_DIR, DATE_DIR, and OUTPUT_DIR in the script.

sh train.sh

Test

Test ReaLiSe using the test.sh script. Before running it, set DATE_DIR, CKPT_DIR, and OUTPUT_DIR in the script. CKPT_DIR should point to the OUTPUT_DIR used during training.

sh test.sh

Well-trained Model

You can also download a well-trained model from this link and use it directly. The performance scores of ReaLiSe and several baseline models on the SIGHAN13, SIGHAN14, and SIGHAN15 test sets are listed below:

Methods

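Metrics: "D" and "C" denote detection level and correction level; "A", "P", "R", and "F" denote accuracy, precision, recall, and F1, respectively.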

SIGHAN15

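| Method | D-A | D-P | D-R | D-F | C-A | C-P | C-R | C-F |
|---|---|---|---|---|---|---|---|---|
| FASpell | 74.2 | 67.6 | 60.0 | 63.5 | 73.7 | 66.6 | 59.1 | 62.6 |
| Soft-Masked BERT | 80.9 | 73.7 | 73.2 | 73.5 | 77.4 | 66.7 | 66.2 | 66.4 |
| SpellGCN | - | 74.8 | 80.7 | 77.7 | - | 72.1 | 77.7 | 75.9 |
| BERT | 82.4 | 74.2 | 78.0 | 76.1 | 81.0 | 71.6 | 75.3 | 73.4 |
| ReaLiSe | 84.7 | 77.3 | 81.3 | 79.3 | 84.0 | 75.9 | 79.9 | 77.8 |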

SIGHAN14

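| Method | D-A | D-P | D-R | D-F | C-A | C-P | C-R | C-F |
|---|---|---|---|---|---|---|---|---|
| Pointer Network | - | 63.2 | 82.5 | 71.6 | - | 79.3 | 68.9 | 73.7 |
| SpellGCN | - | 65.1 | 69.5 | 67.2 | - | 63.1 | 67.2 | 65.3 |
| BERT | 75.7 | 64.5 | 68.6 | 66.5 | 74.6 | 62.4 | 66.3 | 64.3 |
| ReaLiSe | 78.4 | 67.8 | 71.5 | 69.6 | 77.7 | 66.3 | 70.0 | 68.1 |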

SIGHAN13

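| Method | D-A | D-P | D-R | D-F | C-A | C-P | C-R | C-F |
|---|---|---|---|---|---|---|---|---|
| FASpell | 63.1 | 76.2 | 63.2 | 69.1 | 60.5 | 73.1 | 60.5 | 66.2 |
| SpellGCN | 78.8 | 85.7 | 78.8 | 82.1 | 77.8 | 84.6 | 77.8 | 81.0 |
| BERT | 77.0 | 85.0 | 77.0 | 80.8 | 77.4 | 83.0 | 75.2 | 78.9 |
| ReaLiSe | 82.7 | 88.6 | 82.5 | 85.4 | 81.4 | 87.2 | 81.2 | 84.1 |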
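
Assuming the D-F and C-F columns report the standard F1 score (the harmonic mean of precision and recall), each F value can be approximately recovered from the P and R values in the same row. The sketch below is illustrative only and is not the repository's evaluation code; `f1_score` is a hypothetical helper name.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ReaLiSe on SIGHAN15, correction level: C-P = 75.9, C-R = 79.9.
print(round(f1_score(75.9, 79.9), 1))  # 77.8, matching the C-F column
```

Because the tables round precision and recall to one decimal place, a recomputed value can differ from the reported one by 0.1 in some rows. For example, the ReaLiSe detection-level row on SIGHAN15 recomputes to 79.2 against a reported 79.3.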

Citation

@misc{xu2021read,
      title={Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking}, 
      author={Heng-Da Xu and Zhongli Li and Qingyu Zhou and Chao Li and Zizhen Wang and Yunbo Cao and Heyan Huang and Xian-Ling Mao},
      year={2021},
      eprint={2105.12306},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}