cLang-8 Dataset
cLang-8 (“cleaned Lang-8”) is a dataset for grammatical error correction (GEC). The source sentences originate from the popular NAIST Lang-8 Learner Corpora, while the target sentences are generated by our state-of-the-art GEC method called gT5. The method is described in our ACL-IJCNLP 2021 paper.
The paper shows that fine-tuning a T5-11B model on cLang-8 yields SOTA performance on GEC for English. cLang-8 thus simplifies a typical GEC training pipeline consisting of multiple fine-tuning stages.
Dataset Preparation
cLang-8 is generated by combining the target sentences found under the targets/
directory of this repository with the source sentences from the original Lang-8
corpus, which has to be downloaded separately. Specifically, you need to complete
the following steps:
- Install Git Large File Storage (if not already installed) and clone this repository.
- Fill out this form, after which you will receive an email with a link to “the raw format containing all the data up to 2010”.
- Follow the link to download a zip file and extract it.
- Update the LANG8_DIR variable in run.sh to point to the resulting extracted directory.
- Run the command sh run.sh, which will install the required Python 3 dependencies in a virtualenv and align the source and the target sentences.
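Before starting the run, it can help to confirm that the prerequisites from the steps above are in place. The following is a minimal sketch and not part of this repository: the function name check_prerequisites and the example path are hypothetical, and it only checks that a git-lfs executable is on your PATH and that the directory you plan to use for LANG8_DIR exists.

```python
import shutil
from pathlib import Path


def check_prerequisites(lang8_dir: str) -> list:
    """Return a list of human-readable problems; the list is empty if none were found."""
    problems = []
    # Git LFS ships a `git-lfs` executable that `git lfs ...` dispatches to.
    if shutil.which("git-lfs") is None:
        problems.append("git-lfs was not found on PATH")
    # LANG8_DIR should point at the directory extracted from the downloaded zip file.
    if not Path(lang8_dir).is_dir():
        problems.append(f"{lang8_dir!r} is not an existing directory")
    return problems


if __name__ == "__main__":
    # Hypothetical path; replace it with your own extraction directory.
    for problem in check_prerequisites("/path/to/extracted/lang-8"):
        print("Problem:", problem)
```

An empty result only means these two checks passed; it does not validate the contents of the extracted corpus.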
NB: Running the above script takes about 1 hour when spaCy tokenization is enabled. Enabling it is recommended because it makes the tokenization consistent with the CoNLL-14 and BEA eval sets (see also the next section).
Tokenization Post-Processing for CoNLL-14
After training a model and computing predictions on the CoNLL-14 test set for
the paper, we ran some post-processing steps found in retokenize.py to fix
tokenization discrepancies. This improves the F0.5 scores by about 2.5 points
(for T5 xxl).
You may instead want to try applying the post-processing steps to cLang-8 targets before training a model.
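For readers less familiar with the metric, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below shows only the standard formula. It is not the evaluation script used for the paper, and the precision and recall values in the example are made up for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when both inputs are zero
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Illustrative values only, not results from the paper.
print(f"F0.5 = {f_beta(0.6, 0.4):.3f}")
```

Because beta is below 1, a gain in precision moves the score more than an equal gain in recall.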
Data Format
The resulting cLang-8 data files will be saved under the ./output_data/
directory as TSV files with a single tab-separated (source, target) pair per
line. Three separate TSV files will be generated, one for each of the following
languages:
Language | Number of examples |
---|---|
English | 2,372,119 |
German | 114,405 |
Russian | 44,830 |
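A file in this format can be read with a few lines of Python. This is a minimal sketch, not code from this repository: the function name read_pairs is hypothetical, UTF-8 encoding is an assumption, and lines that do not split into exactly two fields are skipped rather than reported.

```python
from typing import Iterator, Tuple


def read_pairs(tsv_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from a file with one tab-separated pair per line."""
    # UTF-8 is an assumption; adjust the encoding if your files differ.
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue  # skip lines that do not hold exactly one pair
            yield parts[0], parts[1]
```

To use it, pass the path of one of the generated files under ./output_data/ and iterate over the result, for example `for source, target in read_pairs(path): ...`. Since the function is a generator, the file is read lazily and the full English set does not have to fit in memory.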
How to Cite cLang-8
Please cite the following works if you use cLang-8:
@inproceedings{rothe2021a,
title = {{A Simple Recipe for Multilingual Grammatical Error Correction}},
author = {Rothe, Sascha and Mallinson, Jonathan and Malmi, Eric and Krause, Sebastian and Severyn, Aliaksei},
booktitle = {Proc. of ACL-IJCNLP},
year = {2021}
}
@inproceedings{mizumoto2011mining,
title={{Mining revision log of language learning SNS for automated Japanese error correction of second language learners}},
author={Mizumoto, Tomoya and Komachi, Mamoru and Nagata, Masaaki and Matsumoto, Yuji},
booktitle={Proc. of 5th International Joint Conference on Natural Language Processing},
pages={147--155},
year={2011}
}
License
Similar to the original Lang-8 corpus, cLang-8 is distributed for research and educational purposes only. Specifically, cLang-8 is released under the CC BY-NC-SA 4.0 license.
The code is distributed under the Apache 2.0 license.
Contact Us
If you have a technical question regarding the dataset, code, or publication, please create an issue in this repository.