JGLUE: Japanese General Language Understanding Evaluation
JGLUE, Japanese General Language Understanding Evaluation, is built to measure the general NLU ability in Japanese. JGLUE has been constructed from scratch without translation. We hope that JGLUE will facilitate NLU research in Japanese.
JGLUE has been constructed by a joint research project of Yahoo Japan Corporation and Kawahara Lab at Waseda University.
Tasks/Datasets
JGLUE consists of the tasks of text classification, sentence pair classification, and QA. Each task consists of multiple datasets. Each dataset can be found under the datasets directory. Only train/dev sets are available now, and the test set will be available after the leaderboard is made public. We use Yahoo! Crowdsourcing for all crowdsourcing tasks in constructing the datasets.
Task | Dataset | Train | Dev | Test |
---|---|---|---|---|
Text Classification | MARC-ja | 187,528 | 5,654 | 5,639 |
Text Classification | JCoLA† | - | - | - |
Sentence Pair Classification | JSTS | 12,451 | 1,457 | 1,589 |
Sentence Pair Classification | JNLI | 20,073 | 2,434 | 2,508 |
QA | JSQuAD | 62,859 | 4,442 | 4,420 |
QA | JCommonsenseQA | 8,939 | 1,119 | 1,118 |
†JCoLA will be added soon.
Dataset Description
(The task guidelines and user interface screenshots used for constructing data are presented in task_guidelines.md.)
MARC-ja
MARC-ja is a dataset of the text classification task. This dataset is based on the Japanese portion of Multilingual Amazon Reviews Corpus (MARC) (Keung+, 2020).
We performed the following modifications to the original dataset:
- To make it easy for both humans and computers to judge a class label, we cast the text classification task as a binary classification task, where 1 and 2-star ratings are converted to negative, and 4 and 5 are converted to positive. We do not use reviews with a 3-star rating.
- There are some instances where the rating diverges from a review text. To improve the quality of the dev/test instances, we crowdsource a positive/negative judgment task, adopt only the reviews with the same votes from seven or more out of 10 workers and assign a label of the maximum votes to these reviews.
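The two modifications above can be illustrated with a minimal Python sketch. It is not the logic of marc-ja.py; the function names `star_to_label` and `aggregate_votes` are hypothetical and only restate the rules described in the list.

```python
from collections import Counter
from typing import Optional, Sequence


def star_to_label(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a binary label; 3-star reviews are dropped."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # 3-star reviews are not used


def aggregate_votes(votes: Sequence[str], min_agreement: int = 7) -> Optional[str]:
    """Return the majority label if at least `min_agreement` workers agree.

    Returns None when the agreement threshold is not met, meaning the
    review would be discarded from the dev/test instances.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


# Ten hypothetical crowdworker votes for one review.
votes = ["positive"] * 8 + ["negative"] * 2
adopted_label = aggregate_votes(votes)
```

With the threshold of seven out of 10 workers, the example review above is adopted with the label positive, while a 6-to-4 split would be discarded.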
We don't distribute the dataset itself. Please download the original dataset, and run a conversion script as follows:
- Download https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
- Run the following commands:
$ pip install -r preprocess/requirements.txt
$ cd preprocess/marc-ja/scripts
$ gzip -dc /somewhere/amazon_reviews_multilingual_JP_v1_00.tsv.gz | \
python marc-ja.py \
--positive-negative \
--output-dir ../../../datasets/marc_ja-v1.1 \
--max-char-length 500 \
--filter-review-id-list-valid ../data/filter_review_id_list/valid.txt \
--label-conv-review-id-list-valid ../data/label_conv_review_id_list/valid.txt
The train and valid sets will be generated under the datasets/marc_ja-v1.1 directory.
When you use this dataset, please follow the license of Multilingual Amazon Reviews Corpus (MARC).
JSTS
JSTS is a Japanese version of the STS (Semantic Textual Similarity) dataset. STS is a task to estimate the semantic similarity of a sentence pair. The sentences in JSTS and JNLI (described below) are extracted from the Japanese version of the MS COCO Caption Dataset, the YJ Captions Dataset (Miyazaki and Shimizu, 2016).
{"sentence_pair_id": "691",
"yjcaptions_id": "127202-129817-129818",
"sentence1": "街中の道路を大きなバスが走っています。 (A big bus is running on the road in the city.)",
"sentence2": "道路を大きなバスが走っています。 (There is a big bus running on the road.)",
"label": 4.4}
(Note that English translations are added in this example for those who do not understand Japanese, and are not included in the dataset.)
Name | Description |
---|---|
sentence_pair_id | id |
yjcaptions_id | sentence ids in yjcaptions (explained below) |
sentence1 | first sentence |
sentence2 | second sentence |
label | sentence similarity, from 0 (completely different meaning) to 5 (equivalent meaning) |
Explanation for yjcaptions_id
There are the following two cases:
- sentence pairs in one image:
(image id)-(sentence1 id)-(sentence2 id)
- e.g., 723-844-847
- a sentence id starting with "g" means a sentence generated by a crowdworker (e.g., 69501-75698-g103): only for JNLI
- sentence pairs in two images:
(image id of sentence1)_(image id of sentence2)-(sentence1 id)-(sentence2 id)
- e.g., 91337_217583-96105-91680
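As an illustration of the two id layouts, the following Python sketch splits a yjcaptions_id into its parts. The `YJCaptionsId` type and `parse_yjcaptions_id` function are hypothetical names introduced here, and the parsing assumes that ids contain no hyphens or underscores other than the delimiters described above.

```python
from typing import NamedTuple, Tuple


class YJCaptionsId(NamedTuple):
    image_ids: Tuple[str, ...]    # one id (same image) or two ids (two images)
    sentence1_id: str
    sentence2_id: str
    generated: Tuple[bool, bool]  # True where the sentence id starts with "g"


def parse_yjcaptions_id(value: str) -> YJCaptionsId:
    """Split '(image part)-(sentence1 id)-(sentence2 id)' into its fields."""
    parts = value.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three '-'-separated fields: {value!r}")
    image_part, sentence1_id, sentence2_id = parts
    image_ids = tuple(image_part.split("_"))
    if len(image_ids) not in (1, 2):
        raise ValueError(f"expected one or two image ids: {value!r}")
    generated = (sentence1_id.startswith("g"), sentence2_id.startswith("g"))
    return YJCaptionsId(image_ids, sentence1_id, sentence2_id, generated)


same_image = parse_yjcaptions_id("723-844-847")
two_images = parse_yjcaptions_id("91337_217583-96105-91680")
crowd_generated = parse_yjcaptions_id("69501-75698-g103")
```

The length of `image_ids` distinguishes the two cases, and `generated` flags crowdworker-written sentences, which occur only in JNLI.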
JNLI
JNLI is a Japanese version of the NLI (Natural Language Inference) dataset. NLI is a task to recognize the inference relation that a premise sentence has to a hypothesis sentence. The inference relations are entailment, contradiction, and neutral.
{"sentence_pair_id": "1157",
"yjcaptions_id": "127202-129817-129818",
"sentence1": "街中の道路を大きなバスが走っています。 (A big bus is running on the road in the city.)",
"sentence2": "道路を大きなバスが走っています。 (There is a big bus running on the road.)",
"label": "entailment"}
Name | Description |
---|---|
sentence_pair_id | id |
yjcaptions_id | sentence ids in yjcaptions |
sentence1 | premise sentence |
sentence2 | hypothesis sentence |
label | inference relation |
JSQuAD
JSQuAD is a Japanese version of SQuAD (Rajpurkar+, 2016), a reading comprehension dataset. Each instance in the dataset consists of a question regarding a given context (Wikipedia article) and its answer. JSQuAD is based on SQuAD 1.1 (there are no unanswerable questions). We used the Japanese Wikipedia dump as of 20211101.
The JSON format is the same as that of the original SQuAD.
{
"title": "東海道新幹線 (Tokaido Shinkansen)",
"paragraphs": [
{
"qas": [
{
"question": "2020年(令和2年)3月現在、東京駅 - 新大阪駅間の最高速度はどのくらいか。 (What is the maximum speed between Tokyo Station and Shin-Osaka Station as of March 2020?)",
"id": "a1531320p0q0",
"answers": [
{
"text": "285 km/h",
"answer_start": 182
}
],
"is_impossible": false
},
{
..
}
],
"context": "東海道新幹線 [SEP] 1987年(昭和62年)4月1日の国鉄分割民営化により、JR東海が運営を継承した。西日本旅客鉄道(JR西日本)が継承した山陽新幹線とは相互乗り入れが行われており、東海道新幹線区間のみで運転される列車にもJR西日本所有の車両が使用されることがある。2020年(令和2年)3月現在、東京駅 - 新大阪駅間の所要時間は最速2時間21分、最高速度285 km/hで運行されている。"
}
]
}
Name | Description |
---|---|
title | title of a Wikipedia article |
paragraphs | a set of paragraphs |
qas | a set of pairs of a question and its answer |
question | question |
id | id of a question |
answers | a set of answers |
text | answer text |
answer_start | start position (character index) |
is_impossible | all the values are false |
context | a concatenation of the title and paragraph |
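The field layout can be exercised with a short Python sketch that walks one article, splits the context into title and paragraph, and checks each answer span. It assumes that answer_start is a character offset into the context string (as in the original SQuAD) and that the title and paragraph are joined by " [SEP] " as in the example; the helper names and the shortened toy article are hypothetical.

```python
from typing import Dict, Iterator, Tuple

SEP = " [SEP] "  # separator seen in the example context (assumed to be fixed)


def split_context(context: str) -> Tuple[str, str]:
    """Split a context into (title, paragraph); paragraph is "" if SEP is absent."""
    title, _, paragraph = context.partition(SEP)
    return title, paragraph


def answer_span_matches(context: str, text: str, answer_start: int) -> bool:
    """Check that the answer text occurs in the context at answer_start."""
    return context[answer_start:answer_start + len(text)] == text


def iter_answers(article: Dict) -> Iterator[Tuple[str, str, str, int]]:
    """Yield (question id, context, answer text, answer_start) for each answer."""
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                yield qa["id"], paragraph["context"], answer["text"], answer["answer_start"]


# Toy article in the same shape as the example (shortened, hypothetical data).
context = "東海道新幹線 [SEP] 最高速度285 km/hで運行されている。"
article = {
    "title": "東海道新幹線",
    "paragraphs": [{
        "qas": [{
            "question": "最高速度はどのくらいか。",
            "id": "toy-q0",
            "answers": [{"text": "285 km/h",
                         "answer_start": context.index("285 km/h")}],
            "is_impossible": False,
        }],
        "context": context,
    }],
}

# Collect the ids of questions whose answer span does not line up.
mismatches = [qid for qid, ctx, text, start in iter_answers(article)
              if not answer_span_matches(ctx, text, start)]
```

A check of this kind is useful before fine-tuning, because a span that does not line up with the context cannot be recovered by an extractive model.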
JCommonsenseQA
JCommonsenseQA is a Japanese version of CommonsenseQA (Talmor+, 2019), which is a multiple-choice question answering dataset that requires commonsense reasoning ability. It is built using crowdsourcing with seeds extracted from the knowledge base ConceptNet.
{"q_id": 3016,
"question": "会社の最高責任者を何というか? (What do you call the chief executive officer of a company?)",
"choice0": "社長 (president)",
"choice1": "教師 (teacher)",
"choice2": "部長 (manager)",
"choice3": "バイト (part-time worker)",
"choice4": "部下 (subordinate)",
"label": 0}
Name | Description |
---|---|
q_id | id |
question | question |
choice{0..4} | choice |
label | correct choice id |
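To show how the label indexes into the choice{0..4} fields, here is a minimal Python sketch that looks up the gold answer text and computes accuracy over predicted labels. The helper names `gold_choice` and `accuracy` are hypothetical, and the record below is the example above with the English glosses removed.

```python
from typing import Dict, Sequence


def gold_choice(example: Dict) -> str:
    """Return the text of the correct choice, i.e. choice{label}."""
    return example[f"choice{example['label']}"]


def accuracy(predicted_labels: Sequence[int], examples: Sequence[Dict]) -> float:
    """Fraction of examples whose predicted choice id equals the gold label."""
    if not examples or len(predicted_labels) != len(examples):
        raise ValueError("predictions and examples must be non-empty and aligned")
    correct = sum(int(pred == ex["label"])
                  for pred, ex in zip(predicted_labels, examples))
    return correct / len(examples)


example = {
    "q_id": 3016,
    "question": "会社の最高責任者を何というか?",
    "choice0": "社長",
    "choice1": "教師",
    "choice2": "部長",
    "choice3": "バイト",
    "choice4": "部下",
    "label": 0,
}

answer_text = gold_choice(example)
score = accuracy([0, 2], [example, example])  # one correct, one wrong
```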
Baseline Scores
The following foundation models are used for the evaluation.
Model | Basic Unit | Pretraining Texts |
---|---|---|
Tohoku BERT base | subword<br>(MeCab + BPE) | Japanese Wikipedia |
Tohoku BERT base (char) | character | Japanese Wikipedia |
NICT BERT base | subword<br>(MeCab + BPE) | Japanese Wikipedia |
Waseda RoBERTa base | subword<br>(Juman++ + Unigram LM) | Japanese Wikipedia + CC |
XLM RoBERTa base | subword<br>(Unigram LM) | multi-lingual CC |
The large counterparts of Tohoku BERT base, Waseda RoBERTa base, and XLM RoBERTa base are also evaluated. For Waseda RoBERTa large, the following two versions with different maximum sequence lengths are used: Waseda RoBERTa large (s128) and Waseda RoBERTa large (s512).
When you use the NICT BERT base or Waseda RoBERTa base models, the dataset text should be segmented into words in advance by the corresponding morphological analyzer (MeCab for NICT BERT base and Juman++ for Waseda RoBERTa base, as listed in the table above).
Please refer to preprocess/morphological-analysis/README.md.
The fine-tuning was performed using the transformers library provided by Hugging Face. See fine-tuning/README.md for details.
The performance along with human scores on the JGLUE dev set is shown below.
Model | MARC-ja | JSTS | JNLI | JSQuAD | JCommonsenseQA |
---|---|---|---|---|---|
Metric | acc | Pearson/Spearman | acc | EM/F1 | acc |
Human | 0.989 | 0.899/0.861 | 0.925 | 0.871/0.944 | 0.986 |
Tohoku BERT base | 0.958 | 0.909/0.868 | 0.899 | 0.871/0.941 | 0.808 |
Tohoku BERT base (char) | 0.956 | 0.893/0.851 | 0.892 | 0.864/0.937 | 0.718 |
Tohoku BERT large | 0.955 | 0.913/0.872 | 0.900 | 0.880/0.946 | 0.816 |
NICT BERT base | 0.958 | 0.910/0.871 | 0.902 | 0.897/0.947 | 0.823 |
Waseda RoBERTa base | 0.962 | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
Waseda RoBERTa large (s128) | 0.954 | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
Waseda RoBERTa large (s512) | 0.961 | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
XLM RoBERTa base | 0.961 | 0.877/0.831 | 0.893 | -/-† | 0.687 |
XLM RoBERTa large | 0.964 | 0.918/0.884 | 0.919 | -/-† | 0.840 |
†XLM RoBERTa base/large models use the unigram language model as a tokenizer and they are excluded from the JSQuAD evaluation because the token delimitation and the start/end of the answer span often do not match, resulting in poor performance.
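For readers unfamiliar with the JSTS metrics, the sketch below computes Pearson and Spearman correlation in plain Python. It is a dependency-free illustration and not the evaluation code behind the scores above; the function names and the sample scores are made up for the example.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold similarity labels and model predictions.
gold = [4.4, 0.0, 2.5, 3.0]
predicted = [4.0, 0.5, 2.0, 3.5]
r = pearson(gold, predicted)
rho = spearman(gold, predicted)
```

Pearson measures linear agreement with the gold similarity scores, while Spearman depends only on the ranking, which is why the two values in the table can differ.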
Leaderboard
A leaderboard will be made public soon. The test set will be released at that time.
Reference
@article{栗原 健太郎2023,
title={JGLUE: 日本語言語理解ベンチマーク},
author={栗原 健太郎 and 河原 大輔 and 柴田 知秀},
journal={自然言語処理},
volume={30},
number={1},
pages={63-87},
year={2023},
url = "https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_article/-char/ja",
doi={10.5715/jnlp.30.63}
}
@inproceedings{kurihara-etal-2022-jglue,
title = "{JGLUE}: {J}apanese General Language Understanding Evaluation",
author = "Kurihara, Kentaro and
Kawahara, Daisuke and
Shibata, Tomohide",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.317",
pages = "2957--2966",
abstract = "To develop high-performance natural language understanding (NLU) models, it is necessary to have a benchmark to evaluate and analyze NLU ability from various perspectives. While the English NLU benchmark, GLUE, has been the forerunner, benchmarks are now being released for languages other than English, such as CLUE for Chinese and FLUE for French; but there is no such benchmark for Japanese. We build a Japanese NLU benchmark, JGLUE, from scratch without translation to measure the general NLU ability in Japanese. We hope that JGLUE will facilitate NLU research in Japanese.",
}
@InProceedings{Kurihara_nlp2022,
author = "栗原健太郎 and 河原大輔 and 柴田知秀",
title = "JGLUE: 日本語言語理解ベンチマーク",
booktitle = "言語処理学会第28回年次大会",
year = "2022",
url = "https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E8-4.pdf",
note = "in Japanese"
}
License
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />
Contributor License Agreement
This project requires contributors to accept the terms in the Contributor License Agreement (CLA).
Please note that contributors to the JGLUE repository on GitHub (https://github.com/yahoojapan/JGLUE) shall be deemed to have accepted the CLA without individual written agreements.