C<sup>3</sup>

Overview

This repository maintains C<sup>3</sup>, the first free-form multiple-Choice Chinese machine reading Comprehension dataset.

@article{sun2019investigating,
  title={Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension},
  author={Sun, Kai and Yu, Dian and Yu, Dong and Cardie, Claire},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
  url={https://arxiv.org/abs/1904.09679v3}
}

Each dataset file in this repository is a JSON array with the following structure:

[
  [
    [
      document 1
    ],
    [
      {
        "question": document 1 / question 1,
        "choice": [
          document 1 / question 1 / answer option 1,
          document 1 / question 1 / answer option 2,
          ...
        ],
        "answer": document 1 / question 1 / correct answer option
      },
      {
        "question": document 1 / question 2,
        "choice": [
          document 1 / question 2 / answer option 1,
          document 1 / question 2 / answer option 2,
          ...
        ],
        "answer": document 1 / question 2 / correct answer option
      },
      ...
    ],
    document 1 / id
  ],
  [
    [
      document 2
    ],
    [
      {
        "question": document 2 / question 1,
        "choice": [
          document 2 / question 1 / answer option 1,
          document 2 / question 1 / answer option 2,
          ...
        ],
        "answer": document 2 / question 1 / correct answer option
      },
      {
        "question": document 2 / question 2,
        "choice": [
          document 2 / question 2 / answer option 1,
          document 2 / question 2 / answer option 2,
          ...
        ],
        "answer": document 2 / question 2 / correct answer option
      },
      ...
    ],
    document 2 / id
  ],
  ...
]
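The structure above can be read directly with Python's standard json module. The snippet below is a minimal sketch that builds a small in-memory example matching the documented layout (the passage text, question strings, and id are placeholders, not real data) and iterates over it:

```python
import json

# Toy data laid out exactly as the schema above describes:
# [ [passage segment(s)], [question objects], document id ]
# All strings here are illustrative placeholders.
sample = json.dumps([
    [
        ["document 1 text"],                   # passage segment(s)
        [
            {
                "question": "question 1",
                "choice": ["option A", "option B"],
                "answer": "option A",          # correct option string
            },
        ],
        "doc-1",                               # document id
    ],
])

# Each top-level entry unpacks into passage, question list, and id.
for passage, questions, doc_id in json.loads(sample):
    for q in questions:
        # The answer is always one of the listed choices.
        assert q["answer"] in q["choice"]
```

For a real file, replace the `json.loads(sample)` call with `json.load(open(path, encoding="utf-8"))` on the dataset file you downloaded.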
<table>
  <tr> <th></th> <th>Abbreviation</th> <th>Question Type</th> </tr>
  <tr> <td rowspan="1">Matching</td> <td>m</td> <td>Matching</td> </tr>
  <tr> <td rowspan="10">Prior knowledge</td> <td>l</td> <td>Linguistic</td> </tr>
  <tr> <td>s</td> <td>Domain-specific</td> </tr>
  <tr> <td>c-a</td> <td>Arithmetic</td> </tr>
  <tr> <td>c-o</td> <td>Connotation</td> </tr>
  <tr> <td>c-e</td> <td>Cause-effect</td> </tr>
  <tr> <td>c-i</td> <td>Implication</td> </tr>
  <tr> <td>c-p</td> <td>Part-whole</td> </tr>
  <tr> <td>c-d</td> <td>Precondition</td> </tr>
  <tr> <td>c-h</td> <td>Scenario</td> </tr>
  <tr> <td>c-n</td> <td>Other</td> </tr>
  <tr> <td rowspan="3">Supporting Sentences</td> <td>0</td> <td>Single sentence</td> </tr>
  <tr> <td>1</td> <td>Multiple sentences</td> </tr>
  <tr> <td>2</td> <td>Independent</td> </tr>
</table>

Note:

  1. Fine-tuning Chinese BERT-wwm or BERT-wwm-ext follows the same steps, except that you download their respective pre-trained language models instead.
  2. Model training involves randomness, so you may want to run training multiple times with different seeds (specify --seed when executing run_classifier.py) and choose the best model based on development set performance.
  3. Depending on your hardware, you may need to change gradient_accumulation_steps.
  4. The code has been tested with Python 3.6 and PyTorch 1.0.
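Regarding note 3: gradient accumulation runs several forward/backward passes before each optimizer update, so the effective batch size is the per-step batch size times gradient_accumulation_steps. The sketch below illustrates the arithmetic with hypothetical numbers (not values taken from this repository):

```python
def effective_batch_size(per_step_batch: int, accumulation_steps: int) -> int:
    """Gradients are summed over `accumulation_steps` passes before each
    optimizer update, so the effective batch is the product of the two."""
    return per_step_batch * accumulation_steps

# If a batch of 24 does not fit in GPU memory, the same effective batch
# can be kept by shrinking the per-step batch and raising
# gradient_accumulation_steps proportionally (numbers are illustrative):
assert effective_batch_size(24, 1) == effective_batch_size(6, 4) == 24
```

Keeping the effective batch size constant this way preserves the training dynamics while lowering peak memory use.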