MLEC-QA

This repository contains the data and baseline code for the EMNLP 2021 (The 2021 Conference on Empirical Methods in Natural Language Processing) paper "MLEC-QA: A Chinese Multi-Choice Biomedical Question Answering Dataset".

If you would like to use the data or code, please cite:

@inproceedings{li-etal-2021-mlec,
    title = "{MLEC-QA}: {A} {C}hinese {M}ulti-{C}hoice {B}iomedical {Q}uestion {A}nswering {D}ataset",
    author = "Li, Jing  and
      Zhong, Shangping  and
      Chen, Kaizhi",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.698",
    pages = "8862--8874",
}

MLEC-QA is a Chinese multi-choice biomedical question answering dataset. Questions in MLEC-QA are collected from the National Medical Licensing Examination in China (NMLEC) and are carefully designed by human experts to evaluate the professional knowledge and skills of those who want to become medical practitioners in China.

We hope the release of the MLEC-QA dataset can serve as a valuable resource for research and evaluation in open-domain QA, and also advance biomedical question answering systems.

Dataset

Download MLEC-QA dataset: Google Drive

MLEC-QA is composed of 5 subsets totaling 136,236 Chinese multi-choice biomedical questions, some with extra materials (images or tables) annotated by human experts, and covers several biomedical sub-fields.

The JSON dataset file format is as follows:

{
	"qid": "the question ID",
	"qtype": "the question type, one of: A1型题, B1型题, A2型题, A3/A4型题",
	"qtext": "the question text",
	"qimage": "path to the image or table, if any",
	"options": {
		"A": "text of option A",
		"B": "text of option B",
		"C": "text of option C",
		"D": "text of option D",
		"E": "text of option E"
	},
	"answer": "the correct option, one of: A, B, C, D, E"
}
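A record in this schema can be parsed and inspected with the standard json module. The question below is a hand-made illustration, not taken from the dataset:

```python
import json

# A hand-made record following the schema above (illustrative only).
record = json.loads("""
{
  "qid": "demo-0001",
  "qtype": "A1型题",
  "qtext": "下列哪项是糖尿病的典型症状?",
  "qimage": "",
  "options": {"A": "多饮", "B": "耳鸣", "C": "脱发", "D": "视物模糊", "E": "皮疹"},
  "answer": "A"
}
""")

# Print the question, its five options, and the gold answer.
print(record["qid"], record["qtype"])
print(record["qtext"])
for label, text in record["options"].items():
    print(label, text)
print("answer:", record["answer"], "->", record["options"][record["answer"]])
```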

Baselines

Install the requirements:

cd code
pip install -r requirements.txt

Control Methods

Open-Domain QA Methods

The Open-Domain QA methods consist of a two-stage retriever-reader framework:

Document Retriever

  1. Download Elasticsearch 7.10.1, Kibana 7.10.1 and run the servers locally with out-of-the-box defaults.
  2. Create an inverted index of Chinese Wikipedia dumps in Elasticsearch using wiki_zh_json2es.py.
  3. Run retriever.py.
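The steps above can be sketched as a BM25 full-text query against the local Elasticsearch server. The index name "wiki_zh" and the "text" field below are illustrative assumptions; the actual indexing and retrieval logic lives in wiki_zh_json2es.py and retriever.py:

```python
def build_query(question: str, top_k: int = 3) -> dict:
    """BM25 full-text match query over the paragraph text field."""
    return {"query": {"match": {"text": question}}, "size": top_k}

try:
    from elasticsearch import Elasticsearch  # pip install elasticsearch==7.10.1

    # Connect to the out-of-the-box local server started in step 1.
    es = Elasticsearch("http://localhost:9200")
    # Retrieve the top-k Wikipedia paragraphs for a question.
    resp = es.search(index="wiki_zh", body=build_query("糖尿病的典型症状是什么"))
    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["text"][:50])
except Exception:
    print("No local Elasticsearch server reachable; skipping live retrieval.")
```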

Document Reader

Models

The pre-trained language models used in the Open-Domain QA methods can be downloaded from Hugging Face; use the scripts in the scripts directory to convert them into a format the reader can load directly.

Usage

run_mlecqa.py [-h] [--pretrained_model_path PRETRAINED_MODEL_PATH]
                 [--output_model_path OUTPUT_MODEL_PATH]
                 [--vocab_path VOCAB_PATH] [--spm_model_path SPM_MODEL_PATH]
                 --train_path TRAIN_PATH --dev_path DEV_PATH
                 [--test_path TEST_PATH] [--config_path CONFIG_PATH]
                 [--embedding {word,word_pos,word_pos_seg,word_sinusoidalpos}]
                 [--max_seq_length MAX_SEQ_LENGTH]
                 [--relative_position_embedding]
                 [--relative_attention_buckets_num RELATIVE_ATTENTION_BUCKETS_NUM]
                 [--remove_embedding_layernorm] [--remove_attention_scale]
                 [--encoder {transformer,rnn,lstm,gru,birnn,bilstm,bigru,gatedcnn}]
                 [--mask {fully_visible,causal,causal_with_prefix}]
                 [--layernorm_positioning {pre,post}]
                 [--feed_forward {dense,gated}] [--remove_transformer_bias]
                 [--layernorm {normal,t5}] [--bidirectional]
                 [--factorized_embedding_parameterization]
                 [--parameter_sharing] [--learning_rate LEARNING_RATE]
                 [--warmup WARMUP] [--fp16] [--fp16_opt_level {O0,O1,O2,O3}]
                 [--optimizer {adamw,adafactor}]
                 [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
                 [--batch_size BATCH_SIZE] [--seq_length SEQ_LENGTH]
                 [--dropout DROPOUT] [--epochs_num EPOCHS_NUM]
                 [--report_steps REPORT_STEPS] [--seed SEED]
                 [--max_choices_num MAX_CHOICES_NUM]
                 [--tokenizer {bert,char,space}]

An example of using run_mlecqa.py:

python3 run_mlecqa.py --pretrained_model_path models/bert-base.bin \
--vocab_path models/google_zh_vocab.txt \
--train_path datasets/train.json \
--dev_path datasets/dev.json \
--test_path datasets/test.json \
--epochs_num 12 \
--batch_size 1 \
--seq_length 512 \
--max_choices_num 5 \
--learning_rate 2e-6 \
--report_steps 100

The actual batch size is --batch_size times --max_choices_num.
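This is because the reader scores each option separately, so every question expands into max_choices_num (question, option) sequences per step. With the example flags above:

```python
# Parameter names follow the run_mlecqa.py flags above.
batch_size = 1        # --batch_size
max_choices_num = 5   # --max_choices_num

# Each question contributes one sequence per option, so the model
# actually processes batch_size * max_choices_num sequences per step.
effective_batch_size = batch_size * max_choices_num
print(effective_batch_size)  # 5
```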