
Note: If you are looking for a multimodal dataset, check out our new dataset, ChiMed-VL-Instruction, with 469,441 vision-language QA pairs: https://paperswithcode.com/dataset/qilin-med-vl

This paper was presented at NeurIPS 2023, New Orleans, Louisiana. See here for the poster and slides.

Benchmarking Large Language Models on CMExam - A Comprehensive Chinese Medical Exam Dataset

Introduction

CMExam is a dataset sourced from the Chinese National Medical Licensing Examination. It consists of 60K+ multiple-choice questions and five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, comprehensive benchmarks were conducted on representative LLMs on CMExam.

<img src="https://github.com/williamliujl/CMExam/blob/main/docs/example.png" width="860" />
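For illustration, a single CMExam item can be modeled as a record carrying the question, its options, the gold answer, the explanation, and the five question-wise annotations described above. The field names below are hypothetical, chosen for readability; they are not the dataset's actual column names.

```python
# Hypothetical shape of one CMExam item; field names are illustrative,
# not the dataset's actual column names.
from dataclasses import dataclass
from typing import Dict

@dataclass
class CMExamItem:
    question: str
    options: Dict[str, str]        # option letter -> option text
    answer: str                    # gold option letter, e.g. "C"
    explanation: str = ""
    # The five question-wise annotations:
    disease_group: str = ""
    clinical_department: str = ""
    medical_discipline: str = ""
    competency_area: str = ""
    difficulty_level: int = 0

    def answer_text(self) -> str:
        """Return the text of the gold option."""
        return self.options[self.answer]

item = CMExamItem(
    question="Which vitamin deficiency causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin B1", "C": "Vitamin C",
             "D": "Vitamin D", "E": "Vitamin K"},
    answer="C",
)
print(item.answer_text())  # Vitamin C
```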

Dataset Statistics

|                          | Train         | Val           | Test          | Total         |
|--------------------------|---------------|---------------|---------------|---------------|
| Question                 | 54,497        | 6,811         | 6,811         | 68,119        |
| Vocab                    | 4,545         | 3,620         | 3,599         | 4,629         |
| Max Q tokens             | 676           | 500           | 585           | 676           |
| Max A tokens             | 5             | 5             | 5             | 5             |
| Max E tokens             | 2,999         | 2,678         | 2,680         | 2,999         |
| Avg Q tokens             | 29.78         | 30.07         | 32.63         | 30.83         |
| Avg A tokens             | 1.08          | 1.07          | 1.07          | 1.07          |
| Avg E tokens             | 186.24        | 188.95        | 201.44        | 192.21        |
| Median (Q1, Q3) Q tokens | 17 (12, 32)   | 18 (12, 32)   | 18 (12, 37)   | 18 (12, 32)   |
| Median (Q1, Q3) A tokens | 1 (1, 1)      | 1 (1, 1)      | 1 (1, 1)      | 1 (1, 1)      |
| Median (Q1, Q3) E tokens | 146 (69, 246) | 143 (65, 247) | 158 (80, 263) | 146 (69, 247) |

*Q: Question; A: Answer; E: Explanation

Annotation Characteristics

| Annotation Content   | References                                                                          | Unique values |
|----------------------|-------------------------------------------------------------------------------------|---------------|
| Disease Groups       | The 11th revision of the ICD (ICD-11)                                               | 27            |
| Clinical Departments | The Directory of Medical Institution Diagnostic and Therapeutic Categories (DMIDTC) | 36            |
| Medical Disciplines  | List of Graduate Education Disciplinary Majors (2022)                               | 7             |
| Medical Competencies | Medical Professionals                                                               | 4             |
| Difficulty Level     | Human Performance                                                                   | 5             |

Benchmarks

In addition to releasing the dataset, we conducted thorough experiments with representative LLMs and QA algorithms on CMExam.

<img src="https://github.com/williamliujl/CMExam/blob/main/docs/overall_comparison.jpg" width="860" />

Deployment

To deploy this project, run the following steps.

Environment Setup

```bash
cd src
pip install -r requirements.txt
```

Data Preprocess

```bash
cd preprocess
python generate_prompt.py
```
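The preprocessing step turns each multiple-choice item into an LLM prompt. A minimal sketch of that transformation is below; the exact template used by `generate_prompt.py` may differ, and `build_prompt` is an illustrative name, not a function from the repo.

```python
# Illustrative prompt construction for a multiple-choice item; the
# actual template in generate_prompt.py may differ.
from typing import Dict

def build_prompt(question: str, options: Dict[str, str]) -> str:
    """Format a question and its lettered options as a zero-shot prompt."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_prompt(
    "Which vitamin deficiency causes scurvy?",
    {"A": "Vitamin A", "B": "Vitamin B1", "C": "Vitamin C"},
)
print(prompt)
```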

Ptuning

```bash
cd ../ptuning
bash train.sh
bash prediction.sh
```

LoRA

```bash
cd ../LoRA
bash ./scripts/finetune.sh
bash ./scripts/infer_ori.sh
bash ./scripts/infer_sft.sh
```

Evaluation

```bash
cd ../evaluation
python evaluate_lora_results.py --csv_file_path path/to/csv/file
```
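The evaluation script reads predictions from the CSV named by `--csv_file_path` and scores them against the gold answers. The core of answer-prediction scoring is option-level accuracy, sketched below; this is a simplified stand-in, not the repo's actual implementation, which also covers additional metrics.

```python
# Simplified option-level accuracy; a stand-in for the repo's
# evaluation logic, not its actual implementation.
from typing import Sequence

def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of questions where the predicted option letter
    exactly matches the gold answer (case-insensitive)."""
    assert len(preds) == len(golds), "prediction/gold length mismatch"
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(preds, golds))
    return correct / len(golds)

print(accuracy(["A", "C", "b"], ["A", "D", "B"]))  # 2 of 3 correct
```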

Side notes

Limitations:

Ethics in Data Collection:

Future directions:

Citation

Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset https://arxiv.org/abs/2306.03030

```bibtex
@article{liu2023benchmarking,
  title={Benchmarking Large Language Models on CMExam--A Comprehensive Chinese Medical Exam Dataset},
  author={Liu, Junling and Zhou, Peilin and Hua, Yining and Chong, Dading and Tian, Zhongyu and Liu, Andrew and Wang, Helin and You, Chenyu and Guo, Zhenhua and Zhu, Lei and others},
  journal={arXiv preprint arXiv:2306.03030},
  year={2023}
}
```