Med-Eval
Installation

Supported environment

Clone this repository

git clone https://github.com/nii-nlp/med-eval.git
cd med-eval

Preliminary

PyTorch

We recommend the following installation command for PyTorch, since we have only verified our code with PyTorch 1.13.1 + CUDA 11.7. You can find more information on the official website.

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
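
If you want to confirm that the expected build is active before running any evaluations, a quick check like the following helps (plain PyTorch, nothing specific to this repository):

# Sanity check that the installed build matches the verified setup
# (PyTorch 1.13.1 + CUDA 11.7). Plain PyTorch, not part of med-eval.
import torch

print("torch version:", torch.__version__)       # expect 1.13.1+cu117
print("built with CUDA:", torch.version.cuda)    # expect 11.7
print("CUDA available:", torch.cuda.is_available())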

Others

pip install -r requirements.txt

Downloading datasets

# git lfs install # Make sure you have git-lfs installed (https://git-lfs.com)
git clone https://huggingface.co/datasets/Coldog2333/JMedBench
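
If you only want to peek at the data before running anything, you can also load a subset with the Hugging Face datasets library. The snippet below is just a sketch: the subset name "medmcqa" is an assumption, so check the JMedBench dataset card for the actual configuration names.

# Sketch: inspect one JMedBench subset with the `datasets` library.
# The config name "medmcqa" is an assumption; see the dataset card for the real names.
from datasets import load_dataset

ds = load_dataset("Coldog2333/JMedBench", "medmcqa")
print(ds)             # available splits and their sizes
print(ds["test"][0])  # one raw example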

Introduction

This is a submodule of the JMed-LLM repository, providing a framework similar to lm-evaluation-harness but more flexible.

lm-evaluation-harness is a widely used library for evaluating language models, especially on Multiple-Choice Question Answering (MCQA) tasks, where each option is scored by its conditional log-likelihood. However, it is not flexible enough in several cases:

  1. Evaluating one task with different templates (prompts) requires modifying the source code for each task.
  2. Evaluating on a local dataset is hard to set up.
  3. The version we were using did not support evaluation with multiple GPUs.

Considering these issues, we developed this submodule to support the evaluation of models in a more flexible way.
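
For readers unfamiliar with log-likelihood-based MCQA evaluation, here is a minimal sketch of the general idea with a plain Hugging Face causal LM: score each option by the total log-probability the model assigns to it given the question, then pick the highest-scoring option. It is a simplified illustration, not the code this repository actually runs.

# Minimal sketch of MCQA scoring via conditional log-likelihood.
# Simplified illustration only -- not the evaluation code used by med-eval.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_logprob(prompt: str, option: str) -> float:
    # Log-probability of `option` conditioned on `prompt`.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens that belong to the option continuation.
    return token_lp[:, prompt_len - 1:].sum().item()

question = "Question: Which vitamin deficiency causes scurvy?\nAnswer:"
options = [" Vitamin A", " Vitamin B12", " Vitamin C", " Vitamin D"]
scores = [option_logprob(question, opt) for opt in options]
print(options[scores.index(max(scores))])  # prediction: highest log-likelihood option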

How to run evaluation on defined tasks?

Here is an example of how to evaluate a model on the MedMCQA task with the mcqa_with_options template.

#!/bin/bash

BASE_PATH="/home/jiang/mainland/med-eval"   # Change this to your own path
export PYTHONPATH=$BASE_PATH
export TOKENIZERS_PARALLELISM=false

N_GPU=${1:-1}                               # Number of GPUs for evaluation
MASTER_PORT=${9:-2333}

model_name_or_path=${2:-"gpt2"}             # HF model name or checkpoint dir

task=${3:-"medmcqa"}                        # example: medmcqa / medmcqa,pubmedqa (evaluate multiple tasks at the same time)
template=${4:-"mcqa_with_options"}          # example: mcqa / mcqa_with_options,context_based_mcqa
batch_size=${5:-32}
num_fewshot=${6:-0}
seed=${7:-42}
model_max_length=${8:--1}

torchrun --nproc_per_node=${N_GPU} \
         --master_port $MASTER_PORT \
          "${BASE_PATH}/evaluate_mcqa.py" \
            --model_name_or_path ${model_name_or_path} \
            --task ${task} \
            --template_name ${template} \
            --batch_size ${batch_size} \
            --num_fewshot ${num_fewshot} \
            --seed ${seed} \
            --model_max_length ${model_max_length} \
            --truncate False

Quick evaluation on JMedBench

We also implemented several scripts to evaluate models on the various supported tasks. You can find them in the scripts/evaluation directory.

If you want to run evaluation on JMedBench, you can use the following one-line command:

bash scripts/evaluation/evaluate_jmedbench.sh ${model_name_or_path}

For example, if we want to evaluate Llama2-7B, we can use the following command:

bash scripts/evaluation/evaluate_jmedbench.sh "meta-llama/Llama-2-7b-hf"

After the evaluation, you can collect the results from the standard output.

Pipeline

EvaluationPipeline is the core class in this submodule, which is used to evaluate models on different tasks. The pipeline consists of the following steps:

  1. Set up the environment, using a single GPU or multiple GPUs with PyTorch DDP.
  2. Load the model and tokenizer.
  3. Load the dataset in a specific format (a list of MCQASample dataclass instances).
  4. Based on the given template, prepare all the requests and compute their losses.
  5. Collect the losses from all GPUs and compute the final metrics (see the sketch after this list).
    • Since we use DDP, some requests are computed multiple times, and their losses may differ slightly due to floating-point precision. Therefore, we average them to obtain the final loss.
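
As a rough illustration of step 5, the sketch below shows one way the duplicated requests can be reconciled after a DDP run: gather (request_id, loss) pairs from every rank and average the losses that share an id. This is only a sketch of the idea, not the actual implementation in this repository.

# Sketch of step 5: gather per-request losses from all DDP ranks and average the
# duplicates. Illustration only -- not the actual med-eval implementation.
from collections import defaultdict
from typing import Dict, List
import torch.distributed as dist

def gather_and_average(local_results: Dict[int, float]) -> Dict[int, float]:
    # local_results maps request_id -> loss computed on this rank.
    gathered: List[Dict[int, float]] = [dict() for _ in range(dist.get_world_size())]
    dist.all_gather_object(gathered, local_results)

    buckets = defaultdict(list)
    for rank_results in gathered:
        for request_id, loss in rank_results.items():
            buckets[request_id].append(loss)

    # The same request scored on several ranks may give slightly different
    # losses (floating-point non-determinism), so we average them.
    return {rid: sum(losses) / len(losses) for rid, losses in buckets.items()}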

Prerequisites

Tasks and prompt templates

Supported tasks

  1. MCQA tasks
    • medmcqa: MedMCQA
    • medmcqa_jp: MedMCQA-JP
    • usmleqa: USMLE-QA (4 Options)
    • usmleqa_jp: USMLE-QA-JP (4 Options)
    • medqa: Med-QA (5 Options)
    • medqa_jp: Med-QA-JP (5 Options)
    • pubmedqa: PubMedQA
    • pubmedqa_jp: PubMedQA-JP
    • igakuqa: IgakuQA (5-6 options)
    • igakuqa_en: IgakuQA-EN (5-6 options)
    • mmlu_medical: MMLU-Medical
      • Some medical-related subsets from MMLU.
    • mmlu_medical_jp: MMLU-Medical-JP
    • jmmlu: JMMLU
    • jmmlu_medical: JMMLU-Medical
  2. MT tasks
    • ejmmt: EJMMT (en->ja, ja->en)
  3. NER tasks
    • mrner_disease: MRNER-Disease from JMED-LLM
    • mrner_medicine: MRNER-Medicine from JMED-LLM
    • nrner: NRNER from JMED-LLM
    • bc2gm_jp: BC2GM from BLURB
    • bc5chem_jp: BC5Chem from BLURB
    • bc5disease_jp: BC5Disease from BLURB
    • jnlpba_jp: JNLPBA from BLURB
    • ncbi_disease_jp: NCBI-Disease from BLURB
  4. Document Classification
    • crade
    • rrtnm
    • smdis
  5. Semantic Text Similarity
    • jcsts: Japanese Clinical Semantic Text Similarity

Supported prompt templates

For each task, there are four prompt templates:

Task                    | Minimal                    | Standard              | English Centric         | Instructed
------------------------|----------------------------|-----------------------|-------------------------|---------------------------------
MCQA (except pubmedqa*) | mcqa_minimal               | mcqa_with_options_jp  | mcqa_with_options       | 4o_mcqa_instructed_jp
MCQA (pubmedqa*)        | context_based_mcqa_minimal | context_based_mcqa_jp | context_based_mcqa      | context_based_mcqa_instructed_jp
MT (en->ja)             | mt_minimal                 | english_japanese      | mt_english_centric_e2j  | mt_instructed_e2j
MT (ja->en)             | mt_minimal                 | japanese_english      | mt_english_centric_j2e  | mt_instructed_j2e
NER                     | minimal                    | standard              | english-centric         | instructed
DC                      | context_based_mcqa_minimal | dc_with_options_jp    | dc_with_options         | dc_instructed_jp
STS                     | sts_minimal                | sts_as_nli_jp         | sts_as_nli              | sts_instructed_jp

See the template directory for details. Other templates can be found in the templates module.
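
As a purely illustrative example of the difference between these styles (the real wording lives in the templates module), an "mcqa_with_options"-style prompt might be assembled roughly like this:

# Purely illustrative: roughly what an "options"-style MCQA prompt could look like.
# The actual template wording is defined in the templates module of this repository.
question = "Which vitamin deficiency causes scurvy?"
options = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]

option_block = "\n".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
prompt = f"Question: {question}\nOptions:\n{option_block}\nAnswer:"
print(prompt)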

How to define a new task?

  1. Go to the tasks/base.py module.
  2. Define a function to load the dataset in a specific format (see the sketch below).
    • Output: Dict[str, List[MCQASample]]
    • It MUST include a "test" key. Optionally, you can include a "train" key for few-shot evaluation, or turn on use_fake_demo when running.
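
Below is a rough sketch of what such a loader could look like. The MCQASample fields used here (question, options, answer_idx) and the file layout are assumptions for illustration; check tasks/base.py for the real dataclass definition.

# Sketch of a custom task loader. The MCQASample fields (question, options,
# answer_idx) and the JSONL layout are assumptions -- see tasks/base.py.
import json
from typing import Dict, List

from tasks.base import MCQASample  # provided by this repository

def load_my_local_mcqa(path: str = "/path/to/my_dataset") -> Dict[str, List[MCQASample]]:
    def read_split(split: str) -> List[MCQASample]:
        with open(f"{path}/{split}.jsonl") as f:
            return [
                MCQASample(
                    question=ex["question"],
                    options=ex["options"],
                    answer_idx=ex["answer_idx"],
                )
                for ex in map(json.loads, f)
            ]

    # "test" is required; "train" is optional (used for few-shot demos,
    # or turn on use_fake_demo instead).
    return {"test": read_split("test"), "train": read_split("train")}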

Appendix

Statistics

MCQA               | #Train  | #Test
-------------------|---------|------
MedMCQA (jp)       | 182,822 | 4,183
USMLE-QA (jp)      | 10,178  | 1,273
MedQA (jp)         | 10,178  | 1,273
MMLU-medical (jp)  | 45      | 1,871
JMMLU-medical (jp) | 45*     | 1,271
IgakuQA            | 10,178* | 989
PubMedQA (jp)      | 1,000   | 1,000

MT    | #Train | #Test
------|--------|------
EJMMT | 80     | 2,400

NER               | #Train | #Test
------------------|--------|------
BC2GM (jp)        | 12,572 | 5,037
BC5Chem (jp)      | 4,562  | 4,801
BC5Disease (jp)   | 4,560  | 4,797
JNLPBA (jp)       | 18,607 | 4,260
NCBI-Disease (jp) | 5,424  | 940

DC    | #Train | #Test
------|--------|------
CRaDE | 8      | 92
RRTNM | 11     | 89
SMDIS | 16     | 84

STS   | #Train | #Test
------|--------|------
JCSTS | 170    | 3,500

TODO

Citation

If you find this code helpful for your research, please cite the following paper:

@misc{jiang2024jmedbenchbenchmarkevaluatingjapanese,
      title={JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models}, 
      author={Junfeng Jiang and Jiahao Huang and Akiko Aizawa},
      year={2024},
      eprint={2409.13317},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.13317}, 
}