MultiMedEval

MultiMedEval is a library for evaluating the performance of Vision-Language Models (VLMs) on medical-domain tasks. The goal is to provide a set of benchmarks with a unified evaluation scheme to facilitate the development and comparison of medical VLMs. We include 24 tasks covering 10 different imaging modalities as well as several text-only tasks.


Tasks

<details>
<summary>Question Answering</summary>

| Task | Description | Modality | Size |
| --- | --- | --- | --- |
| MedQA | Multiple choice questions on general medical knowledge | General medicine | 1273 |
| PubMedQA | Yes/no/maybe questions based on PubMed paper abstracts | General medicine | 500 |
| MedMCQA | Multiple choice questions on general medical knowledge | General medicine | 4183 |

</details>
<br/>
<details>
<summary>Visual Question Answering</summary>

| Task | Description | Modality | Size |
| --- | --- | --- | --- |
| VQA-RAD | Open-ended questions on radiology images | X-ray | 451 |
| Path-VQA | Open-ended questions on pathology images | Pathology | 6719 |
| SLAKE | Open-ended questions on radiology images | X-ray | 1061 |

</details>
<br/>
<details>
<summary>Report Comparison</summary>

| Task | Description | Modality | Size |
| --- | --- | --- | --- |
| MIMIC-CXR-ReportGeneration | Generation of the findings section of radiology reports based on the radiology images | Chest X-ray | 2347 |
| MIMIC-III | Summarization of radiology reports | Text | 13054 |

</details>
<br/>
<details>
<summary>Natural Language Inference</summary>

| Task | Description | Modality | Size |
| --- | --- | --- | --- |
| MedNLI | Natural language inference on medical sentences | General medicine | 1422 |

</details>
<br/>
<details>
<summary>Image Classification</summary>

| Task | Description | Modality | Size |
| --- | --- | --- | --- |
| MIMIC-CXR-ImageClassification | Classification of radiology images into 5 diseases | Chest X-ray | 5159 |
| VinDr-Mammo | Classification of mammography images into 5 BIRADS levels | Mammography | 429 |
| Pad-UFES-20 | Classification of skin lesion images into 7 diseases | Dermatology | 2298 |
| CBIS-DDSM-Mass | Classification of masses in mammography images into "benign", "malignant" or "benign without callback" | Mammography | 378 |
| CBIS-DDSM-Calcification | Classification of calcifications in mammography images into "benign", "malignant" or "benign without callback" | Mammography | 326 |
| MNIST-Oct | Image classification of optical coherence tomography scans of the retina | OCT | 1000 |
| MNIST-Path | Image classification of pathology images | Pathology | 7180 |
| MNIST-Blood | Image classification of blood cells seen through a microscope | Microscopy | 3421 |
| MNIST-Breast | Image classification of mammography images | Mammography | 156 |
| MNIST-Derma | Image classification of skin defect images | Dermatology | 2005 |
| MNIST-OrganC | Image classification of abdominal CT scans | CT | 8216 |
| MNIST-OrganS | Image classification of abdominal CT scans | CT | 8827 |
| MNIST-Pneumonia | Image classification of chest X-rays | X-ray | 624 |
| MNIST-Retina | Image classification of the retina taken with a fundus camera | Fundus camera | 400 |
| MNIST-Tissue | Image classification of kidney cortex seen through a microscope | Microscopy | 12820 |

</details>
<br/>

<p align="center"> <img src="figures/sankey.png" alt="Sankey graph"> <br> <em>Representation of the modalities, tasks and datasets in MultiMedEval</em> </p>

Setup

To install the library, use pip:

pip install multimedeval

To run the benchmark on your model, you first need to create an instance of the MultiMedEval class.

from multimedeval import MultiMedEval, SetupParams, EvalParams
engine = MultiMedEval()

You then need to call the setup function of the engine. This will download the datasets if needed and prepare them for evaluation. You can specify where to store the data and which datasets you want to download.

setupParams = SetupParams(medqa_dir="data/")
tasksReady = engine.setup(setup_params=setupParams)

Here we initialize the SetupParams dataclass with only the path for the MedQA dataset. Any dataset for which you don't pass a directory will be skipped during evaluation. During the setup process, the script will ask for a PhysioNet username and password to download "VinDr-Mammo", "MIMIC-CXR" and "MIMIC-III". You also need to set up Kaggle credentials on your machine before running the setup, as "CBIS-DDSM" is hosted on Kaggle. At the end of the setup process you will see a summary of which tasks are ready and which failed, and the function returns this summary as a dictionary.
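As a sketch (not taken from the library's documentation), the returned dictionary can be inspected to check which tasks are usable before launching an evaluation. The extra directory arguments below (vqa_rad_dir, mimic_cxr_dir) are hypothetical names modeled on medqa_dir and may not match the actual SetupParams fields:

# Hypothetical directory arguments modeled on `medqa_dir`;
# check the SetupParams definition for the exact field names.
setupParams = SetupParams(
    medqa_dir="data/",
    vqa_rad_dir="data/",    # assumed field name
    mimic_cxr_dir="data/",  # assumed field name
)
tasksReady = engine.setup(setup_params=setupParams)

# `tasksReady` is the summary dictionary returned by setup; print it to see
# which tasks are ready and which were skipped or failed.
for task, status in tasksReady.items():
    print(task, status)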

Usage

Implement the Batcher

The user must implement one Callable: the batcher. It takes a batch of inputs and must return the corresponding answers. The batch is a list of inputs, each an instance of the BatcherInput dataclass, containing the following fields:

[
    BatcherInput(
        conversation = 
          [
              {"role": "user", "content": "This is a question with an image <img>."},
              {"role": "assistant", "content": "This is the answer."},
              {"role": "user", "content": "This is a question with an image <img>."},
          ],
        images = [PIL.Image(), PIL.Image()],
        segmentation_masks = [PIL.Image(), PIL.Image()]
    ),
    BatcherInput(
        conversation =
          [
              {"role": "user", "content": "This is a question without images."},
              {"role": "assistant", "content": "This is the answer."},
              {"role": "user", "content": "This is a question without images."},
          ],
        images = [],
        segmentation_masks = []
    ),

]

Here is an example of a batcher without any logic:

def batcher(prompts) -> list[str]:
    return ["Answer" for _ in prompts]

A function is the simplest example of a Callable, but the batcher can also be implemented as a Callable class (i.e. a class implementing the __call__ method). Doing it this way allows you to initialize the model in the class's __init__ method. Here is an example for the Mistral model (a text-only model).

from transformers import AutoModelForCausalLM, AutoTokenizer

class batcherMistral:
    def __init__(self) -> None:
        self.model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
        self.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def __call__(self, prompts) -> list[str]:
        # Convert each conversation to a prompt string, then tokenize the whole batch.
        model_inputs = [self.tokenizer.apply_chat_template(messages.conversation, tokenize=False) for messages in prompts]
        model_inputs = self.tokenizer(model_inputs, padding="max_length", truncation=True, max_length=1024, return_tensors="pt")

        generated_ids = self.model.generate(**model_inputs, max_new_tokens=200, do_sample=True, pad_token_id=self.tokenizer.pad_token_id)

        # Keep only the newly generated tokens (drop the prompt tokens).
        generated_ids = generated_ids[:, model_inputs["input_ids"].shape[1] :]

        answers = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        return answers

Run the benchmark

To run the benchmark, call the eval method of the MultiMedEval class with the list of tasks to benchmark, the batcher to evaluate and the evaluation parameters. If the list is empty, all tasks will be benchmarked.

evalParams = EvalParams(batch_size=128)
results = engine.eval(["MedQA", "VQA-RAD"], batcher, eval_params=evalParams)
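As a sketch (the exact structure of the returned results is not documented here, so treating it as JSON-serializable is an assumption), the results can be dumped to disk for later comparison:

import json

# Assumes `results` can be serialized; `default=str` is a fallback for
# any values that are not natively JSON-serializable.
with open("results.json", "w") as f:
    json.dump(results, f, indent=2, default=str)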

MultiMedEval parameters

The SetupParams class takes a storage path for each dataset (such as medqa_dir in the example above); datasets without a path are skipped.

The EvalParams class groups the evaluation settings, such as the batch_size used in the example above; see the class definition for the full list of options.

Additional tasks

To add a new task to the list of already implemented ones, create a folder named MultiMedEvalAdditionalDatasets and, inside it, a subfolder with the name of your dataset.

Inside your dataset folder, create a JSON file following this template for a VQA dataset:

{
  "taskType": "VQA",
  "modality": "Radiology",
  "samples": [
    {
      "question": "Question 1",
      "answer": "Answer 1",
      "images": ["image1.png", "image2.png"]
    },
    { "question": "Question 2", "answer": "Answer 2", "images": ["image1.png"] }
  ]
}

And for a QA dataset:

{
  "taskType": "QA",
  "modality": "Pathology",
  "samples": [
    {
      "question": "Question 1",
      "answer": "Answer 1",
      "options": ["Option 1", "Option 2"],
      "images": ["image1.png", "image2.png"]
    },
    {
      "question": "Question 2",
      "answer": "Answer 2",
      "options": ["Option 1", "Option 2"],
      "images": ["image1.png"]
    }
  ]
}

Note that in both cases the images key is optional. If the taskType is VQA, the computed metrics are BLEU-1, accuracy for closed and open questions, recall for open questions, and F1. For the QA taskType, the tool reports accuracy (by comparing the answer to every option using BLEU).
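As a sketch, the folder layout and JSON file described above can also be generated programmatically. The location of the MultiMedEvalAdditionalDatasets folder and the JSON filename are not specified above, so both are assumptions here:

import json
from pathlib import Path

# Assumed location of the additional-datasets folder and file name;
# adjust them to match your setup.
dataset_dir = Path("MultiMedEvalAdditionalDatasets") / "MyVQADataset"
dataset_dir.mkdir(parents=True, exist_ok=True)

task = {
    "taskType": "VQA",
    "modality": "Radiology",
    "samples": [
        {"question": "Question 1", "answer": "Answer 1", "images": ["image1.png"]},
        {"question": "Question 2", "answer": "Answer 2"},  # the "images" key is optional
    ],
}

with open(dataset_dir / "dataset.json", "w") as f:
    json.dump(task, f, indent=2)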

Reference

@misc{royer2024multimedeval,
      title={MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models},
      author={Corentin Royer and Bjoern Menze and Anjany Sekuboyina},
      year={2024},
      eprint={2402.09262},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}