MIMIC-CXR-VQA

A new medical visual question answering dataset built on the MIMIC-CXR database

Overview

The MIMIC-CXR-VQA dataset is a complex (involving set and logical operations), diverse (with 48 templates), and large-scale (approximately 377K samples) resource designed specifically for Visual Question Answering (VQA) tasks in the medical domain. Focusing primarily on chest radiographs, the dataset is derived mainly from the MIMIC-CXR-JPG and Chest ImaGenome datasets, both of which were sourced from PhysioNet.

The goal of the MIMIC-CXR-VQA dataset is to serve as a benchmark for evaluating the effectiveness of current medical VQA approaches. Beyond traditional medical VQA tasks, it also serves as an image-based question answering resource for Electronic Health Records (EHRs). We therefore use the question templates from MIMIC-CXR-VQA as seed templates for the image modality when constructing EHRXQA, a multi-modal EHR QA dataset.

Updates

Features

Installation

For Linux:

Ensure that you have Python 3.8.5 or higher installed on your machine. Set up the environment and install the required packages using the commands below:

# Set up the environment
conda create --name mimiccxrvqa python=3.8.5

# Activate the environment
conda activate mimiccxrvqa

# Install required packages
pip install pandas==1.1.3 tqdm==4.65.0 scikit-learn==0.23.2
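
After installation, it can be useful to confirm that the environment actually contains the pinned versions. The following is a minimal sketch, not part of the repository: the helper name `find_mismatches` is hypothetical, and the expected versions are simply copied from the pip command above.

```python
from importlib import metadata

# Pins copied from the pip command above.
EXPECTED = {"pandas": "1.1.3", "tqdm": "4.65.0", "scikit-learn": "0.23.2"}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

If the script prints nothing, the three packages match the pinned versions.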

Setup

Clone this repository and navigate into it:

git clone https://github.com/baeseongsu/mimic-cxr-vqa.git
cd mimic-cxr-vqa

Usage

Privacy

We take data privacy very seriously. All of the data you access through this repository has been carefully prepared to prevent any privacy breaches or data leakage. You can use this data with confidence, knowing that all necessary precautions have been taken.

Access Requirements

The MIMIC-CXR-VQA dataset is constructed from MIMIC-CXR-JPG (v2.0.0), Chest ImaGenome (v1.0.0), and MIMIC-IV (v2.2). All of these source datasets require a credentialed PhysioNet license. Because of this requirement, and in adherence to the Data Use Agreement (DUA), only credentialed users can access the MIMIC-CXR-VQA dataset files (see Access Policy). To access the source datasets, you must fulfill all of the following requirements:

  1. Be a credentialed user
    • If you do not have a PhysioNet account, register for one here.
    • Follow these instructions for credentialing on PhysioNet.
    • Complete the "CITI Data or Specimens Only Research" training course.
  2. Sign the data use agreement (DUA) for each project

Accessing the MIMIC-CXR-VQA Dataset

If you have already downloaded the MIMIC-CXR, MIMIC-IV, and Chest ImaGenome datasets, make sure the predefined global directory variables in the script (MIMIC_IV_BASE_DIR, MIMIC_CXR_BASE_DIR, CHEST_IMAGENOME_BASE_DIR) point to your local dataset paths.
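
Before starting a long build, it may help to confirm that each of these variables points to an existing directory. The sketch below is an illustration only: the variable names come from the text above, but the example paths and the helper `missing_dirs` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Hypothetical example paths; replace them with your own local locations.
BASE_DIRS = {
    "MIMIC_IV_BASE_DIR": "/data/physionet/mimiciv",
    "MIMIC_CXR_BASE_DIR": "/data/physionet/mimic-cxr-jpg",
    "CHEST_IMAGENOME_BASE_DIR": "/data/physionet/chest-imagenome",
}


def missing_dirs(base_dirs):
    """Return the names of variables whose path is not an existing directory."""
    return [name for name, path in base_dirs.items() if not Path(path).is_dir()]


if __name__ == "__main__":
    for name in missing_dirs(BASE_DIRS):
        print(f"{name} does not point to an existing directory: {BASE_DIRS[name]}")
```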

To generate the MIMIC-CXR-VQA dataset from your pre-downloaded datasets, run the main script as follows:

bash build_dataset.sh

Alternatively, to download the source datasets directly from PhysioNet and then generate the MIMIC-CXR-VQA dataset, use the script below, which requires your PhysioNet credentials:

bash download_and_build_dataset.sh

When you run the script, you will be prompted to enter your PhysioNet credentials.

The script performs three steps: (1) downloading the source datasets from PhysioNet, (2) preprocessing these datasets, and (3) generating the complete MIMIC-CXR-VQA dataset by creating the ground-truth answer information.

Downloading MIMIC-CXR-JPG Images

To enhance user convenience, we will provide a script that allows you to download only the CXR images relevant to the MIMIC-CXR-VQA dataset, rather than downloading all the MIMIC-CXR-JPG images.

bash download_images.sh

During script execution, enter your PhysioNet credentials when prompted.

This script performs three steps: (1) it reads the image paths from the JSON files of the MIMIC-CXR-VQA dataset, (2) it uses these paths to download the corresponding images from the MIMIC-CXR-JPG dataset hosted on PhysioNet, and (3) it saves these images locally in the directories given by their paths.
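
The first and last of these steps can be sketched in Python as follows. This is not the repository's implementation: the function names are hypothetical, the sketch assumes each sample carries an `image_path` key holding a relative path, and the authenticated download itself is deliberately left out.

```python
import json
from pathlib import Path


def collect_image_paths(json_files):
    """Return the sorted, de-duplicated image_path values found in the given files."""
    paths = set()
    for json_file in json_files:
        with open(json_file, encoding="utf-8") as f:
            for sample in json.load(f):
                # Assumption: image_path is a path relative to the image root.
                paths.add(sample["image_path"])
    return sorted(paths)


def local_target(image_path, output_root):
    """Mirror the relative image path under output_root, creating parent folders."""
    target = Path(output_root) / image_path
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

De-duplicating the paths first matters because many questions can refer to the same image, so each file only needs to be fetched once.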

Dataset Structure

The dataset is structured as follows:

mimiccxrvqa
└── dataset
    ├── ans2idx.json
    ├── _train_part1.json
    ├── _train_part2.json
    ├── _valid.json
    ├── _test.json
    ├── train.json (available post-script execution)
    ├── valid.json (available post-script execution)
    └── test.json  (available post-script execution)
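
Once the build script has produced the final split files, they can be read with the standard library. The sketch below assumes each file holds a JSON list of dictionaries; the helper names `load_split` and `count_by` are hypothetical and not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_split(dataset_dir, split):
    """Load <dataset_dir>/<split>.json and return its list of QA samples."""
    with open(Path(dataset_dir) / f"{split}.json", encoding="utf-8") as f:
        return json.load(f)


def count_by(samples, key):
    """Count samples per value of a top-level key."""
    return Counter(sample[key] for sample in samples)
```

For example, `count_by(load_split("mimiccxrvqa/dataset", "train"), "content_type")` would summarize how many training questions fall under each content type.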

Dataset Description

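The QA samples in the MIMIC-CXR-VQA dataset are stored in individual .json files. Each file contains a list of Python dictionaries with the following keys, as they appear in the example instance below: split, idx, image_id, question, content_type, semantic_type, template, template_program, and template_arguments.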

Note that these details can be open-sourced without safety concerns and without revealing the dataset's distribution information (including image, question, and answer distributions), thanks to our uniform sampling strategy.

After validating the PhysioNet credentials, the create_answer.py script generates the following items: answer, subject_id, study_id, and image_path.

For example, a single instance looks like this:

{
    "split": "train",
    "idx": 13280,
    "image_id": "34c81443-5a19ccad-7b5e431c-4e1dbb28-42a325c0",
    "question": "Are there signs of both pleural effusion and lung cancer in the left lower lung zone?",
    "content_type": "attribute",
    "semantic_type": "verify",
    "template": "Are there signs of both ${attribute_1} and ${attribute_2} in the ${object}?",
    "template_program": "program_5",
    "template_arguments": {
      "object": {
        "0": "left lower lung zone"
      },
      "attribute": {
        "0": "pleural effusion",
        "1": "lung cancer"
      },
      "category": {},
      "viewpos": {},
      "gender": {}
    },
    "answer": "Will be generated by dataset_builder/generate_answer.py",
    "subject_id": "Will be generated by dataset_builder/generate_answer.py",
    "study_id": "Will be generated by dataset_builder/generate_answer.py",
    "image_path": "Will be generated by dataset_builder/generate_answer.py"
}
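
The relationship between the template, template_arguments, and question fields in this instance can be illustrated with Python's `string.Template`, which uses the same `${name}` placeholder syntax. This is a sketch under an assumed naming convention inferred from the single example above (index "0" maps to `${attribute_1}`, and an argument type with one value also maps to the bare name such as `${object}`); it is not the repository's generation code, and the function names are hypothetical.

```python
from string import Template


def build_substitutions(template_arguments):
    """Flatten template_arguments into a {placeholder_name: value} mapping."""
    substitutions = {}
    for arg_type, values in template_arguments.items():
        # Keys are string indices ("0", "1", ...); order them numerically.
        ordered = [values[k] for k in sorted(values, key=int)]
        if len(ordered) == 1:
            # Assumed convention: a single value may use the bare name.
            substitutions[arg_type] = ordered[0]
        for position, value in enumerate(ordered, start=1):
            # Assumed convention: placeholders are numbered starting at 1.
            substitutions[f"{arg_type}_{position}"] = value
    return substitutions


def render_question(template, template_arguments):
    """Fill a question template with its arguments."""
    return Template(template).substitute(build_substitutions(template_arguments))
```

Applied to the instance above, `render_question` reproduces the text stored in its question field. Other templates may follow a different placeholder convention, so treat the mapping as an assumption to verify.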

Versioning

We employ semantic versioning for our dataset, with the current version being v1.0.0. Generally, we will maintain and provide updates only for the latest version of the dataset. However, in cases where significant updates occur or when older versions are required for validating previous research, we may exceptionally retain previous dataset versions for a period of up to one year. For a detailed list of changes made in each version, check out our CHANGELOG.

Contributing

Contributions that improve the usability and functionality of this dataset are always welcome. If you're interested in contributing, feel free to fork this repository, make your changes, and then submit a pull request. For significant changes, please open an issue first to discuss the proposed changes.

Contact

For any questions or concerns regarding this dataset, please feel free to reach out to us (seongsu@kaist.ac.kr or kyungdaeun@kaist.ac.kr). We appreciate your interest and are eager to assist.

Acknowledgements

More details will be provided soon.

Citation

When you use the MIMIC-CXR-VQA dataset, we would appreciate it if you cite the following:

@article{bae2024ehrxqa,
  title={EHRXQA: A multi-modal question answering dataset for electronic health records with chest x-ray images},
  author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric and Kim, Tackeun and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

License

The code in this repository is provided under the terms of the MIT License. The dataset generated with this code, MIMIC-CXR-VQA, is subject to the terms and conditions of the original datasets from PhysioNet: MIMIC-CXR-JPG License, Chest ImaGenome License, and MIMIC-IV License.