# M2I2
Self-supervised vision-language pretraining for Medical visual question answering
This is the official implementation of M2I2 for the visual question answering task in the medical domain, presented at ISBI 2023.
Our method achieves superior accuracy compared with other state-of-the-art (SOTA) methods on three public medical VQA datasets: VQA-RAD, PathVQA, and Slake. Paper link here.
This repository is based on and inspired by @Junnan Li's work. We sincerely thank them for sharing their code.
<div align=center> <img src="fig/model.png" style="zoom:75%;"> </div>
<center>Figure 1: Overview of the proposed medical VQA model.</center>

## Requirements
Run the following command to install the required packages:
```
pip install -r requirements.txt
```
## Training and Testing
### 1. Dataset Preparation
Please organize the datasets in the following structure; a small script to sanity-check the layout is sketched below the tree:
```
+--clef2022
| +--train
| | +--ImageCLEFmedCaption_2022_train_000001.jpg
| | +--ImageCLEFmedCaption_2022_train_000002.jpg
| | +--...
| +--valid
| | +--ImageCLEFmedCaption_2022_valid_084258.jpg
| | +--ImageCLEFmedCaption_2022_valid_084259.jpg
| | +--...
| +--clef22022_train.json
| +--clef22022_valid.json
+--data_RAD
| +--images
| | +--synpic100132.jpg
| | +--synpic100176.jpg
| | +--...
| +--trainset.json
| +--testset.json
| +--answer_list.json
+--data_PathVQA
| +--images
| | +--train
| | | +--train_0000.jpg
| | | +--train_0001.jpg
| | | +--...
| | +--val
| | | +--val_0000.jpg
| | | +--val_0001.jpg
| | | +--...
| | +--test
| | | +--test_0000.jpg
| | | +--test_0001.jpg
| | | +--...
| +--pathvqa_test.json
| +--pathvqa_train.json
| +--pathvqa_val.json
| +--answer_trainval_list.json
+--data_Slake
| +--imgs
| | +--xmlab0
| | | +--source.jpg
| | | +--question.json
| | | +--...
| | +--....
| +--slake_test.json
| +--slake_train.json
| +--slake_val.json
| +--answer_list.json
```
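The layout above can be sanity-checked before launching any run. Below is a minimal, unofficial Python sketch (assuming the repository root as the working directory and the exact folder and file names listed above; adjust `root` if your data lives elsewhere):

```python
# Unofficial helper: verify that the dataset layout described above is in place.
# The expected names mirror the tree in this README; "root" is assumed to be the
# repository root.
import os

EXPECTED = {
    "clef2022": ["train", "valid", "clef22022_train.json", "clef22022_valid.json"],
    "data_RAD": ["images", "trainset.json", "testset.json", "answer_list.json"],
    "data_PathVQA": ["images", "pathvqa_train.json", "pathvqa_val.json",
                     "pathvqa_test.json", "answer_trainval_list.json"],
    "data_Slake": ["imgs", "slake_train.json", "slake_val.json",
                   "slake_test.json", "answer_list.json"],
}

def check_layout(root="."):
    """Print every expected file or folder that is missing under root."""
    missing = [os.path.join(root, d, entry)
               for d, entries in EXPECTED.items()
               for entry in entries
               if not os.path.exists(os.path.join(root, d, entry))]
    if missing:
        print("Missing entries:")
        print("\n".join("  " + p for p in missing))
    else:
        print("All expected dataset folders and annotation files were found.")

if __name__ == "__main__":
    check_layout()
```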
### 2. Pre-training
```
python3 pretrain_med.py --output_dir ./pretrain
```
### 3. Finetune on Medical VQA tasks
```
python3 train_rad.py --checkpoint ./pretrain/med_pretrain_29.pth --output_dir ./output/rad
python3 train_pathvqa.py --checkpoint ./pretrain/med_pretrain_29.pth --output_dir ./output/pathvqa
python3 train_slake.py --checkpoint ./pretrain/med_pretrain_29.pth --output_dir ./output/slake
```
### 4. Evaluate on Medical VQA tasks
```
python3 vqaRadEval.py --quesFile ./data_RAD/testset.json --resFile ./output/rad/result/med_pretrain_29_vqa_result_<epoch>.json
python3 vqaPathEval.py --quesFile ./data_PathVQA/pathvqa_test.json --resFile ./output/pathvqa/result/med_pretrain_29_vqa_result_<epoch>.json
python3 vqaSlakeEval.py --quesFile ./data_Slake/slake_test.json --resFile ./output/slake/result/med_pretrain_29_vqa_result_<epoch>.json
```
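Each fine-tuning run writes one result file per epoch, which is what the `<epoch>` placeholder above refers to. To score every epoch in one pass, a small unofficial sketch such as the following can be used (it assumes the checkpoint name, output directories, and file-name pattern shown in the commands above):

```python
# Unofficial helper: run the evaluation scripts above on every per-epoch result
# file produced by fine-tuning. Paths and the file-name pattern follow this
# README; adjust them to match your own runs.
import glob
import subprocess

TASKS = [
    # (evaluation script, question file, result directory)
    ("vqaRadEval.py",   "./data_RAD/testset.json",          "./output/rad/result"),
    ("vqaPathEval.py",  "./data_PathVQA/pathvqa_test.json", "./output/pathvqa/result"),
    ("vqaSlakeEval.py", "./data_Slake/slake_test.json",     "./output/slake/result"),
]

for script, ques_file, result_dir in TASKS:
    for res_file in sorted(glob.glob(f"{result_dir}/med_pretrain_29_vqa_result_*.json")):
        subprocess.run(
            ["python3", script, "--quesFile", ques_file, "--resFile", res_file],
            check=True,
        )
```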
## Comparison with the SOTA
### VQA-RAD dataset
<img src="fig/table_1.png">PathVQA dataset
<img src="fig/table_2.png">Slake dataset
<img src="fig/table_3.png">Citation:
```bibtex
@article{M2I2,
  title   = {Self-supervised vision-language pretraining for Medical visual question answering},
  author  = {Pengfei Li and Gang Liu and Lin Tan and Jinying Liao and Shenjun Zhong},
  journal = {arXiv preprint arXiv:2211.13594},
  year    = {2022}
}
```
## License
MIT License