MSMedCap
This repo contains the source code for our MSMedCap paper.
1. Introduction
Current generic text-image pretrained models do not yield satisfactory results when it comes to describing intricate details within medical images. Therefore, building on the classical multimodal model BLIP-2, we present a novel medical image captioning method guided by the Segment Anything Model (SAM) to enable enhanced encoding with both general and detailed feature extraction. The architecture of our model is shown below.
Compared to the classical BLIP-2 model, MSMedCap shows significant improvements on medical datasets.
2. Setup Instructions
Please create a conda environment with Python 3.8 and activate it:
conda create -n msmedcap python=3.8
conda activate msmedcap
Then clone the repository and install the dependencies:
git clone https://github.com/AHandsomePython/MSMedCap.git
cd MSMedCap
pip install -r requirements.txt
3. Run the code
Our code is built on LAVIS. We use the OPT model with 2.7 billion parameters provided in LAVIS, so only the files related to the OPT model and the Q-Former are modified; all other files remain the same as in LAVIS.
3.1 Training
Training Stage 1
The config file can be edited at
MSMedCap/lavis/projects/blip2/train/pretrain_stage1.yaml
If you want to load a pretrained .pth file, edit:
MSMedCap/lavis/configs/models/blip2/blip2_pretrain.yaml
The input data consists of images and JSON caption annotations; set their paths in:
MSMedCap/lavis/configs/datasets/coco/defaults_cap.yaml
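For reference, below is a minimal sketch of such an annotation file. It assumes the COCO-caption style JSON that LAVIS dataset builders expect (one entry per image-caption pair with image, caption, and image_id fields); the file names and paths are placeholders, so check defaults_cap.yaml and your dataset builder for the exact schema.

# Minimal sketch: build a LAVIS-style caption annotation file for your own images.
# Field names ("image", "caption", "image_id") follow the COCO caption format used
# by LAVIS; verify them against the dataset builder before training.
import json

annotations = [
    {
        "image": "images/case_0001.png",       # placeholder path, relative to the images root in defaults_cap.yaml
        "caption": "Chest X-ray showing ...",  # ground-truth caption / report text
        "image_id": "case_0001",               # placeholder unique id
    },
    # ... one entry per image-caption pair
]

with open("annotations/train.json", "w") as f:
    json.dump(annotations, f)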
To run training stage 1, use:
python -m torch.distributed.run --nproc_per_node=2 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage1.yaml
Please note that you should train the SAM and BLIP-2 Q-Formers individually and run the Jupyter notebook transform.ipynb to merge the weights (a rough sketch of the merge step is shown below).
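The sketch below only illustrates the idea of the merge: load the two separately trained Q-Former checkpoints and combine their state dicts into a single checkpoint. It assumes LAVIS-style checkpoints that store the state dict under a "model" key; the file names and the key prefix are placeholders, so follow transform.ipynb for the authoritative procedure.

# Conceptual sketch of merging the SAM-branch and BLIP-2-branch Q-Former weights.
# Paths and the "sam_" prefix are hypothetical; transform.ipynb defines the real keys.
import torch

blip2_ckpt = torch.load("checkpoints/qformer_blip2.pth", map_location="cpu")
sam_ckpt = torch.load("checkpoints/qformer_sam.pth", map_location="cpu")

merged = dict(blip2_ckpt["model"])      # start from the BLIP-2 branch weights
for k, v in sam_ckpt["model"].items():
    merged["sam_" + k] = v              # hypothetical prefix for the SAM branch

torch.save({"model": merged}, "checkpoints/qformer_merged.pth")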
Training Stage 2
The config file can be edited at
MSMedCap/lavis/projects/blip2/train/pretrain_stage2.yaml
To run training stage 2, use:
python -m torch.distributed.run --nproc_per_node=2 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage2.yaml
4. Available checkpoints
The pre-trained weights can be downloaded from Google Drive.
5. Test
Run the Jupyter notebook generate.ipynb to test your output. The config file can be edited at
MSMedCap/lavis/configs/models/blip2/blip2_caption_opt2.7b.yaml
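If you prefer a plain script over the notebook, the sketch below shows caption generation through the standard LAVIS loading API. It assumes the upstream LAVIS registry names (blip2_opt, caption_coco_opt2.7b) are reused unchanged and uses a placeholder image path; generate.ipynb remains the supported path.

# Rough sketch of caption generation via the standard LAVIS API.
# Model name/type assume the upstream LAVIS registry entries are unchanged.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("example_medical_image.png").convert("RGB")  # placeholder path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))  # returns a list with the generated caption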
You can also evaluate the output from the command line:
python -m torch.distributed.run --nproc_per_node=2 evaluate.py --cfg-path lavis/projects/blip2/eval/caption_coco_opt2.7b_eval.yaml
6. Acknowledgements
We are grateful to LAVIS and SAM, on which our code is built.
7. Citation
If you find our paper and/or code helpful, please consider citing:
@inproceedings{zhang2024sam,
title={Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning},
author={Zhang, Zhenyu and Wang, Benlu and Liang, Weijie and Li, Yizhi and Guo, Xuechen and Wang, Guanhong and Li, Shiyan and Wang, Gaoang},
booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1731--1735},
year={2024},
organization={IEEE}
}