
[EMNLP24] Self-Training Large Language and Vision Assistant for Medical

<em> The advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce <strong>S</strong>elf-<strong>T</strong>raining <strong>L</strong>arge <strong>L</strong>anguage <strong>a</strong>nd <strong>V</strong>ision <strong>A</strong>ssistant for <strong>Med</strong>icine (STLLaVA-Med).</em>

<strong> Self-Training Large Language and Vision Assistant for Medical Question-Answering </strong> [paper][HF Model]

Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao

<p align="center"> <img src="./images/cover.jpg" width="500px"> <br> Medical data usage and performance comparison between LLaVA-Med and our method. </p> <p align="center"> <img src="./images/pipeline.jpg" width="500px"> <br> Self-training pipeline for transforming a general vision-language assistant into a medical expert. </p>

🔥 News

Contents

Install

  1. Install Package
conda create -n stllava python=3.10 -y
conda activate stllava
pip install --upgrade pip  # enable PEP 660 support
cd STLLaVA-Med
pip install -e .
  2. Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Data

<strong>Visual instruction data</strong>

This project uses the visual instruction data provided by LLaVA-Med (60k_inline_mention). However, because some of the image URLs are no longer available, we filtered the original data into our own version for this project.
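The filtering step can be sketched in a few lines. This is a minimal illustration, not the project's actual preprocessing code: the `image_url` field name and the reachability checker are assumptions about the record schema.

```python
# Sketch: drop instruction records whose image URL no longer resolves.
# The "image_url" field and the checker are hypothetical stand-ins for
# whatever schema the released data actually uses.

def filter_records(records, is_reachable):
    """Keep only records whose image URL still resolves."""
    return [r for r in records if is_reachable(r["image_url"])]

# Example with a stubbed reachability checker (no network access needed):
records = [
    {"image_url": "https://example.org/a.jpg", "conversations": []},
    {"image_url": "https://example.org/dead.jpg", "conversations": []},
]
alive = {"https://example.org/a.jpg"}
kept = filter_records(records, lambda url: url in alive)
print(len(kept))  # 1
```

In practice the checker would issue an HTTP HEAD request per URL; the stub keeps the example self-contained.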

<strong>DPO data</strong>

<p align="center"> <img src="./images/preference_data.jpg" width="500px"> <br> DPO data example. </p>

This project auto-generates the preference dataset using the model itself, guided by GPT-4o. We sample 10k medical images from PMC-15M. You may download the dataset via STLLaVA-Med-DPO.
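The shape of such an auto-generated preference record can be sketched as follows. This is an illustrative sketch only: `generate` and `judge` are stand-ins for the policy model and the GPT-4o judge, and the field names are assumptions rather than the released dataset's exact schema.

```python
# Sketch: build one DPO preference record. The model proposes two
# candidate answers for a sampled medical image; an external judge
# (GPT-4o in the paper) ranks them into chosen/rejected.
# `generate` and `judge` are hypothetical callables.

def build_preference_record(image_id, question, generate, judge):
    a, b = generate(image_id, question), generate(image_id, question)
    chosen, rejected = (a, b) if judge(question, a, b) == 0 else (b, a)
    return {"image": image_id, "question": question,
            "chosen": chosen, "rejected": rejected}

# Stubbed example: the "model" returns two canned answers and the
# "judge" prefers the more specific one.
answers = iter(["The X-ray shows a rib fracture.", "It is an image."])
rec = build_preference_record(
    "pmc_0001.jpg",
    "What does this X-ray show?",
    generate=lambda img, q: next(answers),
    judge=lambda q, a, b: 0 if "fracture" in a else 1,
)
print(rec["chosen"])  # The X-ray shows a rib fracture.
```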

Training

Training consists of two stages: (1) visual self-questioning instruction tuning stage, teaching the model to ask questions and follow multimodal instructions; (2) preference optimization.

Instruction tuning:

Training script with DeepSpeed ZeRO-3 and LoRA: sqllava_med.sh.

Preference optimization:

Training script with DeepSpeed ZeRO-3 and LoRA: dpo_finetune.sh.
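The preference-optimization stage trains on the chosen/rejected pairs with the standard DPO objective. A plain-Python sketch of that loss (this is the textbook DPO formula, not necessarily the script's exact implementation; `beta` is the usual DPO temperature):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective:
    -log sigmoid(beta * [(policy - reference) log-ratio of the chosen
    answer minus that of the rejected answer])."""
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy prefers the chosen answer more than the reference
# model does, the loss drops below -log(0.5) ~ 0.693:
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))  # ≈ 0.598
```

Minimizing this loss pushes the policy to raise the likelihood of chosen answers relative to rejected ones, anchored to the reference (instruction-tuned) model.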

Evaluation

Please download raw images of datasets (VQA-RAD, SLAKE, PVQA) for medical VQA tasks.

We evaluate models on a diverse set of 3 benchmarks. To ensure reproducibility, we evaluate with greedy decoding rather than beam search, keeping the inference process consistent with the real-time outputs of the chat demo.
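In Hugging Face `transformers` terms, greedy decoding corresponds to the following `generate` settings; this is a sketch of the standard flags, and the `max_new_tokens` value is an assumption rather than the repo's exact evaluation config.

```python
# Standard Hugging Face `generate` settings for greedy decoding:
# no sampling and a single beam, so each step takes the argmax token.
greedy_kwargs = dict(
    do_sample=False,     # disable stochastic sampling
    num_beams=1,         # single beam -> pure greedy decoding
    max_new_tokens=128,  # cap on answer length (assumed value)
)

# With a loaded vision-language model and prepared inputs:
# outputs = model.generate(**inputs, **greedy_kwargs)
print(greedy_kwargs["do_sample"])  # False
```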

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{Sun2024STLLaVAMedSL,
  title={STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical},
  author={Guohao Sun and Can Qin and Huazhu Fu and Linwei Wang and Zhiqiang Tao},
  booktitle={EMNLP},
  year={2024},
}

Acknowledgement