<p align="center"> <img src="https://raw.githubusercontent.com/Project-MONAI/MONAI/dev/docs/images/MONAI-logo-color.png" width="30%"/> </p>

MONAI Vision Language Models

The repository provides a collection of vision language models, benchmarks, and related applications, released as part of Project MONAI (Medical Open Network for Artificial Intelligence).

💡 News

VILA-M3

VILA-M3 is a vision language model designed specifically for medical applications. It focuses on addressing the unique challenges faced by general-purpose vision-language models when applied to the medical domain and integrated with existing expert segmentation and classification models.

<p align="center"> <img src="m3/docs/images/VILA-M3_overview_v2.png" width="95%"/> </p>

For details, see here.

Online Demo

Please visit the VILA-M3 Demo to try out a preview version of the model.

<p align="center"> <img src="m3/docs/images/gradio_app_ct.png" width="70%"/> </p>

Local Demo

Prerequisites

Recommended: Build Docker Container

  1. To run the demo, we recommend building a Docker container with all the requirements. We use a base image with CUDA preinstalled.
    docker build --network=host --progress=plain -t monai-m3:latest -f m3/demo/Dockerfile .
    
  2. Run the container
    docker run -it --rm --ipc host --gpus all --net host monai-m3:latest bash
    

    Note: If you want to load your own VILA checkpoint in the demo, you need to mount a folder using -v <your_ckpts_dir>:/data/checkpoints in your docker run command (see the example after this list).

  3. Next, follow the steps to start the Gradio Demo.
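
For example, to run the container with a local checkpoints folder mounted, you can combine the docker run command from step 2 with the -v option from the note (<your_ckpts_dir> is a placeholder for your host directory):

docker run -it --rm --ipc host --gpus all --net host -v <your_ckpts_dir>:/data/checkpoints monai-m3:latest bash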

Alternative: Manual installation

  1. Linux Operating System

  2. CUDA Toolkit 12.2 (with nvcc) for VILA.

    To verify CUDA installation, run:

    nvcc --version
    

    If CUDA is not installed, use one of the following methods:

    • Recommended: Use the Docker image nvidia/cuda:12.2.2-devel-ubuntu22.04
      docker run -it --rm --ipc host --gpus all --net host nvidia/cuda:12.2.2-devel-ubuntu22.04 bash
      
    • Manual installation (not recommended): Download the appropriate package from the official NVIDIA page.
  3. Python 3.10, Git, Wget, and Unzip:

    To install these, run:

    sudo apt-get update
    sudo apt-get install -y wget python3.10 python3.10-venv python3.10-dev git unzip
    

    NOTE: The commands are tailored for the Docker image nvidia/cuda:12.2.2-devel-ubuntu22.04. If using a different setup, adjust the commands accordingly.

  4. GPU Memory: Ensure that the GPU has sufficient memory to run the models (a quick way to check is shown after this list):

    • VILA-M3: ~18GB for the 8B model, ~30GB for the 13B model
    • CXR: This expert dynamically loads various TorchXRayVision models and performs ensemble predictions. The memory requirement is roughly 1.5GB in total.
    • VISTA3D: This expert model dynamically loads the VISTA3D model to segment a 3D-CT volume. The memory requirement is roughly 12GB, and peak memory usage can be higher, depending on the input size of the 3D volume.
    • BRATS: (TBD)
  5. Setup Environment: Clone the repository, set up the environment, and download the experts' checkpoints:

    git clone https://github.com/Project-MONAI/VLM --recursive
    cd VLM
    python3.10 -m venv .venv
    source .venv/bin/activate
    make demo_m3
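
Before launching the demo, you can compare the available GPU memory against the requirements listed in step 4; a minimal check, assuming the NVIDIA driver and nvidia-smi are installed:

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv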
    

Running the Gradio Demo

  1. Navigate to the demo directory:

    cd m3/demo
    
  2. Start the Gradio demo:

    This will automatically download the default VILA-M3 checkpoint from Hugging Face.

    python gradio_m3.py
    
  3. Alternative: Start the Gradio demo with a local checkpoint, e.g.:

    python gradio_m3.py  \
    --source local \
    --modelpath /data/checkpoints/<8B-checkpoint-name> \
    --convmode llama_3
    

For details, see the available command-line arguments.
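
If you are unsure which options your checkout supports, the script's built-in help lists them (this assumes gradio_m3.py uses Python's standard argument parser):

python gradio_m3.py --help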

Adding your own expert model

Contributing

To lint the code, please install these packages:

pip install -r requirements-ci.txt

Then run the following command:

isort --check-only --diff .  # using the configuration in pyproject.toml
black . --check  # using the configuration in pyproject.toml
ruff check .  # using the configuration in ruff.toml

To auto-format the code, run the following command:

isort . && black . && ruff format .

References & Citation

If you find this work useful in your research, please consider citing:

@article{nath2024vila,
  title={VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge},
  author={Nath, Vishwesh and Li, Wenqi and Yang, Dong and Myronenko, Andriy and Zheng, Mingxin and Lu, Yao and Liu, Zhijian and Yin, Hongxu and Law, Yee Man and Tang, Yucheng and others},
  journal={arXiv preprint arXiv:2411.12915},
  year={2024}
}