# Foundation Model for Endoscopy Video Analysis
This repository provides the official PyTorch implementation of the paper "Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train" by Zhao Wang*, Chang Liu*, Shaoting Zhang†, and Qi Dou†.
<div align="center"> <a href="https://"><img width="800px" height="auto" src="assets/framework.png"></a> </div>

## Key Features
- First foundation model for endoscopy video analysis.
- A large-scale endoscopic video dataset with over 33K video clips.
- Supports 3 types of downstream tasks: classification, segmentation, and detection.
## Details
Foundation models have exhibited remarkable success in various applications, such as disease diagnosis and text report generation. To date, a foundation model for endoscopic video analysis is still lacking. In this paper, we propose Endo-FM, a foundation model specifically developed using massive endoscopic video data. First, we build a video transformer which captures both local and global long-range dependencies across spatial and temporal dimensions. Second, we pre-train our transformer model on global and local views in a self-supervised manner, aiming to make it robust to spatial-temporal variations and discriminative across different scenes. To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets and a privately collected dataset from Baoshan Branch of Renji Hospital in Shanghai, China. Our dataset consists of over 33K video clips with up to 5 million frames, encompassing various protocols, target organs, and disease types. Our pre-trained Endo-FM can be easily adopted for a given downstream task via fine-tuning by serving as the backbone. With experiments on 3 different types of downstream tasks, including classification, segmentation, and detection, our Endo-FM surpasses the current state-of-the-art self-supervised pre-training and adapter-based transfer learning methods by a significant margin.
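To illustrate the backbone usage described above, here is a minimal sketch of attaching a task-specific head to the pre-trained video transformer for fine-tuning. The `feat_dim` value and the assumption that the backbone maps a clip tensor to a clip-level feature vector are illustrative, not the repository's exact API.

```python
import torch
import torch.nn as nn

# Minimal sketch: wrap a pre-trained Endo-FM backbone with a new task head.
# Assumptions (not the repo's exact API): the backbone maps a clip tensor of
# shape (B, C, T, H, W) to a (B, feat_dim) feature, and feat_dim=768 (ViT-B).
class DownstreamClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone                      # pre-trained Endo-FM transformer
        self.head = nn.Linear(feat_dim, num_classes)  # randomly initialized task head

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(clips)                  # clip-level features
        return self.head(feats)                       # task logits
```

During fine-tuning, both the backbone and the head would typically be updated end-to-end with a task-specific loss.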
## Datasets
<div align="center"> <a href="https://"><img width="800px" height="auto" src="assets/dataset_details.png"></a> </div>
<div align="center"> <a href="https://"><img width="800px" height="auto" src="assets/dataset_visualization.png"></a> </div>

We utilize 6 public and 1 private datasets for pre-training and 3 public datasets for the downstream tasks. Except for SUN & SUN-SEG, we provide our preprocessed data for both pre-training and the downstream tasks.
### Pre-training Data (6 public + 1 private)
- Colonoscopic [original paper] [original dataset] [our preprocessed dataset]
- SUN & SUN-SEG [original paper1] [original paper2] [original dataset1] [original dataset2]
- LPPolypVideo [original paper] [original dataset] [our preprocessed dataset]
- Hyper-Kvasir [original paper] [original dataset] [our preprocessed dataset]
- Kvasir-Capsule [original paper] [original dataset] [our preprocessed dataset]
- CholecTriplet [original paper] [original dataset] [our preprocessed dataset]
- Our Private [our preprocessed dataset]
### Downstream Data (3 public)
- PolypDiag [original paper] [original dataset] [our preprocessed dataset]
- CVC-12k [original paper] [original dataset] [our preprocessed dataset]
- KUMC [original paper] [original dataset] [our preprocessed dataset]
For SUN & SUN-SEG, you first need to request the original videos following this instruction. Then, you can convert the data into pre-training videos by running:
```bash
cd Endo-FM/data
python sun.py
python sun_seg.py
python trans_videos_pretrain.py
```
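For reference, here is a minimal sketch of the clip-splitting idea behind this preprocessing step. The 32-frame clip length and the file handling are assumptions for illustration; the actual logic lives in the scripts above.

```python
import cv2

# Illustrative sketch only: split one video into fixed-length clips.
# The real preprocessing is performed by data/trans_videos_pretrain.py.
def split_into_clips(video_path: str, clip_len: int = 32):
    cap = cv2.VideoCapture(video_path)
    clips, clip = [], []
    while True:
        ok, frame = cap.read()
        if not ok:                    # end of video
            break
        clip.append(frame)
        if len(clip) == clip_len:     # emit a full clip and start a new one
            clips.append(clip)
            clip = []
    cap.release()
    return clips
```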
Finally, generate the video list `pretrain/train.csv` for pre-training by running:
```bash
cd Endo-FM/data
python gencsv.py
```
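Conceptually, the generated list is just a CSV of clip paths. A minimal sketch of such a generator is shown below; the `pretrain/videos` directory layout and the one-path-per-row schema are assumptions, and `gencsv.py` may differ.

```python
import csv
from pathlib import Path

# Hypothetical sketch: collect preprocessed clips and write pretrain/train.csv.
# The "pretrain/videos" layout and single-column schema are assumptions.
def generate_video_list(video_dir: str = "pretrain/videos",
                        out_csv: str = "pretrain/train.csv") -> None:
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for path in sorted(Path(video_dir).rglob("*.mp4")):
            writer.writerow([str(path)])
```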
## Get Started
### Main Requirements
- torch==1.8.0
- torchvision==0.9.0
- pillow==6.2.2
- timm==0.4.12
### Installation
We suggest using Anaconda to set up the environment on Linux. If you have already installed Anaconda, you can skip this step.
```bash
wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh && bash Anaconda3-2020.11-Linux-x86_64.sh
```
Then, install the required packages using the provided `environment.yaml`:
```bash
cd Endo-FM
conda env create -f environment.yaml
conda activate endofm
```
### Pre-trained Weights
You can directly download our pre-trained Endo-FM via this link and put it under `checkpoints/` for downstream fine-tuning.
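If you want to load the checkpoint manually, a hedged sketch follows. The checkpoint filename and the `teacher`/`backbone.` key layout are assumptions based on DINO/SVT-style checkpoints, not a confirmed format.

```python
import torch
import torch.nn as nn

# Sketch only: the filename and the "teacher"/"backbone." key layout are
# assumptions (DINO/SVT-style checkpoints), not the confirmed format.
def load_endofm_weights(model: nn.Module,
                        ckpt_path: str = "checkpoints/endo_fm.pth") -> nn.Module:
    state = torch.load(ckpt_path, map_location="cpu")
    if isinstance(state, dict) and "teacher" in state:
        state = state["teacher"]                        # unwrap a training checkpoint
    state = {k.replace("backbone.", ""): v for k, v in state.items()}
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model
```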
### Downstream Fine-tuned Weights
Also, we provide the fine-tuned weights for the 3 downstream tasks for direct downstream testing.
| Dataset | PolypDiag | CVC-12k | KUMC |
|---|---|---|---|
| Our Paper | 90.7 | 73.9 | 84.1 |
| Released Model | 91.5 | 76.6 | 84.0 |
| Weights | link | link | link |
## Pre-training
```bash
cd Endo-FM
wget -P checkpoints/ https://github.com/kahnchana/svt/releases/download/v1.0/kinetics400_vitb_ssl.pth
bash scripts/train_clips32k.sh
```
## Downstream Fine-tuning
```bash
# PolypDiag (Classification)
cd Endo-FM
bash scripts/eval_finetune_polypdiag.sh

# CVC (Segmentation)
cd Endo-FM/TransUNet
python train.py

# KUMC (Detection)
cd Endo-FM/STMT
python setup.py build develop
python -m torch.distributed.launch \
    --nproc_per_node=1 \
    --master_port=$((RANDOM + 10000)) \
    tools/train_net.py \
    --config-file configs/STFT/kumc_R_50_STFT.yaml \
    OUTPUT_DIR log_dir/kumc_finetune
```
## Direct Downstream Testing
```bash
# PolypDiag (Classification)
cd Endo-FM
bash scripts/test_finetune_polypdiag.sh

# CVC (Segmentation)
cd Endo-FM/TransUNet
python train.py --test

# KUMC (Detection)
cd Endo-FM/STMT
python setup.py build develop
python -m torch.distributed.launch \
    --nproc_per_node=1 \
    --master_port=$((RANDOM + 10000)) \
    tools/test_net.py \
    --config-file configs/STFT/kumc_R_50_STFT.yaml \
    MODEL.WEIGHT kumc.pth \
    OUTPUT_DIR log_dir/kumc_finetune
```
## Feedback and Contact

For further questions, please feel free to contact Zhao Wang.
## License
This project is licensed under the Apache 2.0 license. See LICENSE for details.
## Acknowledgement
Our code is based on DINO, TimeSformer, SVT, TransUNet, and STFT. Thanks to them for releasing their code.
## Citation
If you find this code useful, please cite our paper:
```bibtex
@inproceedings{wang2023foundation,
  title={Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train},
  author={Zhao Wang and Chang Liu and Shaoting Zhang and Qi Dou},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  pages={101--111},
  year={2023},
  organization={Springer}
}
```