VISA: Reasoning Video Object Segmentation via Large Language Model

<div align=center> <img src="assert/architecture.png" style="width:100%;"> </div>

🚀 Performance

<div style="text-align: justify;"> VISA demonstrates remarkable proficiency in handling complex segmentation tasks that require: (a) reasoning based on world knowledge; (b) inference of future events; and (c) a comprehensive understanding of video content. </div> <div align=center> <img src="assert/performance.png" style="width:50%;"> </div>

šŸ› ļø Installation

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
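
After installing, a quick import check (not part of the repo) can confirm that PyTorch sees a GPU and that flash-attn built correctly:

```python
# Optional sanity check (not a repo script): verify the key dependencies import
# and that PyTorch can see a CUDA device before starting training.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```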

🦄 Training and Validation

1. Training Data Preparation

Before training, please download the datasets and configure their paths in dataset_config.py.
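
The exact variable names in dataset_config.py are defined by the repo; as a rough sketch, the file comes down to pointing the dataset roots used below at your local copies (the names here mirror the $LISA_ROOT, $ChatUniVi_ROOT, and $RVOS_ROOT placeholders and may differ from the real ones):

```python
# Hypothetical sketch of dataset_config.py -- variable names are illustrative;
# use the ones actually defined in the repo's dataset_config.py.
LISA_ROOT = "/data/LISA_ROOT"              # LISA datasets (ade20k, coco, ...)
CHAT_UNIVI_ROOT = "/data/ChatUniVi_ROOT"   # Chat-UniVi-Instruct data
RVOS_ROOT = "/data/RVOS_ROOT"              # ReVOS, lvvis, Ref-Youtube-VOS, davis17, mevis
```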

<details open> <summary> <strong>LISA's Dataset</strong> </summary>

Follow LISA to prepare LISA's datasets, and store them under the $LISA_ROOT folder.

LISA_ROOT
├── ade20k
├── coco
├── cocostuff
├── llava_dataset
├── mapillary
├── reason_seg
├── refer_seg
└── vlpart
</details> <details open> <summary> <strong>Chat-UniVi's Dataset</strong> </summary>

Follow Chat-UniVi/Chat-UniVi-Instruct to prepare the Chat-UniVi-Instruct datasets, and store them under the $ChatUniVi_ROOT folder.

ChatUniVi_ROOT
├── Fine-tuning
│   ├── MIMIC_imageonly
│   └── VIDEO
└── ScienceQA_tuning
</details> <details open> <summary> <strong>RVOS's Dataset</strong> </summary>
  1. Reasoning Video Segmentation Dataset: ReVOS.
  2. Referring Video Segmentation Datasets: Ref-Youtube-VOS, Ref-DAVIS17, MeViS.
  3. Open-Vocabulary Video Instance Segmentation Dataset: LV-VIS. Download mask_dict.json and meta_expressions.json from OneDrive or BaiduPan, then put the annotation files in the $RVOS_ROOT/lvvis/train directory as follows.
RVOS_ROOT
├── ReVOS
│   ├── JPEGImages
│   ├── mask_dict.json
│   ├── mask_dict_foreground.json
│   ├── meta_expressions_train_.json
│   └── meta_expressions_valid_.json
├── lvvis
│   └── train
│       ├── JPEGImages
│       ├── mask_dict.json
│       └── meta_expressions.json
├── Ref-Youtube-VOS
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       └── JPEGImages
├── davis17
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       ├── JPEGImages
│       └── mask_dict.pkl
└── mevis
</details>
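
Before launching training, it can help to confirm that the ReVOS and LV-VIS annotation files landed in the layout above. The check below is a sketch, not part of the repo, and assumes the directory names shown in the tree:

```python
# Hypothetical sanity check for the RVOS layout above (not a repo script).
import os

RVOS_ROOT = "/data/RVOS_ROOT"  # adjust to your local path
expected = [
    "ReVOS/mask_dict.json",
    "ReVOS/meta_expressions_train_.json",
    "lvvis/train/mask_dict.json",
    "lvvis/train/meta_expressions.json",
    "Ref-Youtube-VOS/meta_expressions/train/meta_expressions.json",
    "davis17/meta_expressions/valid/meta_expressions.json",
]
for rel in expected:
    path = os.path.join(RVOS_ROOT, rel)
    print(("ok      " if os.path.exists(path) else "MISSING ") + path)
```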

2. Pre-trained Weights

<details open> <summary> <strong>Chat-UniVi</strong> </summary>

To train VISA-7B or VISA-13B, download the Chat-UniVi weights from Chat-UniVi-7B and Chat-UniVi-13B, respectively.
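
If you prefer scripting the download, the sketch below uses huggingface_hub; the repo id "Chat-UniVi/Chat-UniVi" matches the --version flag used in the LoRA merge step later, while the 13B id is an assumption based on the naming.

```python
# Hypothetical download helper (not a repo script).
from huggingface_hub import snapshot_download

# 7B base weights; id matches --version in the LoRA merge command below.
snapshot_download(repo_id="Chat-UniVi/Chat-UniVi", local_dir="weights/Chat-UniVi-7B")
# The 13B id below is assumed -- check the Chat-UniVi-13B link above for the exact repo name.
# snapshot_download(repo_id="Chat-UniVi/Chat-UniVi-13B", local_dir="weights/Chat-UniVi-13B")
```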

</details> <details open> <summary> <strong>SAM</strong> </summary>

Download the SAM ViT-H pre-trained weights from the link.
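
A hedged download sketch follows; the URL is the official Segment Anything release location for sam_vit_h_4b8939.pth, but verify it against the link above before use.

```python
# Hypothetical download of the SAM ViT-H checkpoint (not a repo script).
import urllib.request

SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
urllib.request.urlretrieve(SAM_URL, "sam_vit_h_4b8939.pth")
print("saved sam_vit_h_4b8939.pth")
```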

</details>

3. Training VISA

# Training VISA-7B
bash scripts/train_7b.sh 

# Extract fp32 consolidated weights from ZeRO stage 1, 2, or 3 DeepSpeed checkpoints
cd /PATH/TO/VISA-7B/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

# Merge the LoRA weights and save the Hugging Face model
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version Chat-UniVi/Chat-UniVi \
  --weight /PATH/TO/VISA-7B/pytorch_model.bin \
  --save_path /PATH/TO/VISA-7B/hf_model
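
If the merge step complains about missing parameters, a quick inspection of the consolidated checkpoint (a sketch, not a repo script) can confirm that zero_to_fp32.py produced both base and LoRA weights:

```python
# Hypothetical inspection of the consolidated fp32 weights (not a repo script).
import torch

state_dict = torch.load("/PATH/TO/VISA-7B/pytorch_model.bin", map_location="cpu")
print(len(state_dict), "tensors")
# LoRA adapters usually appear as *.lora_A / *.lora_B entries.
print([k for k in state_dict if "lora" in k.lower()][:5])
```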

4. Validation

<details open> <summary> <strong>1. Using `VISA` to generate the predicted masks for each video <a href="https://github.com/cilinyan/VISA/blob/main/scripts/val_7b_video.sh">[demo]</a></strong> </summary>
deepspeed --master_port=24999 train_ds.py \
  --version="/PATH/TO/VISA-7B/hf_model" \
  --vision_pretrained="/PATH/TO/sam_vit_h_4b8939.pth" \
  --log_base_dir="/PATH/TO/LOG_BASE_DIR" \
  --exp_name="val_7b" \
  --balance_sample \
  --dataset="reason_seg" \
  --sample_rates="13" \
  --val_dataset "revos_valid" \
  --eval_only 
</details> <details open> <summary> <strong>2. Using <a href="https://github.com/dvlab-research/LLaMA-VID">LLaMA-VID</a> to generate the target frame for each video</strong> </summary>

You can download the results of our run directly from OneDrive or BaiduPan.

</details> <details open> <summary> <strong>3. Using <a href="https://github.com/cilinyan/VISA/blob/main/XMem/tracking.py">XMem</a> for mask propagation <a href="https://github.com/cilinyan/VISA/blob/c53d2cd31407eab583c5eb04f84fd95b4694f2ce/XMem/tracking.py#L103-L110">[demo]</a> </strong> </summary> </details> <details open> <summary> <strong>4. Evaluate ReVOS's performance <a href="https://github.com/cilinyan/VISA/blob/main/tools/eval_revos.py#L74-L81">[demo]</a> </strong> </summary>
cd tools
python eval_revos.py /PATH/TO/FINAL_ANNOTATION [ARGS]
</details>

📑 Todo list

ā­ Cite

If you find this project useful in your research, please consider citing:

@article{yan2024visa,
  title={VISA: Reasoning Video Object Segmentation via Large Language Models},
  author={Yan, Cilin and Wang, Haochen and Yan, Shilin and Jiang, Xiaolong and Hu, Yao and Kang, Guoliang and Xie, Weidi and Gavves, Efstratios},
  journal={arXiv preprint arXiv:2407.11325},
  year={2024}
}

šŸŽ–ļø Acknowledgement

This work is built upon LLaVA, SAM, LISA, Chat-UniVi, MeViS, LLaMA-VID, and XMem.