<font size=7><div align='center'>VISA: Reasoning Video Object Segmentation via Large Language Models</div></font>

<div align=center> <img src="assert/architecture.png" style="width:100%;"> </div>

## Performance

<div style="text-align: justify;"> VISA demonstrates remarkable proficiency in handling complex segmentation tasks that require: (a) reasoning based on world knowledge; (b) inference of future events; and (c) a comprehensive understanding of video content. </div>

<div align=center> <img src="assert/performance.png" style="width:50%;"> </div>

## Installation
```shell
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
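If the `flash-attn` build succeeds, a quick import check (just a sanity check, not a repository script) confirms it is importable against your installed PyTorch/CUDA:

```python
# Sanity check that flash-attn built correctly for this environment.
import flash_attn
print(flash_attn.__version__)
```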
## Training and Validation
### 1. Training Data Preparation
Before training, please download the datasets and configure their paths in `dataset_config.py`.
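As a rough illustration of what that step amounts to, the sketch below uses placeholder paths and variable names; the real names are the ones defined in `dataset_config.py` itself:

```python
# dataset_config.py -- illustrative sketch only; consult the actual file for
# the variable names VISA expects. The roots mirror the dataset layouts
# described in the subsections below.
LISA_ROOT = "/data/datasets/lisa"            # LISA's image datasets
CHATUNIVI_ROOT = "/data/datasets/chatunivi"  # Chat-UniVi-Instruct data
RVOS_ROOT = "/data/datasets/rvos"            # ReVOS, Ref-Youtube-VOS, DAVIS17, MeViS, LV-VIS
```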
<details open>
<summary> <strong>LISA's Dataset</strong> </summary>

Follow LISA to prepare LISA's datasets. The dataset folder should be stored in the `$LISA_ROOT` folder.

```
LISA_ROOT
├── ade20k
├── coco
├── cocostuff
├── llava_dataset
├── mapillary
├── reason_seg
├── refer_seg
└── vlpart
```
</details>
<details open>
<summary> <strong>Chat-UniVi's Dataset</strong> </summary>

Follow Chat-UniVi/Chat-UniVi-Instruct to prepare the Chat-UniVi-Instruct datasets. The dataset folder should be stored in the `$ChatUniVi_ROOT` folder.

```
ChatUniVi_ROOT
├── Fine-tuning
│   ├── MIMIC_imageonly
│   └── VIDEO
└── ScienceQA_tuning
```
</details>
<details open>
<summary> <strong>RVOS's Dataset</strong> </summary>

- Reasoning Video Segmentation Dataset: ReVOS.
- Referring Video Segmentation Datasets: Ref-Youtube-VOS, Ref-DAVIS17, MeViS.
- Open-Vocabulary Video Instance Segmentation Dataset: LV-VIS.

Download `mask_dict.json` and `meta_expressions.json` from OneDrive or BaiduPan, then put the annotation files in the `$RVOS_ROOT/lvvis/train` directory as shown below (a small sanity-check script is sketched after the directory listing).
```
RVOS_ROOT
├── ReVOS
│   ├── JPEGImages
│   ├── mask_dict.json
│   ├── mask_dict_foreground.json
│   ├── meta_expressions_train_.json
│   └── meta_expressions_valid_.json
├── lvvis
│   └── train
│       ├── JPEGImages
│       ├── mask_dict.json
│       └── meta_expressions.json
├── Ref-Youtube-VOS
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       └── JPEGImages
├── davis17
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       ├── JPEGImages
│       └── mask_dict.pkl
└── mevis
```
</details>
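Before launching training, it can help to verify that the annotation files actually ended up where the loaders expect them. The following is a minimal sanity-check sketch (a hypothetical helper, not shipped with this repository) that probes a few of the paths from the listing above:

```python
# check_rvos_layout.py -- hypothetical helper, not part of the VISA codebase.
# Probes a handful of files/directories from the RVOS_ROOT layout shown above.
import os
import sys

RVOS_ROOT = "/PATH/TO/RVOS_ROOT"  # point this at your actual RVOS root

EXPECTED = [
    "ReVOS/JPEGImages",
    "ReVOS/mask_dict.json",
    "ReVOS/meta_expressions_train_.json",
    "ReVOS/meta_expressions_valid_.json",
    "lvvis/train/JPEGImages",
    "lvvis/train/mask_dict.json",
    "lvvis/train/meta_expressions.json",
    "Ref-Youtube-VOS/meta_expressions/train/meta_expressions.json",
    "Ref-Youtube-VOS/train/mask_dict.pkl",
    "davis17/valid/mask_dict.pkl",
]

missing = [p for p in EXPECTED if not os.path.exists(os.path.join(RVOS_ROOT, p))]
if missing:
    print("Missing paths:")
    for p in missing:
        print("  -", p)
    sys.exit(1)
print("RVOS layout looks complete.")
```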
### 2. Pre-trained weights
<details open>
<summary> <strong>Chat-UniVi</strong> </summary>

To train VISA-7B or 13B, you need to download Chat-UniVi weights from Chat-UniVi-7B and Chat-UniVi-13B.
</details>

<details open>
<summary> <strong>SAM</strong> </summary>

Download SAM ViT-H pre-trained weights from the link.
</details>

### 3. Training VISA
```shell
# Training VISA-7B
bash scripts/train_7b.sh

# Extract fp32 consolidated weights from ZeRO stage 1, 2, or 3 DeepSpeed checkpoints
cd /PATH/TO/VISA-7B/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

# Merge LoRA weights
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version Chat-UniVi/Chat-UniVi \
  --weight /PATH/TO/VISA-7B/pytorch_model.bin \
  --save_path /PATH/TO/VISA-7B/hf_model
```
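Optionally, before merging, you can sanity-check the consolidated checkpoint written by `zero_to_fp32.py`. A minimal inspection sketch (an optional helper, not part of the repository; note that it loads the full fp32 state dict into CPU memory):

```python
# inspect_checkpoint.py -- optional helper, not part of the VISA codebase.
# Loads the consolidated fp32 state dict written by zero_to_fp32.py and
# reports how many tensors and parameters it contains.
import torch

state_dict = torch.load("/PATH/TO/VISA-7B/pytorch_model.bin", map_location="cpu")
num_params = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, {num_params / 1e9:.2f}B parameters")
for name in list(state_dict)[:5]:  # peek at the first few parameter names
    print(" ", name, tuple(state_dict[name].shape))
```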
### 4. Validation
<details open>
<summary> <strong>1. Using `VISA` to generate predicted mask of each video <a href="https://github.com/cilinyan/VISA/blob/main/scripts/val_7b_video.sh">[demo]</a></strong> </summary>

```shell
deepspeed --master_port=24999 train_ds.py \
  --version="/PATH/TO/VISA-7B/hf_model" \
  --vision_pretrained="/PATH/TO/sam_vit_h_4b8939.pth" \
  --log_base_dir="/PATH/TO/LOG_BASE_DIR" \
  --exp_name="val_7b" \
  --balance_sample \
  --dataset="reason_seg" \
  --sample_rates="13" \
  --val_dataset "revos_valid" \
  --eval_only
```
</details>
<details open>
<summary> <strong>2. Using <a href="https://github.com/dvlab-research/LLaMA-VID">LLaMA-VID</a> to generate target frame for each video</strong> </summary>

You can directly download the results of our run from OneDrive or BaiduPan.

- Run `http_server_mp.py` to build the API server for LLaMA-VID [demo]

  ```shell
  python utils_llamavid/llamavid_server.py \
    --vision_tower /PATH/TO/eva_vit_g.pth \
    --image_processor /PATH/TO/openai/clip-vit-large-patch14 \
    --model-path /PATH/TO/YanweiLi/llama-vid-13b-full-224-video-fps-1
  ```

- Using the API for inference [demo]

  ```shell
  python utils_llamavid/llamavid_client.py \
    --video_root /PATH/TO/ReVOS/JPEGImages \
    --data_json_file /PATH/TO/ReVOS/meta_expressions_valid_.json
  ```

</details>

<details open>
<summary> <strong>3. Evaluation</strong> </summary>

```shell
cd tools
python eval_revos.py /PATH/TO/FINAL_ANNOTATION [ARGS]
```
</details>
## Todo list

- Release code with Text-guided Frame Sampler's Local Sampling
- Release VISA model weights (issue #6)
- Release code with Text-guided Frame Sampler's Global-Local Sampling
## Cite
If you find this project useful in your research, please consider citing:
```bibtex
@article{yan2024visa,
  title={VISA: Reasoning Video Object Segmentation via Large Language Models},
  author={Yan, Cilin and Wang, Haochen and Yan, Shilin and Jiang, Xiaolong and Hu, Yao and Kang, Guoliang and Xie, Weidi and Gavves, Efstratios},
  journal={arXiv preprint arXiv:2407.11325},
  year={2024}
}
```
## Acknowledgement
This work is built upon LLaVA, SAM, LISA, Chat-UniVi, MeViS, LLaMA-VID, and XMem.