VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation

<div align="center"> <img src="VISOLO.png"/> </div>

Paper

VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation

Note

Steps

  1. Installation.
git clone https://github.com/SuHoHan95/VISOLO.git
cd VISOLO
pip install -e .
pip install -r requirements.txt
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"
  2. Link datasets.

COCO

Download the JSON file (coco_to_ytvis2019.json) and copy it into the COCO annotations folder.

cp coco_to_ytvis2019.json /path_to_coco_dataset/annotations
cd projects/VISOLO
mkdir -p datasets/coco
ln -s /path_to_coco_dataset/annotations datasets/coco/annotations
ln -s /path_to_coco_dataset/train2017 datasets/coco/train2017
ln -s /path_to_coco_dataset/val2017 datasets/coco/val2017

YTVIS 2019

mkdir -p datasets/ytvis_2019
ln -s /path_to_ytvis_2019_dataset/* datasets/ytvis_2019

The ytvis_2019 folder is expected to have the following structure:

└── ytvis_2019
    ├── train
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── valid
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── test
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── train.json
    ├── valid.json
    └── test.json
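
Before training, it can help to confirm that the symlinked dataset matches the tree above. The following is a minimal sketch, not part of the VISOLO code base: the function name `missing_ytvis_entries` is hypothetical, and the script assumes it is run from projects/VISOLO, where datasets/ytvis_2019 was created.

```python
from pathlib import Path

# Expected layout, copied from the directory tree above.
SPLITS = ("train", "valid", "test")
PER_SPLIT = ("Annotations", "JPEGImages", "meta.json")


def missing_ytvis_entries(root):
    """Return the expected paths (relative to `root`) that do not exist.

    Hypothetical helper for illustration; it only checks that each entry
    exists, not that its contents are valid.
    """
    root = Path(root)
    # Top-level annotation files: train.json, valid.json, test.json.
    expected = [root / f"{split}.json" for split in SPLITS]
    # Per-split entries: Annotations/, JPEGImages/, meta.json.
    for split in SPLITS:
        expected.extend(root / split / name for name in PER_SPLIT)
    # Path.exists() follows symlinks, so a dangling link counts as missing.
    return [str(p.relative_to(root)) for p in expected if not p.exists()]


if __name__ == "__main__":
    missing = missing_ytvis_entries("datasets/ytvis_2019")
    if missing:
        print("Missing entries:")
        for entry in missing:
            print("  " + entry)
    else:
        print("datasets/ytvis_2019 matches the expected layout.")
```

An empty list means every entry in the tree was found; a dangling symlink is reported as missing.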
  3. Training.
python train_net.py --num-gpus 4 --config-file ./configs/base_coco.yaml OUTPUT_DIR ./checkpoint/coco/
python train_net.py --num-gpus 4 --config-file ./configs/base_ytvis_coco.yaml OUTPUT_DIR ./checkpoint/ytvis_2019/ MODEL.WEIGHTS path/to/pre-trained-model.pth
  4. Evaluating.

Evaluating on YTVIS 2019

python train_net.py --eval-only --num-gpus 1 --config-file ./configs/base_ytvis_coco.yaml OUTPUT_DIR ./checkpoint/ytvis_2019/ MODEL.WEIGHTS path/to/model.pth

The predictions are saved as "results.json" in OUTPUT_DIR/inference/.
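
For a quick sanity check of the output, the file can be summarized with a few lines of Python. This is a sketch under an assumption that this README does not confirm: that results.json follows the usual YouTube-VIS submission format, a JSON list of predictions each carrying at least `video_id` and `score`. The function name `summarize_results` is hypothetical.

```python
import json
from collections import Counter


def summarize_results(path):
    """Summarize a predictions file.

    Assumes (not confirmed by this README) that the file is a JSON list of
    dicts, each with at least "video_id" and "score" keys.
    """
    with open(path) as f:
        results = json.load(f)
    # Count predictions per video to see how many videos are covered.
    per_video = Counter(r["video_id"] for r in results)
    scores = [r["score"] for r in results]
    return {
        "num_predictions": len(results),
        "num_videos": len(per_video),
        "max_score": max(scores) if scores else None,
    }


if __name__ == "__main__":
    # Path follows the OUTPUT_DIR used in the evaluation command above.
    print(summarize_results("./checkpoint/ytvis_2019/inference/results.json"))
```

If the keys differ in your version of the code, adjust the field names accordingly.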

Model Checkpoints (YTVIS 2019)

Due to the small size of the YTVIS dataset, the scores may fluctuate even when the model is retrained with the same configuration.

Note: The provided checkpoints are the ones with the highest accuracy from multiple training attempts.

| backbone | FPS | AP | AP50 | AP75 | AR1 | AR10 | download |
|---|---|---|---|---|---|---|---|
| ResNet-50 | 40.0 | 38.6 | 56.3 | 43.7 | 35.7 | 42.5 | model |

Video Comparisons

The overall flow of our VISOLO and the comparison of different VIS methods on the YouTube-VIS 2019 dataset are provided at https://youtu.be/j33H7vcJ2uU

License

VISOLO is released under the Apache 2.0 license.

This code is for non-commercial use only.

Citing

If our work is useful in your project, please consider citing us.

@inproceedings{han2022visolo,
  title={VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation},
  author={Han, Su Ho and Hwang, Sukjun and Oh, Seoung Wug and Park, Yeonchool and Kim, Hyunwoo and Kim, Min-Jung and Kim, Seon Joo},
  booktitle =  {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Acknowledgement

We highly appreciate all previous works that influenced our project.
Special thanks to facebookresearch and the IFC authors for their publicly released code (detectron2, IFC).