Awesome

VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation

Paper

VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation

Note

Based on detectron2.
The codes are under projects/ folder, which follows the convention of detectron2.
You can easily import our project to the latest detectron2 by following below.
- inserting projects/VISOLO folder
- updating detectron2/projects/__init__.py
- updating setup.py

Steps

Installation.

Run a docker container and install the repository.
Compile a docker image following docker/README.md.
On top of the docker container, install the repository by the following command.

git clone https://github.com/SuHoHan95/VISOLO.git
cd VISOLO
pip install -e .
pip install -r requirements.txt
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"

Link datasets

COCO

Download the json file(coco_to_ytvis2019.json)

cp coco_to_ytvis2019.json /path_to_coco_dataset/annotations
cd projects/VISOLO
mkdir -p datasets/coco
ln -s /path_to_coco_dataset/annotations datasets/coco/annotations
ln -s /path_to_coco_dataset/train2017 datasets/coco/train2017
ln -s /path_to_coco_dataset/val2017 datasets/coco/val2017

YTVIS 2019

mkdir -p datasets/ytvis_2019
ln -s /path_to_ytvis_2019_dataset/* datasets/ytvis_2019

we expect ytvis_2019 folder to be like

└── ytvis_2019
    ├── train
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── valid
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── test
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── train.json
    ├── valid.json
    └── test.json

Training.

Training using 4 GPUS(TESLA V100-PCIE-32GB)
Pre-training on COCO dataset

python train_net.py --num-gpus 4 --config-file ./configs/base_coco.yaml OUTPUT_DIR ./checkpoint/coco/

Fine-tuning on YTVIS2019 with pre-trained weights on COCO dataset (r50_coco.pth)

python train_net.py --num-gpus 4 --config-file ./configs/base_ytvis_coco.yaml OUTPUT_DIR ./checkpoint/ytvis_2019/ MODEL.WEIGHTS path/to/pre-trained-model.pth

Evaluating.

Evaluating on YTVIS 2019

python train_net.py --eval-only --num-gpus 1 --config-file ./configs/base_ytvis_coco.yaml OUTPUT_DIR ./checkpoint/ytvis_2019/ MODEL.WEIGHTS path/to/model.pth

"results.json" saved in OUTPUT_DIR/inference/

Model Checkpoints (YTVIS 2019)

Due to the small size of YTVIS dataset, the scores may fluctuate even if retrained with the same configuration.

Note: The provided checkpoints are the ones with highest accuracy from multiple training attempts.

backbone	FPS	AP	AP50	AP75	AR1	AR10	download
ResNet-50	40.0	38.6	56.3	43.7	35.7	42.5	model

Video Comparisons

The overall flow of our VISOLO and the comparison of different VIS methods on the YouTube-VIS 2019 dataset are provided at https://youtu.be/j33H7vcJ2uU

License

VISOLO is released under the Apache 2.0 license.

This code is for non-commercial use only.

Citing

If our work is useful in your project, please consider citing us.

@inproceedings{han2022visolo,
  title={VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation},
  author={Han, Su Ho and Hwang, Sukjun and Oh, Seoung Wug and Park, Yeonchool and Kim, Hyunwoo and Kim, Min-Jung and Kim, Seon Joo},
  booktitle =  {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Acknowledgement

We highly appreciate all previous works that influenced our project.
Special thanks to facebookresearch and IFC authors for their wonderful codes that have been publicly released (detectron2, IFC).