DeVIS: Making Deformable Transformers Work for Video Instance Segmentation

This repository provides the official implementation of the DeVIS: Making Deformable Transformers Work for Video Instance Segmentation paper by Adrià Caelles, Tim Meinhardt, Guillem Brasó and Laura Leal-Taixe. The codebase builds upon Deformable DETR, VisTR and TrackFormer.

<div align="center"> <img src="docs/devis_method.png" width="800"/> </div>

Abstract

Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences. In the past, VIS methods mirrored the fragmentation of these subtasks in their architectural design, hence missing out on a joint solution. Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem. Nevertheless, the quadratic complexity of existing Transformer-based methods requires long training times, high memory requirements, and processing of low-single-scale feature maps. Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored. In this work, we present Deformable VIS (DeVIS), a VIS method which capitalizes on the efficiency and performance of deformable Transformers. To reason about all VIS subtasks jointly over multiple frames, we present temporal multi-scale deformable attention with instance-aware object queries. We further introduce a new image and video instance mask head with multi-scale features, and perform near-online video processing with multi-cue clip tracking. DeVIS reduces memory as well as training time requirements, and achieves state-of-the-art results on the YouTube-VIS 2021, as well as the challenging OVIS dataset.

Results

Click on the evaluation benchmark you want to see!

<details><summary>COCO</summary><p>
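
| Model | Backbone | box AP | mask AP | AP50 | AP75 | APl | APm | APs | FPS |
|---|---|---|---|---|---|---|---|---|---|
| Mask R-CNN | R50 | 41.0 | 37.2 | 58.5 | 39.8 | 53.3 | 39.4 | 18.6 | 21.4 |
| Mask2Former | R50 | - | 43.7 | - | - | 64.8 | 47.2 | 23.4 | 13.5 |
| Ours | R50 | 46.3 | 38.0 | 61.4 | 40.1 | 59.8 | 41.4 | 17.9 | 12.1 |
| Mask R-CNN | R101 | 42.9 | 38.6 | 60.4 | 41.3 | 55.3 | 41.3 | 19.4 | - |
| Mask2Former | R101 | - | 44.2 | - | - | 67.7 | 47.7 | 23.8 | - |
| Ours | R101 | 47.9 | 39.9 | 63.0 | 42.1 | 61.5 | 43.9 | 19.9 | - |
| Mask2Former | R101 | - | 50.1 | - | - | 72.1 | 53.9 | 31.0 | - |
| Ours | SwinL | 54.6 | 45.2 | 61.4 | 40.1 | 59.8 | 41.4 | 17.9 | - |
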
</p></details> <details><summary>YouTube-VIS-2019</summary><p>
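
| Model | Backbone | AP | AP50 | AP75 | AR1 | AR10 | FPS |
|---|---|---|---|---|---|---|---|
| VisTR | R50 | 36.2 | 59.8 | 36.9 | 37.2 | 42.4 | 69.9 |
| IFC | R50 | 41.2 | 65.1 | 44.6 | 42.3 | 49.6 | 107.1 |
| SeqFormer | R50 | 45.1 | 66.9 | 50.5 | 45.6 | 54.6 | - |
| Mask2Former | R50 | 46.4 | 68.0 | 50.0 | - | - | - |
| Ours (T=6, S=4) | R50 | 44.4 | 67.9 | 48.6 | 42.4 | 51.6 | 18.4 |
| SeqFormer | SwinL | 59.3 | 82.1 | 66.4 | 51.7 | 64.4 | - |
| Mask2Former | SwinL | 60.4 | 84.4 | 67.0 | - | - | - |
| Ours (T=6, S=4) | SwinL | 57.1 | 80.8 | 66.3 | 50.8 | 61.0 | - |
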
</p></details> <details><summary>YouTube-VIS-2021</summary><p>
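
| Model | Backbone | AP | AP50 | AP75 | AR1 | AR10 |
|---|---|---|---|---|---|---|
| IFC | R50 | 35.2 | 57.2 | 37.5 | - | - |
| SeqFormer | R50 | 40.5 | 62.4 | 43.7 | 36.1 | 48.1 |
| Mask2Former | R50 | 40.6 | 60.9 | 41.8 | - | - |
| Ours (T=6, S=4) | R50 | 43.1 | 66.8 | 46.6 | 38.0 | 50.1 |
| SeqFormer | SwinL | 51.8 | 74.6 | 58.2 | 42.8 | 58.1 |
| Mask2Former | SwinL | 52.6 | 76.4 | 57.2 | - | - |
| Ours (T=6, S=4) | SwinL | 54.4 | 77.7 | 59.8 | 43.8 | 57.8 |
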
</p></details> <details><summary>OVIS</summary><p>
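
| Model | Backbone | AP | AP50 | AP75 | AR1 | AR10 |
|---|---|---|---|---|---|---|
| CrossVis | R50 | 14.9 | 32.7 | 12.1 | 10.3 | 19.8 |
| TeViT | R50 | 17.4 | 34.9 | 15.0 | 11.2 | 21.8 |
| Ours (T=6, S=4) | R50 | 23.7 | 47.6 | 20.8 | 12.0 | 28.9 |
| Ours (T=6, S=4) | SwinL | 35.5 | 59.3 | 38.3 | 16.6 | 39.8 |
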
</p></details>

Configuration

Our configuration system is based on YACS (similar to detectron2). We hope this allows the research community to build upon our method more easily. Refer to src/config.py for an overview of all available configuration options, including how the model is built and the training and test options. All default config values correspond to the Deformable DETR + iterative bounding box refinement model, which makes it easier to see the changes we have introduced on top of it. Config values that are unique to DeVIS, on the other hand, are set to those of the YT-19 model. We use uppercase words (e.g. MODEL.NUM_QUERIES) to refer to config parameters.
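As a rough illustration of how dotted uppercase keys such as MODEL.NUM_QUERIES map onto a nested config when passed as `KEY VALUE` pairs on the command line, here is a minimal standard-library sketch. It is not the YACS implementation and not the code in src/config.py: the `DEFAULTS` values, `parse_value` and `merge_overrides` are hypothetical names and placeholders chosen for this example only.

```python
import ast
import copy

# Hypothetical placeholder defaults; the real options live in src/config.py.
DEFAULTS = {
    "MODEL": {"NUM_QUERIES": 300, "WEIGHTS": ""},
    "TEST": {"VIZ": {"OUT_VIZ_PATH": ""}},
}


def parse_value(raw):
    """Interpret a command-line string as a Python literal when possible."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # plain strings such as file paths are kept unchanged


def merge_overrides(defaults, overrides):
    """Apply a flat [KEY, VALUE, KEY, VALUE, ...] list onto a nested dict."""
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    cfg = copy.deepcopy(defaults)  # leave the defaults untouched
    for key, raw in zip(overrides[0::2], overrides[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node[name]  # raises KeyError for an unknown section
        if leaf not in node:
            raise KeyError(f"unknown config option: {key}")
        node[leaf] = parse_value(raw)
    return cfg


cfg = merge_overrides(
    DEFAULTS,
    ["MODEL.WEIGHTS", "/path/to/checkpoint", "MODEL.NUM_QUERIES", "100"],
)
print(cfg["MODEL"])
```

Rejecting unknown keys instead of silently creating them mirrors the usual benefit of a defaults-first config: a typo in an option name fails early rather than being ignored.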

Install

See docs/INSTALL.md for detailed installation instructions.

Train

See docs/TRAIN.md for detailed training instructions.

Evaluate

To evaluate a model's performance, add the --eval-only argument and set MODEL.WEIGHTS to the checkpoint path on the command line. For example, the following command obtains YT-19 val predictions:

python main.py --config-file configs/devis/YT-19/devis_R_50_YT-19.yaml --eval-only MODEL.WEIGHTS /path/to/yt-19_checkpoint_file

We also support multi-GPU testing: launch the same command with torchrun and set --nproc_per_node to the desired number of GPUs.

torchrun --nproc_per_node=4 main.py --config-file configs/devis/YT-19/devis_R_50_YT-19.yaml --eval-only MODEL.WEIGHTS /path/to/yt-19_checkpoint_file
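The two commands above differ only in the launcher, so a script that runs evaluations can assemble them programmatically. The following is a small sketch using only the flags shown above. `build_eval_command` is a hypothetical helper written for this example and is not part of the repository.

```python
import shlex


def build_eval_command(config_file, weights, num_gpus=1):
    """Return the evaluation command as an argument list.

    Uses plain `python` for one GPU and `torchrun` otherwise,
    matching the two command lines shown in this section.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        launcher = ["python"]
    else:
        launcher = ["torchrun", f"--nproc_per_node={num_gpus}"]
    return launcher + [
        "main.py",
        "--config-file", config_file,
        "--eval-only",
        "MODEL.WEIGHTS", weights,
    ]


cmd = build_eval_command(
    "configs/devis/YT-19/devis_R_50_YT-19.yaml",
    "/path/to/yt-19_checkpoint_file",
    num_gpus=4,
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute it
```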

Furthermore, we have added the option to validate several checkpoints once the training finishes by simply pointing TEST.INPUT_FOLDER to the output training directory and TEST.EPOCHS_TO_EVAL to the epochs you want to validate.

Visualize results

When TEST.VIZ.OUT_VIZ_PATH=path/to/save is specified, the visual results from the .json file are saved to that path. Additionally, TEST.VIZ.SAVE_CLIP_VIZ saves the results of the individual sub-clips (before clip tracking is applied). Finally, TEST.VIZ.SAVE_MERGED_TRACKS=True plots all tracks on the same image (as in the figures of the paper).

We provide an additional config file that changes the thresholds to get more visually appealing results and sets TEST.VIZ.VIDEO_NAMES to run inference only on the specified videos (the ones shown below). The following command shows how to get visual results from the YT-21 val set:

python main.py --config-file configs/devis/devis_R_50_visualization_YT-21.yaml --eval-only MODEL.WEIGHTS /path/to/yt-21_checkpoint_file

To generate the video, go to the output folder containing all the images and run:

ffmpeg -framerate 5 -pattern_type glob -i '*.jpg' -c:v libx264 -pix_fmt yuv420p out.mp4
<div align="center"> <img src="docs/fish.gif" alt="MOT17-03-SDP" width="375"/> <img src="docs/surfer.gif" alt="MOTS20-07" width="375"/> </div>
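If the frames of several videos need to be converted, the ffmpeg call above can be driven from Python. The sketch below passes the same arguments as that command. `build_ffmpeg_command` and `render_video` are hypothetical helpers for this example, and the code assumes ffmpeg is on the PATH and was built with glob pattern support.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(framerate=5, pattern="*.jpg", output="out.mp4"):
    """Return the ffmpeg argument list used in this section."""
    return [
        "ffmpeg",
        "-framerate", str(framerate),
        "-pattern_type", "glob",
        "-i", pattern,  # expanded by ffmpeg itself, not by a shell
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        output,
    ]


def render_video(frames_dir, framerate=5, pattern="*.jpg", output="out.mp4"):
    """Run ffmpeg inside `frames_dir`, failing early if no frames match."""
    frames_dir = Path(frames_dir)
    if not any(frames_dir.glob(pattern)):
        raise FileNotFoundError(f"no files matching {pattern} in {frames_dir}")
    subprocess.run(
        build_ffmpeg_command(framerate, pattern, output),
        cwd=frames_dir,
        check=True,  # raise if ffmpeg exits with an error
    )


print(" ".join(build_ffmpeg_command()))
```

Calling `render_video("path/to/save/some_video")` would then write `out.mp4` next to the frames. The directory name here is only a placeholder.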

Attention maps

We also provide an additional script, visualize_att_maps.py, to generate attention maps. We recommend using the aforementioned visualization config file. The script lets you choose the decoder layer and whether to merge resolutions (see args_parse() for more info).

python visualize_att_maps.py --config-file configs/devis/devis_R_50_visualization_YT-21.yaml --merge-resolution 1 MODEL.WEIGHTS /path/to/yt-21_checkpoint_file
<div align="center"> <img src="docs/attention_maps.png" width="800"/> </div>

Publication

If you use this software in your research, please cite our publication:

@article{devis,
  author = {Caelles, Adrià and Meinhardt, Tim and Brasó, Guillem and Leal-Taixé, Laura},
  title = {{DeVIS: Making Deformable Transformers Work for Video Instance Segmentation}},
  journal = {arXiv:2207.11103},
  year = {2022},
}