Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation (ECCV 2024)

Hao Fang, Peng Wu, Yawei Li, Xinxin Zhang, Xiankai Lu

[paper] [BibTeX]

<div align="center"> <img src="OVFormer.png" width="100%" height="100%"/> </div><br/>

Installation

See installation instructions.

Data Preparation

See Preparing Datasets for OVFormer.

Getting Started

We first train the OVFormer model on the LVIS dataset:

python train_net.py --num-gpus 4 \
  --config-file configs/lvis/ovformer_R50_bs8.yaml

To evaluate the model's zero-shot generalization performance on the VIS datasets, use

python train_net_video.py \
  --config-file configs/youtubevis_2019/ovformer_R50_bs8.yaml \
  --eval-only MODEL.WEIGHTS models/ovformer_r50_lvis.pth

For YTVIS19/21, split the results.json into base and novel categories with the provided Tool (a minimal splitting sketch is given after the table below); for OVIS, package the results and upload them directly to the specified evaluation server; for BURST, run mAP.py. You are expected to get results like this:

| Model | Backbone | YTVIS19 | YTVIS21 | OVIS | BURST | weights |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| OVFormer | R-50 | 34.8 | 29.8 | 15.1 | 6.8 | model |
| OVFormer | Swin-B | 44.3 | 37.6 | 21.3 | 7.6 | model |
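
The following is a minimal sketch of the YTVIS19/21 splitting step, assuming the standard YTVIS-style results format (a list of per-instance predictions with a category_id field). The base/novel ID sets and file names are hypothetical placeholders; take the actual category splits from the provided Tool or the dataset metadata.

# Minimal sketch: split results.json into base- and novel-category subsets.
# BASE_IDS / NOVEL_IDS are hypothetical -- use the category splits from the
# provided Tool or the dataset metadata, not these placeholder values.
import json

BASE_IDS = {1, 2, 3}   # hypothetical base-category IDs
NOVEL_IDS = {4, 5}     # hypothetical novel-category IDs

with open("results.json") as f:
    results = json.load(f)  # YTVIS-style list of per-instance predictions

base = [r for r in results if r["category_id"] in BASE_IDS]
novel = [r for r in results if r["category_id"] in NOVEL_IDS]

with open("results_base.json", "w") as f:
    json.dump(base, f)
with open("results_novel.json", "w") as f:
    json.dump(novel, f)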

Then, we train the OVFormer model with video-based training on the LV-VIS dataset:

python train_net_lvvis.py --num-gpus 4 \
  --config-file configs/lvvis/video_ovformer_R50_bs8.yaml

To evaluate the model's performance on the LV-VIS dataset, use

python train_net_lvvis.py \
  --config-file configs/lvvis/video_ovformer_R50_bs8.yaml \
  --eval-only MODEL.WEIGHTS models/ovformer_r50_lvvis.pth

Then run mAP.py; you are expected to get results like this (a quick sanity check of the exported results.json is sketched after the table):

| Model | Backbone | LVVIS val | LVVIS test | weights |
| :---: | :---: | :---: | :---: | :---: |
| OVFormer | R-50 | 21.9 | 15.2 | model |
| OVFormer | Swin-B | 24.7 | 19.5 | model |
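
Before scoring with mAP.py, you can sanity-check the exported predictions by counting instances per category. This is only a rough check; the category_id field name assumes the standard YTVIS-style results format.

# Minimal sketch: count predicted instances per category in results.json
# before running mAP.py (assumes YTVIS-style prediction entries with a
# "category_id" field).
import json
from collections import Counter

with open("results.json") as f:
    results = json.load(f)

counts = Counter(r["category_id"] for r in results)
for cat_id, n in sorted(counts.items()):
    print(f"category {cat_id}: {n} predictions")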

<a name="CitingOVFormer"></a>Citing OVFormer

@inproceedings{fang2024unified,
  title={Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation},
  author={Fang, Hao and Wu, Peng and Li, Yawei and Zhang, Xinxin and Lu, Xiankai},
  booktitle={ECCV},
  year={2024},
}

Acknowledgement

This repo is based on detectron2, Mask2Former, and LVVIS. Thanks for their great work!