MITS

Introduction

Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation. [arXiv]

Tracking any given object(s) spatially and temporally is a goal shared by Visual Object Tracking (VOT) and Video Object Segmentation (VOS). We propose a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation, dubbed MITS.

Requirements

Getting Started

Training

Data Preparation

Download the datasets and re-organize the folders into the following structure:

datasets
└───YTB
    └───2019
        └───train
            └───JPEGImages
            └───Annotations
└───DAVIS
    └───JPEGImages
    └───Annotations
    └───ImageSets
└───LaSOT
    └───JPEGImages
    └───Annotations
    └───BoxAnnotations
└───GOT10K
    └───JPEGImages
    └───Annotations
    └───BoxAnnotations

Note: Pseudo masks are used for training by default, but MITS is also compatible with mixed box/mask annotations and can be trained without pseudo masks.
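Before launching training, the layout above can be sanity-checked with a short script. This is a minimal sketch that only tests for the folders shown in the tree; it assumes the datasets folder sits in the repository root.

import os

# Minimal layout check for the training-data tree shown above.
# Assumes the "datasets" folder sits in the repository root.
EXPECTED = [
    "datasets/YTB/2019/train/JPEGImages",
    "datasets/YTB/2019/train/Annotations",
    "datasets/DAVIS/JPEGImages",
    "datasets/DAVIS/Annotations",
    "datasets/DAVIS/ImageSets",
    "datasets/LaSOT/JPEGImages",
    "datasets/LaSOT/Annotations",
    "datasets/LaSOT/BoxAnnotations",
    "datasets/GOT10K/JPEGImages",
    "datasets/GOT10K/Annotations",
    "datasets/GOT10K/BoxAnnotations",
]

missing = [p for p in EXPECTED if not os.path.isdir(p)]
print("missing folders:", missing if missing else "none")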

Pretrain Weights

MITS is initialized with pretrained DeAOT weights. Download R50_DeAOTL_PRE.pth and put it in the pretrain_models folder.
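To verify the file is in place and readable, the checkpoint can be loaded on CPU. A minimal sketch; nothing is assumed about the checkpoint contents beyond it being loadable with torch.load:

import torch

# Load the DeAOT pretrain weights on CPU to verify the download.
ckpt = torch.load("pretrain_models/R50_DeAOTL_PRE.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys())[:10])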

Train Models

Run train.sh to launch training. Configs for the different training settings are in the configs folder; see the model zoo for details.

Evaluation

Data Preparation

We follow the original file structure of each dataset. The first-frame mask is required for evaluation on YouTube-VOS/DAVIS, and the first-frame box for LaSOT/TrackingNet/GOT-10K (a reading sketch follows the tree below).

datasets
└───YTB
    └───2019
        └───valid
            └───JPEGImages
            └───Annotations
└───DAVIS
    └───JPEGImages
    └───Annotations
    └───ImageSets
└───LaSOTTest
    └───airplane-1
        └───img
        └───groundtruth.txt
└───TrackingNetTest
    └───JPEGImages
        └───__WaG8fRMto_0
    └───BoxAnnotations
        └───__WaG8fRMto_0.txt
└───GOT10KTest
    └───GOT-10k_Test_000001
        └───00000001.jpg
        ...
        └───groundtruth.txt
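For the VOT-style benchmarks, only the first line of each groundtruth.txt is needed as the initial box. A minimal sketch for reading it, assuming the comma-separated x,y,w,h format used by LaSOT and GOT-10k (check against your copy of the data):

# Read the first-frame box from a VOT-style groundtruth.txt.
# Assumes one comma-separated "x,y,w,h" box per line (LaSOT/GOT-10k style).
def read_first_frame_box(gt_path):
    with open(gt_path) as f:
        x, y, w, h = map(float, f.readline().strip().split(",")[:4])
    return x, y, w, h

print(read_first_frame_box("datasets/LaSOTTest/airplane-1/groundtruth.txt"))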

Evaluate Models

Run eval_vos.sh to evaluate on YouTube-VOS or DAVIS, and eval_vot.sh to evaluate on LaSOT, TrackingNet, or GOT10K.

The outputs include predicted masks from the mask head, bounding boxes derived from those masks (bbox), and predicted boxes from the box head (boxh). By default, masks are used for VOS benchmarks and boxh boxes for VOT benchmarks.
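The bbox outputs are tight boxes computed from the predicted masks. The sketch below shows that conversion in numpy for illustration; it is not the repository's exact implementation:

import numpy as np

def mask_to_box(mask):
    # Tight (x, y, w, h) box around the nonzero region of a binary mask.
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # empty mask: no box to report
    x0, y0 = xs.min(), ys.min()
    w, h = xs.max() - x0 + 1, ys.max() - y0 + 1
    return float(x0), float(y0), float(w), float(h)

# Example: a 3x2 blob in a 6x8 mask gives (3.0, 2.0, 3.0, 2.0).
m = np.zeros((6, 8), dtype=np.uint8)
m[2:4, 3:6] = 1
print(mask_to_box(m))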

Model Zoo

Model Download

| Model | Training Data | File |
| --- | --- | --- |
| MITS | full | gdrive |
| MITS_box | no VOT masks | gdrive |
| MITS_got | only GOT10k | gdrive |

Evaluation Results

| Model | LaSOT Test<br />AUC/PN/P | TrackingNet Test<br />AUC/PN/P | GOT10k Test<br />AO/SR0.5/SR0.75 | YouTube-VOS 19 val<br />G | DAVIS 17 val<br />G |
| --- | --- | --- | --- | --- | --- |
| MITS<br />Prediction file | 72.1/80.1/78.6<br />gdrive | 83.5/88.7/84.5<br />gdrive | 78.5/87.5/73.7<br />gdrive | 85.9<br />gdrive | 84.9<br />gdrive |
| MITS_box<br />Prediction file | 70.7/78.1/75.8<br />gdrive | 83.0/87.8/83.1<br />gdrive | 78.0/86.4/71.7<br />gdrive | 85.7<br />gdrive | 84.3<br />gdrive |
| MITS_got<br />Prediction file | - | - | 80.4/89.7/75.9<br />gdrive | - | - |

By default, we use box predictions for VOT benchmarks and mask predictions for VOS benchmarks. Results may differ by around 0.1 from those reported in the paper due to code updates.

Acknowledgement

The implementation is heavily based on the prior VOS works AOT and DeAOT.

The pseudo masks used for training on LaSOT and GOT10K are taken from RTS.

Citing

@article{xu2023integrating,
  title={Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation},
  author={Xu, Yuanyou and Yang, Zongxin and Yang, Yi},
  journal={arXiv preprint arXiv:2308.13266},
  year={2023}
}
@inproceedings{yang2022deaot,
  title={Decoupling Features in Hierarchical Propagation for Video Object Segmentation},
  author={Yang, Zongxin and Yang, Yi},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022}
}
@article{yang2021aost,
  title={Scalable Video Object Segmentation with Identification Mechanism},
  author={Yang, Zongxin and Wang, Xiaohan and Miao, Jiaxu and Wei, Yunchao and Wang, Wenguan and Yang, Yi},
  journal={arXiv preprint arXiv:2203.11442},
  year={2023}
}
@inproceedings{yang2021aot,
  title={Associating Objects with Transformers for Video Object Segmentation},
  author={Yang, Zongxin and Wei, Yunchao and Yang, Yi},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2021}
}