OVTrack: Open-Vocabulary Multiple Object Tracking (CVPR 2023)

paper | project page

News and Updates

Evaluate your tracker on the open-vocabulary MOT benchmark

To compare with OVTrack and evaluate your own tracker's results on the TAO TETA benchmark, the open-vocabulary MOT benchmark, and the BDD100K MOT and MOTS benchmarks, please refer to the TETA repo for quick evaluation.

Abstract

The ability to recognize, localize and track dynamic objects in a scene is fundamental to many real-world applications, such as self-driving and robotic systems. Yet, traditional multiple object tracking (MOT) benchmarks rely only on a few object categories that hardly represent the multitude of possible objects that are encountered in the real world. This leaves contemporary MOT methods limited to a small set of pre-defined object categories. In this paper, we address this limitation by tackling a novel task, open-vocabulary MOT, that aims to evaluate tracking beyond pre-defined training categories. We further develop OVTrack, an open-vocabulary tracker that is capable of tracking arbitrary object classes. Its design is based on two key ingredients: First, leveraging vision-language models for both classification and association via knowledge distillation; second, a data hallucination strategy for robust appearance feature learning from denoising diffusion probabilistic models. The result is an extremely data-efficient open-vocabulary tracker that sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark, while being trained solely on static images.

OVTrack

<img src="figures/teaser.png" width="500">

We approach the task of open-vocabulary multiple object tracking. During training, we leverage vision-language (VL) models both for generating samples and for knowledge distillation. During testing, we track both base classes and novel classes, the latter unseen during training, by querying a vision-language model.
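As an illustration of what "querying a vision-language model" can look like at test time, the sketch below classifies a detected region by comparing its embedding against text embeddings of the class names. This is a minimal, generic sketch and not OVTrack's actual code: the function names, the temperature value, and the toy 3-dimensional embeddings are all assumptions made for the example. In practice the text embeddings would come from a VL text encoder and the region embedding from the detector.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify_region(region_embedding, text_embeddings, temperature=0.01):
    """Return the best-matching class name and its probability.

    text_embeddings maps class name -> embedding. The class list can be
    changed at test time without retraining, which is what makes the
    classifier open-vocabulary. (Hypothetical helper, assumed temperature.)
    """
    names = list(text_embeddings)
    sims = [cosine_similarity(region_embedding, text_embeddings[n]) for n in names]
    probs = softmax(sims, temperature)
    best = max(range(len(names)), key=lambda i: probs[i])
    return names[best], probs[best]


# Toy embeddings standing in for real text-encoder outputs (assumption).
text_embeddings = {
    "car": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "unicycle": [0.0, 0.0, 1.0],  # a "novel" class added only at test time
}
region_embedding = [0.1, 0.2, 0.9]

label, prob = classify_region(region_embedding, text_embeddings)
print(label)
```

Adding or removing entries in `text_embeddings` changes the set of trackable classes without touching the rest of the pipeline.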

Generative VL model

<img src="figures/diffusion-pipeline.png" width="800">

Discriminative VL model

<img src="figures/inference_pipeline.png" width="800">

Main results

Our method outperforms the state of the art on the BDD100K and TAO benchmarks.

TETA benchmark

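| Method | Backbone | Pretrain | TETA | LocA | AssocA | ClsA | Config | Model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QDTrack (CVPR 2021) | ResNet-101 | ImageNet-1K | 30.0 | 50.5 | 27.4 | 12.1 | - | - |
| TETer | ResNet-101 | ImageNet-1K | 33.3 | 51.6 | 35.0 | 13.2 | - | - |
| OVTrack | ResNet-50 | ImageNet-1K | 34.7 | 49.3 | 36.7 | 18.1 | cfg | google drive |
| OVTrack (dynamic rcnn threshold) | ResNet-50 | ImageNet-1K | 36.2 | 53.8 | 37.3 | 17.4 | cfg | google drive |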

Note: The result with the dynamic rcnn threshold is obtained by setting `model.roi_head.dynamic_rcnn_thre = True` in the config file. This dynamically adjusts the rcnn score threshold based on the number of classes of interest to track. The model is the same as the one without the dynamic rcnn threshold; the only difference is the rcnn score threshold used during inference.
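The TETA column above is consistent with the arithmetic mean of the three sub-scores LocA, AssocA and ClsA. The short check below recomputes it from the table rows, assuming that definition. The function and variable names are illustrative only, and the TETA repo remains the reference for the actual evaluation.

```python
def teta(loc_a, assoc_a, cls_a):
    """Mean of the localization, association and classification scores."""
    return (loc_a + assoc_a + cls_a) / 3.0


# (LocA, AssocA, ClsA, reported TETA), copied from the table above.
rows = {
    "QDTrack": (50.5, 27.4, 12.1, 30.0),
    "TETer": (51.6, 35.0, 13.2, 33.3),
    "OVTrack": (49.3, 36.7, 18.1, 34.7),
    "OVTrack (dynamic rcnn threshold)": (53.8, 37.3, 17.4, 36.2),
}

for name, (loc_a, assoc_a, cls_a, reported) in rows.items():
    computed = teta(loc_a, assoc_a, cls_a)
    print(f"{name}: computed {computed:.1f}, reported {reported}")
```

Read this way, the dynamic threshold raises TETA mainly through LocA (49.3 to 53.8), while ClsA drops slightly (18.1 to 17.4).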

TAO benchmark

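| Method | Backbone | Track AP50 | Track AP75 | Track AP | Config | Model |
| --- | --- | --- | --- | --- | --- | --- |
| SORT-TAO (ECCV 2020) | ResNet-101 | 13.2 | - | - | - | - |
| QDTrack (CVPR 2021) | ResNet-101 | 15.9 | 5 | 10.6 | - | - |
| GTR (CVPR 2022) | ResNet-101 | 20.4 | - | - | - | - |
| TAC (ECCV 2022) | ResNet-101 | 17.7 | 5.8 | 7.3 | - | - |
| BIV (ECCV 2022) | ResNet-101 | 19.6 | 7.3 | 13.6 | - | - |
| OVTrack | ResNet-50 | 21.2 | 10.6 | 15.9 | cfg | google drive |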

Open-vocabulary Results (val set)

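| Method | Classes Base | Classes Novel | Data LVIS | Data TAO | Base TETA | Novel TETA | Config | Model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QDTrack | | | | | 27.1 | 22.5 | - | - |
| TETer | | | | | 30.3 | 25.7 | - | - |
| DeepSORT (ViLD) | | | | | 26.9 | 21.1 | - | - |
| Tracktor++ (ViLD) | | | | | 28.3 | 22.7 | - | - |
| OVTrack | | | | | 35.5 | 27.8 | cfg | google drive |
| OVTrack (dynamic rcnn threshold) | | | | | 37.1 | 28.8 | cfg | google drive |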

Note: The result with the dynamic rcnn threshold is obtained by setting `model.roi_head.dynamic_rcnn_thre = True` in the config file. This dynamically adjusts the rcnn score threshold based on the number of classes of interest to track. The model is the same as the one without the dynamic rcnn threshold; the only difference is the rcnn score threshold used during inference.
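The setting named in the note can be pictured as a nested key in the config. The sketch below uses a plain Python dict as a stand-in for the loaded config and a small helper to set a dotted key. Only the key `model.roi_head.dynamic_rcnn_thre` comes from this README; the helper name, the dict layout, and the placeholder `type` value are assumptions for illustration, so check the provided cfg files for the real structure.

```python
def set_by_dotted_key(cfg, dotted_key, value):
    """Set cfg["a"]["b"]["c"] = value given the dotted key "a.b.c".

    Intermediate dicts are created if missing. Hypothetical helper,
    not part of the OVTrack code base.
    """
    keys = dotted_key.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return cfg


# Hypothetical stand-in for a loaded config; the "type" value is a placeholder.
cfg = {"model": {"roi_head": {"type": "PlaceholderRoIHead"}}}

# Enable the dynamic rcnn score threshold for inference.
set_by_dotted_key(cfg, "model.roi_head.dynamic_rcnn_thre", True)

print(cfg["model"]["roi_head"]["dynamic_rcnn_thre"])  # prints True
```

Because the weights are unchanged, switching this flag only affects inference, so the same checkpoint can be evaluated in both modes.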

Installation

Please refer to INSTALL.md for installation instructions.

Usage

The repo is still under construction. Please refer to GET_STARTED.md for dataset preparation and running instructions.

Cite OVTrack

@inproceedings{li2023ovtrack,
  title={OVTrack: Open-Vocabulary Multiple Object Tracking},
  author={Li, Siyuan and Fischer, Tobias and Ke, Lei and Ding, Henghui and Danelljan, Martin and Yu, Fisher},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5567--5577},
  year={2023}
}

Acknowledgement