...Repository still under construction...

# Predicting the Best of the N Visual Trackers

<p align="center"> <a href="https://arxiv.org/abs/2407.15707"><img src="https://img.shields.io/badge/arXiv-Paper_Link-blue"></a> </p>

*Figure: Structure of the proposed BofN meta-tracker.*

## Abstract

We observe that the performance of SOTA visual trackers varies surprisingly strongly across different video attributes and datasets. No single tracker remains the best performer across all tracking attributes and datasets. To bridge this gap, for a given video sequence, we predict the "Best of the N Trackers", called the BofN meta-tracker. At its core, a Tracking Performance Prediction Network (TP2N) selects the predicted best-performing visual tracker for the given video sequence using only a few initial frames. We also introduce a frame-level BofN meta-tracker, which keeps predicting the best performer at regular temporal intervals. The TP2N is based on the self-supervised learning architectures MocoV2, SwAV, BT, and DINO; experiments show that DINO with a ViT-S backbone performs best. The video-level BofN meta-tracker outperforms, by a large margin, existing SOTA trackers on nine standard benchmarks: LaSOT, TrackingNet, GOT-10k, VOT2019, VOT2021, VOT2022, UAV123, OTB100, and WebUAV-3M. Further improvement is achieved by the frame-level BofN meta-tracker, which effectively handles variations in the tracking scenarios within long sequences. For instance, on GOT-10k, the BofN meta-tracker average overlap is 88.7% and 91.1% with the video- and frame-level settings, respectively, while the best single tracker, RTS, achieves 85.20% AO. On VOT2022, the BofN expected average overlap is 67.88% and 70.98% with the video- and frame-level settings, compared to 64.12% for the best performing tracker, ARTrack. This work also presents an extensive evaluation of competitive tracking methods on all commonly used benchmarks, following their protocols.
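
For a rough picture of what the TP2N does, the sketch below pairs a frozen DINO ViT-S backbone (the best-performing choice reported above, loadable via torch.hub) with a small classification head over the N base trackers. The head, the pooling over initial frames, and the function name are illustrative assumptions, not the repository's code.

```python
import torch
import torch.nn as nn

# Load a DINO ViT-S/16 backbone via torch.hub (as used in the paper).
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
backbone.eval()  # features stay frozen; only the head would be trained

N_TRACKERS = 7                     # ARDiMP ... SparseTT
head = nn.Linear(384, N_TRACKERS)  # ViT-S CLS embedding dim is 384

def predict_best_tracker(frames: torch.Tensor) -> int:
    """frames: (T, 3, 224, 224) batch of the first few video frames."""
    with torch.no_grad():
        feats = backbone(frames)      # (T, 384) CLS features
    logits = head(feats.mean(dim=0))  # pool over the initial frames
    return int(logits.argmax())       # index of the predicted-best tracker
```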

## Methodology

Please find the details in our paper, which can be accessed [here](https://arxiv.org/abs/2407.15707).

This work utilizes the following trackers, among others:

ARDiMP | KeepTrack | STMTrack | TransT | ToMP | RTS | SparseTT
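
At inference time, the video-level BofN idea is simple: run the TP2N on a few initial frames, pick the predicted-best tracker, and use it for the rest of the sequence. The sketch below is a hypothetical illustration; `tp2n`, `load_tracker`, and the tracker interface are assumed names, not this repository's API.

```python
# `tp2n`, `load_tracker`, and the tracker interface below are assumed
# placeholder names for illustration, not this repository's actual API.

def bofn_track(frames, init_box, tracker_names, tp2n, n_init=5):
    """Video-level BofN: pick one tracker from the first few frames,
    then run it on the whole sequence."""
    scores = tp2n.predict(frames[:n_init])  # one score per tracker
    best = tracker_names[max(range(len(scores)), key=scores.__getitem__)]

    tracker = load_tracker(best)            # e.g. "KeepTrack"
    tracker.initialize(frames[0], init_box)
    return [tracker.track(f) for f in frames[1:]]
```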

## Results


## Environment Setup

1. Create the Python environment:

```bash
conda create -y --name n_trackers python==3.7.16
conda activate n_trackers
```

2. Install PyTorch and torchvision:

```bash
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
```

3. Install the remaining packages:

```bash
pip install -r requirements.txt
```
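
Optionally, a quick sanity check that the CUDA build installed correctly (this check is an addition, not part of the original steps):

```python
import torch
import torchvision

# Expect 1.10.0+cu111 / 0.11.0+cu111 and True if a GPU is visible.
print(torch.__version__, torchvision.__version__, torch.cuda.is_available())
```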

## Training the Tracker Predictor

1. The following datasets were used to train the tracker predictor: LaSOT, GOT-10k, and TrackingNet.

2. Download the datasets above and put them in the testing_datasets folder. For ease of training, some of the video folders have been renamed, especially for the TrackingNet dataset (see the excel file below).

3. The excel file all_train_LASOT_GOT10k_TrackingNet_new.xlsx contains the tracking success rate of each tracker on every video in these datasets. It is used to generate the predictor's training targets: for each video, the tracker with the highest success rate is labeled 1 and all others 0 (a minimal sketch of this labeling step follows the list).

4. Train the predictor with a ResNet backbone:

```bash
python classifier_all_data_1.py
```

5. Train the predictor with a Vision Transformer (ViT) backbone:

```bash
python classifier_all_data_2.py
```
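
For intuition, generating such one-hot labels from the spreadsheet could look like the following; the sheet layout assumed here (one row per video, one success-rate column per tracker) is a guess about the file, not a documented format.

```python
import pandas as pd

# Assumed layout: one row per video, one success-rate column per tracker.
df = pd.read_excel("all_train_LASOT_GOT10k_TrackingNet_new.xlsx", index_col=0)

# One-hot target: 1 for the tracker with the highest success rate, 0 otherwise.
best = df.idxmax(axis=1)
labels = pd.get_dummies(best).reindex(columns=df.columns, fill_value=0).astype(int)
```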

## Testing/Tracking with the Predicted Tracker

1. To track, first download the pre-trained models of the 7 base trackers from the links below and put them in the trained trackers folder.

    1. ARDiMP
    2. KeepTrack
    3. STMTrack
    4. TransT
    5. ToMP
    6. RTS
    7. SparseTT
2. To track, simply run the main_eval.py file. Tracking results will be written to the tracker_results folder.

```bash
python main_eval.py
```

NOTE: This will run all the base trackers and also run the predicted best of them on the videos, using both the ResNet and ViT predictor backbones.
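
The frame-level variant described in the abstract re-runs the predictor at regular temporal intervals and switches trackers when a different one is predicted best. Below is a hypothetical sketch of that loop, using the same assumed `tp2n`/`load_tracker` interface as the earlier sketch; none of these names are the repository's actual API.

```python
# Same assumed `tp2n` / `load_tracker` interface as the sketch above.

def bofn_track_frame_level(frames, init_box, tracker_names, tp2n,
                           interval=50, n_init=5):
    """Frame-level BofN: re-predict the best tracker every `interval`
    frames and hand over from the last predicted box when it changes."""
    argmax = lambda s: max(range(len(s)), key=s.__getitem__)

    current = tracker_names[argmax(tp2n.predict(frames[:n_init]))]
    tracker = load_tracker(current)
    tracker.initialize(frames[0], init_box)

    boxes = []
    for i, frame in enumerate(frames[1:], start=1):
        box = tracker.track(frame)
        boxes.append(box)
        if i % interval == 0:  # periodic re-prediction
            window = frames[max(0, i - n_init):i + 1]
            best = tracker_names[argmax(tp2n.predict(window))]
            if best != current:  # switch trackers, re-init from last box
                current = best
                tracker = load_tracker(current)
                tracker.initialize(frame, box)
    return boxes
```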

## Citation

If you find our work useful for your research, please consider citing:

@article{Alawode2024,
    archivePrefix = {arXiv},
    author = {Alawode, Basit and Javed, Sajid and Mahmood, Arif and Matas, Jiri},
    eprint = {2407.15707},
    number = {8},
    pages = {1--12},
    title = {{Predicting the Best of N Visual Trackers}},
    url = {http://arxiv.org/abs/2407.15707},
    volume = {14},
    year = {2024}
}