Correlation-Aware Deep Tracking (SBT)
Fei Xie, Chunyu Wang, Guangting Wang, Yue Cao, Wankou Yang, Wenjun Zeng
:star: This is the official reproduction of our CVPR 2022 work "Correlation-Aware Deep Tracking".
:star: For our improved single-branch tracking model, SuperSBT, please refer to its separate GitHub repository!
Figure 1: (a1) standard Siamese-like feature extraction; (a2) our target-dependent feature extraction; (b1) correlation step, such as Siamese cropping correlation [23], DCF [11] and Transformer-based correlation [5]; (b2) our pipeline removes the separate correlation step; (c) prediction stage; (d1)/(d2) are the t-SNE [38] visualizations of search features in (a1)/(a2) as the feature networks go deeper.
Figure 2: (a) architecture of our proposed Single Branch Transformer (SBT) for tracking. Different from Siamese, DCF and Transformer-based methods, it does not have a standalone module for computing correlation. Instead, it embeds correlation in all cross-attention layers, which exist at different levels of the network. The fully fused features of the search image are directly fed to the Classification Head (Cls Head) and Regression Head (Reg Head) to obtain the localization and size embedding maps. (b) shows the structure of an Extract-or-Correlation (EoC) block. (c) shows the difference between EoC-SA and EoC-CA. PaE denotes patch embedding; LN denotes layer normalization.
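To make the EoC-CA idea in Figure 2(c) concrete, here is a minimal PyTorch sketch of a cross-attention block in which search-image tokens attend to template tokens. It is only an illustration under our own naming assumptions (`EoCCrossAttention`, `dim`, `num_heads`); it is not the implementation used in this repository.

```python
# Minimal sketch of an EoC-CA-style cross-attention block (illustrative only;
# module/argument names are our assumptions, not this repository's code).
# Note: batch_first in nn.MultiheadAttention requires PyTorch >= 1.9.
import torch
import torch.nn as nn

class EoCCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, search_tokens, template_tokens):
        # Cross-attention: search tokens query template tokens, so target
        # information is injected into the search features at this layer.
        q = self.norm1(search_tokens)
        kv = self.norm1(template_tokens)
        attn_out, _ = self.attn(query=q, key=kv, value=kv)
        search_tokens = search_tokens + attn_out
        search_tokens = search_tokens + self.mlp(self.norm2(search_tokens))
        return search_tokens

# Tokens are (batch, num_tokens, dim) patch embeddings of the two images.
block = EoCCrossAttention(dim=256, num_heads=8)
search = torch.randn(1, 400, 256)    # e.g. 20x20 search-region tokens
template = torch.randn(1, 64, 256)   # e.g. 8x8 template tokens
fused = block(search, template)      # correlation-aware search features
```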
Abstract
Robustness and discrimination power are two fundamental requirements in visual object tracking. In most tracking paradigms, we find that the features extracted by the popular Siamese-like networks cannot fully discriminate the tracked targets from distractor objects, hindering them from simultaneously meeting these two requirements. While most methods focus on designing robust correlation operations, we propose a novel target-dependent feature network inspired by the self-/cross-attention scheme. In contrast to Siamese-like feature extraction, our network deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it is able to suppress non-target features, resulting in instance-varying feature extraction. The output features of the search image can be directly used for predicting target locations without an extra correlation step. Moreover, our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than existing methods. Extensive experiments show that our method achieves state-of-the-art results while running in real time. Our feature network can also be applied seamlessly to existing tracking pipelines to improve their performance.
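Because correlation is already embedded in the backbone, the prediction stage only has to read out the fused search features. The sketch below illustrates this idea with hypothetical 1x1-conv heads; the head designs, channel sizes and names are our assumptions, not the actual heads used by SBT.

```python
# Illustrative prediction stage: fused search features go straight to the heads,
# without any standalone correlation module. Shapes and names are assumptions.
import torch
import torch.nn as nn

class SimpleHeads(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.cls_head = nn.Conv2d(dim, 1, kernel_size=1)  # localization score map
        self.reg_head = nn.Conv2d(dim, 4, kernel_size=1)  # size/offset embedding map

    def forward(self, fused_search_feat):  # (B, C, H, W), already target-aware
        return self.cls_head(fused_search_feat), self.reg_head(fused_search_feat)

heads = SimpleHeads(dim=256)
feat = torch.randn(1, 256, 20, 20)   # fused search-region feature map
cls_map, reg_map = heads(feat)       # (1, 1, 20, 20) and (1, 4, 20, 20)
```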
Model files and results
Models and raw results can be downloaded from Baidu NetDisk (password: ne0x):
[Models, raw results and training logs (password: ne0x)]
Results
We obtain state-of-the-art results on several benchmarks while running at high speed. More results are coming soon.
<table>
  <tr>
    <th>Model</th> <th>GOT-10k<br>AO (%)</th> <th>GOT-10k<br>SR0.5 (%)</th> <th>GOT-10k<br>SR0.75 (%)</th> <th>Speed<br>(fps)</th> <th>Params<br>(M)</th>
  </tr>
  <tr>
    <td>SBT-base</td> <td>69.7</td> <td>79.9</td> <td>64.1</td> <td>40</td> <td>25.1</td>
  </tr>
</table>

<table>
  <tr>
    <th>Model</th> <th>LaSOT<br>AUC (%)</th> <th>LaSOT<br>Precision (%)</th> <th>LaSOT<br>Norm. Precision (%)</th> <th>Speed<br>(fps)</th> <th>Params<br>(M)</th>
  </tr>
  <tr>
    <td>SBT-base</td> <td>68.0</td> <td>73.9</td> <td>77.8</td> <td>40</td> <td>25.1</td>
  </tr>
</table>

Install dependencies
- Docker image
We also provide a docker image for reproducing our results: jaffe03/dualtfrpp:latest
- Create and activate a conda environment
conda create -n SBT python=3.7
conda activate SBT
- Install PyTorch
conda install -c pytorch pytorch=1.6 torchvision=0.7.1 cudatoolkit=10.2
- Install other packages
conda install matplotlib pandas tqdm
pip install opencv-python tb-nightly visdom scikit-image tikzplotlib gdown
conda install cython scipy
sudo apt-get install libturbojpeg
pip install pycocotools jpeg4py
pip install wget yacs
pip install shapely==1.6.4.post2
pip install mmcv timm
- Setup the environment
Create the default environment setting files.
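This codebase follows the PyTracking/STARK family, where the default setting files store local dataset and workspace paths. The sketch below only indicates the kind of fields such a file typically contains; the actual generated files in this repository may use different attribute names and locations.

```python
# Hypothetical local environment settings (paths and attribute names are
# placeholders; check the files generated by this repository for the real ones).
class EnvironmentSettings:
    def __init__(self):
        self.workspace_dir = '/path/to/SBT'                  # checkpoints and logs
        self.lasot_dir = '/path/to/data/lasot'               # LaSOT training split
        self.got10k_dir = '/path/to/data/got10k'             # GOT-10k training split
        self.coco_dir = '/path/to/data/coco'                 # COCO 2017
        self.trackingnet_dir = '/path/to/data/trackingnet'   # TrackingNet
```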
For training
- Full dataset training (LaSOT, GOT-10k, COCO, TrackingNet):
python -m torch.distributed.launch --nproc_per_node 8 lib/train/run_training_sbt.py --script sbt --config sbt_base --save_dir ./
- GOT-10k-only training (GOT-10k):
python -m torch.distributed.launch --nproc_per_node 8 lib/train/run_training_sbt.py --script sbt --config sbt_base_got --save_dir ./
For testing
- For example, on the LaSOT test set:
python ./tracking/test.py --tracker_name sbt --tracker_param sbt_base --dataset lasot --threads 0
python ./tracking/analysis_results_ITP.py --script sbt --config sbt_base
Acknowledgement
This is a modified version of the Python framework PyTracking, based on PyTorch, and it also borrows from PySOT, the GOT-10k toolkit, and Vision Transformer variants such as Swin Transformer, PVT, and Twins. We would like to thank their authors for providing great code and frameworks.
Contacts
- Fei Xie, Shanghai Jiao Tong University, China, 372998044@qq.com
Citing SBT
If you find SBT useful in your research, please consider citing:
@inproceedings{xie2022sbt,
title={Correlation-aware deep tracking},
author={Xie, Fei and Wang, Chunyu and Wang, Guangting and Cao, Yue and Yang, Wankou and Zeng, Wenjun},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2022}
}