


The official PyTorch implementation of our CVPR 2023 paper:

Generalized Relation Modeling for Transformer Tracking

Shenyuan Gao, Chunluan Zhou, Jun Zhang

[CVF Open Access] [ArXiv Preprint] [YouTube Video] [Trained Models] [Raw Results] [SOTA Paper List]


:bookmark:Brief Introduction

Compared with previous two-stream trackers, the recent one-stream tracking pipeline, which allows earlier interaction between the template and search region, has achieved a remarkable performance gain. However, existing one-stream trackers always let the template interact with all parts inside the search region throughout all the encoder layers. This could potentially lead to target-background confusion when the extracted feature representations are not sufficiently discriminative. To alleviate this issue, we propose generalized relation modeling (GRM) based on adaptive token division. The proposed method is a generalized formulation of attention-based relation modeling for Transformer tracking, which inherits the merits of both previous two-stream and one-stream pipelines whilst enabling more flexible relation modeling by selecting appropriate search tokens to interact with template tokens.
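To make the "generalized formulation" concrete, here is a minimal sketch of how a token division could translate into an attention mask. It is not the code from this repository. The function name `relation_mask`, the token ordering (template first, then search), and the exact attention rule are assumptions made for illustration. In GRM the division is chosen adaptively, whereas here the selected search tokens are simply passed in by hand.

```python
def relation_mask(num_template, num_search, selected):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Tokens are ordered as template tokens first, then search tokens.
    `selected` lists the search-token indices (0-based within the search
    region) that are allowed to interact with the template.
    Assumed rule (illustrative only): template<->template and search<->search
    attention is always allowed; template<->search attention is allowed only
    for the selected search tokens.
    """
    n = num_template + num_search
    selected = set(selected)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_is_template = q < num_template
            k_is_template = k < num_template
            if q_is_template == k_is_template:
                # Same group: always visible to each other.
                mask[q][k] = True
            else:
                # Cross-group pair: look up the search token involved.
                search_idx = (k if q_is_template else q) - num_template
                mask[q][k] = search_idx in selected
    return mask


# Selecting every search token gives full attention (one-stream behaviour).
one_stream = relation_mask(2, 3, range(3))
# Selecting none isolates template from search (two-stream behaviour).
two_stream = relation_mask(2, 3, [])
# Selecting a subset sits in between the two extremes.
mixed = relation_mask(2, 3, [1])
```

Under these assumptions, the two earlier pipelines appear as the extreme cases of the same mask, and any other division gives an intermediate form of relation modeling.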

:bookmark:Strong Performance

| Model Config | ViT-B, 256^2 resolution | ViT-B, 256^2 resolution | ViT-L, 320^2 resolution |
| --- | --- | --- | --- |
| Training Setting | only GOT, 100 epochs | 4 datasets, 300 epochs | 4 datasets, 300 epochs |
| GOT-10k (AO / SR 0.5 / SR 0.75) | 73.4 / 82.9 / 70.4 | - | - |
| LaSOT (AUC / Norm P / P) | - | 69.9 / 79.3 / 75.8 | 71.4 / 81.2 / 77.9 |
| TrackingNet (AUC / Norm P / P) | - | 84.0 / 88.7 / 83.3 | 84.4 / 88.9 / 84.0 |
| AVisT (AUC / OP50 / OP75) | - | 54.5 / 63.1 / 45.2 | 55.1 / 63.8 / 46.9 |
| NfS30 (AUC) | - | 65.6 | 66.0 |
| UAV123 (AUC) | - | 70.2 | 72.2 |

:bookmark:Inference Speed

Our baseline model (backbone: ViT-B, resolution: 256x256) can run at 45 fps (frames per second) on a single NVIDIA GeForce RTX 3090.

:bookmark:Training Cost

It takes less than half a day to train our baseline model for 300 epochs on eight NVIDIA GeForce RTX 3090 GPUs (24 GB of memory each).


Trained Models (including the baseline model GRM, GRM-GOT and a stronger variant GRM-L320) [download zip file]

Raw Results (including raw tracking results on six datasets we benchmarked in the paper and listed above) [download zip file]

Download both zip files and unzip them into the output directory under the GRM project path. Our code can then use both of them directly.
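If you prefer to script the unpacking step, a minimal sketch using only the Python standard library could look like the following. The helper name `unpack_into_output` and the archive file names in the usage comment are hypothetical, and the sketch assumes the target directory is literally named `output` inside the project root.

```python
import zipfile
from pathlib import Path


def unpack_into_output(zip_paths, project_root):
    """Extract each archive into <project_root>/output and return that directory."""
    # Assumption: the directory the code reads from is named "output".
    output_dir = Path(project_root) / "output"
    output_dir.mkdir(parents=True, exist_ok=True)
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(output_dir)
    return output_dir


# Example usage (hypothetical file names; replace with the downloaded archives):
# unpack_into_output(["trained_models.zip", "raw_results.zip"], "/path/to/GRM")
```

Inspect the resulting directory layout afterwards, since this sketch does not check the internal structure of the archives.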

Let's Get Started


:heart::heart::heart:Our idea is implemented based on the following projects. We really appreciate their excellent open-source work!


If any parts of our paper and code help your research, please consider citing us and giving a star to our repository.

  title={Generalized Relation Modeling for Transformer Tracking},
  author={Gao, Shenyuan and Zhou, Chunluan and Zhang, Jun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},


If you have any questions or concerns, feel free to open an issue or contact me directly through the channels listed on my GitHub homepage. Suggestions and collaborations are also highly welcome!