Frame-Event Alignment and Fusion Network for High Frame Rate Tracking
<p align="center"> <a href="https://youtu.be/W7EjOiGMiAQ"> <img src="./figure/youtube.png" alt="youtube_video" width="800"/> </a> </p>

This is the code for the CVPR 2023 paper Frame-Event Alignment and Fusion Network for High Frame Rate Tracking (PDF) by Jiqing Zhang, Yuanchen Wang, Wenxi Liu, Meng Li, Jinpeng Bai, Baocai Yin, and Xin Yang.
If you use any of this code, please cite the following publication:
@InProceedings{Zhang_2023_CVPR,
author = {Zhang, Jiqing and Wang, Yuanchen and Liu, Wenxi and Li, Meng and Bai, Jinpeng and Yin, Baocai and Yang, Xin},
title = {Frame-Event Alignment and Fusion Network for High Frame Rate Tracking},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {9781-9790}
}
Abstract
Most existing RGB-based trackers target low frame rate benchmarks of around 30 frames per second. This setting restricts the tracker's functionality in the real world, especially for fast motion. Event-based cameras as bio-inspired sensors provide considerable potential for high frame rate tracking due to their high temporal resolution. However, event-based cameras cannot offer fine-grained texture information like conventional cameras. This unique complementarity motivates us to combine conventional frames and events for high frame rate object tracking under various challenging conditions. In this paper, we propose an end-to-end network consisting of multi-modality alignment and fusion modules to effectively combine meaningful information from both modalities at different measurement rates. The alignment module is responsible for cross-style and cross-frame-rate alignment between frame and event modalities under the guidance of the moving cues furnished by events, while the fusion module emphasizes valuable features and suppresses noise information through the mutual complement between the two modalities. Extensive experiments show that the proposed approach outperforms state-of-the-art trackers by a significant margin in high frame rate tracking. With the FE240hz dataset, our approach achieves high frame rate tracking up to 240Hz.
<div align="center"> <table> <tr> <td> <img src="./figure/fe-hdr.gif" alt="FE240hz HDR"> </td> <td> <img src="./figure/fe-severe.gif" alt="FE240hz severe motion"> </td> </tr> <tr> <td align="center">FE240hz Dataset: HDR</td> <td align="center">FE240hz Dataset: Severe Motion</td> </tr> </table> </div> <div align="center"> <table> <tr> <td> <img src="./figure/vis-rigid.gif" alt="VisEvent rigid object"> </td> <td> <img src="./figure/vis-nonrigid.gif" alt="VisEvent non-rigid object"> </td> </tr> <tr> <td align="center">VisEvent Dataset: Rigid Object</td> <td align="center">VisEvent Dataset: Non-Rigid Object</td> </tr> </table> </div>

Content
This document describes the usage and installation for this repository.<br>
- Installation<br>
- Preparing Dataset<br>
- Training<br>
- Evaluation<br>
- Acknowledgments<br>
Installation
The code is based on pytracking and tested on Ubuntu 20.04 with Python 3.8 and PyTorch 1.8.1.
- We recommend using conda to build the environment:
conda create -n <env_name> python=3.8
- Install the dependent packages:
pip install -r requirements.txt
- Install deformable convolution according to EDVR:
python setup.py develop
Preparing Dataset
We evaluate AFNet on two datasets: FE240hz and VisEvent. The FE240hz dataset provides annotation frequencies as high as 240 Hz; with this dataset, our method achieves high frame rate tracking at 240Hz. Compared with FE240hz, VisEvent provides a 25Hz annotation frequency and contains various rigid and non-rigid targets both indoors and outdoors.
- For the FE240hz dataset, we split a sequence into multiple subsequences of length 2000 to avoid the sequence being too long. We accumulate events using this file.
- For the VisEvent dataset, we remove sequences that miss event data or have misaligned timestamps, leaving 205 sequences for training and 172 for testing. We adopt a different event representation method for VisEvent to validate the generalization of AFNet: <img src="./figure/stack-vis.png" alt="event representation for VisEvent" width="50%">
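For orientation, accumulating an event stream into a fixed-rate frame representation typically means binning events by pixel and polarity. The sketch below is a simplified, hypothetical illustration only; the exact representations used for FE240hz and VisEvent are defined by the accumulation script and figure referenced above, and all names here are illustrative:

```python
import numpy as np

def accumulate_events(events, height, width):
    """Accumulate a chunk of events into a 2-channel count frame,
    one channel per polarity. `events` is an (N, 4) array of rows
    (timestamp, x, y, polarity). A simplified sketch only."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    xs = events[:, 1].astype(int)
    ys = events[:, 2].astype(int)
    ps = (events[:, 3] > 0).astype(int)   # 0 = negative, 1 = positive polarity
    np.add.at(frame, (ps, ys, xs), 1.0)   # count events per pixel and polarity
    return frame
```

Because the accumulation window can be chosen freely, the event branch can be sampled at a higher rate than the 30fps frame branch, which is what enables tracking beyond the conventional frame rate.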
Training
- Change to the training directory and set --workspace_dir and --data_dir in ./admin/local.py:
cd ltr
- Train AFNet:
python run_training.py afnet afnet
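In pytracking-style code, ./admin/local.py usually holds an environment-settings class whose attributes point at your local paths. The exact attribute set in this repository may differ; the paths below are placeholders to be replaced with your own:

```python
class EnvironmentSettings:
    def __init__(self):
        # Base directory where checkpoints and training logs are written.
        self.workspace_dir = '/path/to/workspace'
        # Root directory of the training data (FE240hz / VisEvent).
        self.data_dir = '/path/to/datasets'
```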
Evaluation
- Change to the tracking directory:
cd pytracking
- Change your local paths in ./evaluation/local.py.
- Run the tracker:
python run_tracker.py dimp afnet --dataset eotb --sequence val --epochname your_checkpoint.pth.tar
The predicted bounding boxes are saved in ./tracking_result.
- Predicted bounding box format: an N×4 matrix, with each row giving the object location [xmin, ymin, width, height] in one event frame.
Acknowledgments
- Thanks to visionml/pytracking for the great tracking codebase.
- Thanks to EDVR for the great deformable convolution implementation.