<div align="center"> <img src="https://github.com/983632847/Awesome-Multimodal-Object-Tracking/blob/main/MMOT.png" width="600">

Awesome Multi-modal Object Tracking (MMOT)


<p align="center"> </p> </div>

A continuously updated project to track the latest progress in multi-modal object tracking.

If this repository inspires your work, we would be greatly honored.

If you like our project, please give us a star ⭐ on GitHub.

If you have any suggestions, please feel free to contact us at andyzhangchunhui@gmail.com.

We welcome other researchers to submit pull requests and become contributors to this project.

:collision: Highlights

Last Updated

Contents

Citation

If you find our work useful in your research, please consider citing:

@article{zhang2024awesome,
  title={Awesome Multi-modal Object Tracking},
  author={Zhang, Chunhui and Liu, Li and Wen, Hao and Zhou, Xi and Wang, Yanfeng},
  journal={arXiv preprint arXiv:2405.14200},
  year={2024}
}

Survey

Vision-Language Tracking

Datasets

| Dataset | Pub. & Date | Website | Introduction |
| --- | --- | --- | --- |
| OTB99-L | CVPR-2017 | OTB99-L | 99 videos |
| LaSOT | CVPR-2019 | LaSOT | 1400 videos |
| LaSOT_EXT | IJCV-2021 | LaSOT_EXT | 150 videos |
| TNL2K | CVPR-2021 | TNL2K | 2000 videos |
| WebUAV-3M | TPAMI-2023 | WebUAV-3M | 4500 videos, 3.3 million frames, UAV tracking, vision-language-audio |
| MGIT | NeurIPS-2023 | MGIT | 150 long video sequences, 2.03 million frames, three semantic grains (i.e., action, activity, and story) |
| VastTrack | NeurIPS-2024 | VastTrack | 50,610 video sequences, 4.2 million frames, 2,115 classes |
| WebUOT-1M | NeurIPS-2024 | WebUOT-1M | The first million-scale underwater object tracking dataset, with 1,500 video sequences and 1.1 million frames |
| ElysiumTrack-1M | ECCV-2024 | ElysiumTrack-1M | A large-scale dataset with 1.27 million videos that supports three tasks: single object tracking, reference single object tracking, and video reference expression generation |
| VLT-MI | arXiv-2024 | - | A dataset for multi-round, multi-modal interaction, with 3,619 videos |
| UW-COT | arXiv-2024 | UW-COT | The first underwater camouflaged object tracking dataset, with 220 videos |
| DTVLT | arXiv-2024 | DTVLT | A multi-modal diverse text benchmark for visual language tracking (RGBL Tracking) |
| SemTrack | ECCV-2024 | SemTrack | A large-scale dataset comprising 6.7 million frames from 6,961 videos, capturing the semantic trajectory of targets across 52 interaction classes and 115 object classes |

Papers

2024

2023

2022

2021

2019

2017

RGBE Tracking

Datasets

| Dataset | Pub. & Date | Website | Introduction |
| --- | --- | --- | --- |
| FE108 | ICCV-2021 | FE108 | 108 event videos |
| COESOT | arXiv-2022 | COESOT | 1354 RGB-event video pairs |
| VisEvent | TC-2023 | VisEvent | 820 RGB-event video pairs |
| EventVOT | CVPR-2024 | EventVOT | 1141 event videos |
| CRSOT | arXiv-2024 | CRSOT | 1030 RGB-event video pairs |
| FELT | arXiv-2024 | FELT | 742 RGB-event video pairs |
| MEVDT | arXiv-2024 | MEVDT | 63 multimodal sequences with 13k images, 5M events, 10k object labels and 85 trajectories |

Papers

2024

2023

2022

2021

RGBD Tracking

Datasets

| Dataset | Pub. & Date | Website | Introduction |
| --- | --- | --- | --- |
| PTB | ICCV-2013 | PTB | 100 sequences |
| STC | TC-2018 | STC | 36 sequences |
| CDTB | ICCV-2019 | CDTB | 80 sequences |
| VOT-RGBD 2019/2020/2021 | ICCVW-2019 | VOT-RGBD 2019 | VOT-RGBD 2019, 2020, and 2021 are based on CDTB |
| DepthTrack | ICCV-2021 | DepthTrack | 200 sequences |
| VOT-RGBD 2022 | ECCVW-2022 | VOT-RGBD 2022 | VOT-RGBD 2022 is based on CDTB and DepthTrack |
| RGBD1K | AAAI-2023 | RGBD1K | 1,050 sequences, 2.5M frames |
| DTTD | CVPR Workshops-2023 | DTTD | 103 scenes, 55691 frames |
| ARKitTrack | CVPR-2023 | ARKitTrack | 300 RGB-D sequences, 455 targets, 229.7K video frames |

Papers

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

RGBT Tracking

Datasets

| Dataset | Pub. & Date | Website | Introduction |
| --- | --- | --- | --- |
| GTOT | TIP-2016 | GTOT | 50 video pairs, 15K frames |
| RGBT210 | ACM MM-2017 | RGBT210 | 210 video pairs |
| RGBT234 | PR-2018 | RGBT234 | 234 video pairs, an extension of RGBT210 |
| LasHeR | TIP-2021 | LasHeR | 1224 video pairs, 730K frames |
| VTUAV | CVPR-2022 | VTUAV | Visible-thermal UAV tracking, 500 sequences, 1.7 million high-resolution frame pairs |
| MV-RGBT | arXiv-2024 | MV-RGBT | 122 video pairs, 89.9K frames |

Papers

2024

2023

2022

2021

2020

2019

Miscellaneous

Datasets

| Dataset | Pub. & Date | Website | Introduction |
| --- | --- | --- | --- |
| WebUAV-3M | TPAMI-2023 | WebUAV-3M | 4500 videos, 3.3 million frames, UAV tracking, Vision-language-audio |
| UniMod1K | IJCV-2024 | UniMod1K | 1050 video pairs, 2.5 million frames, Vision-depth-language |

Papers

2024

2023

2022

Others

2024

Awesome Repositories for MMOT

License

This project is released under the MIT license. Please see the LICENSE file for more information.