
Segment Anything for Videos: A Systematic Survey

The first survey on SAM for videos: Segment Anything for Videos: A Systematic Survey. Chunhui Zhang, Yawen Cui, Weilin Lin, Guanjie Huang, Yan Rong, Li Liu, Shiguang Shan. [ArXiv][ChinaXiv][ResearchGate][Project][Chinese Summary]

<p align="justify"> Abstract: The recent wave of foundation models has witnessed tremendous success in computer vision (CV) and beyond, with the segment anything model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (e.g., text-to-mask) tasks, but also in the video domain. Additionally, the recently released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks, while a comprehensive and in-depth review of the video domain is notably absent. To address this gap, this work conducts a systematic review of SAM for videos in the era of foundation models. As the first to review the progress of SAM for videos, this work focuses on its applications across various tasks, discussing recent advances and opportunities for developing foundation models with broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, comparative results of SAM-based and current state-of-the-art methods on representative benchmarks, along with insightful analysis, are offered. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for video and beyond. </p>

This project will be continuously updated. We expect to include more state-of-the-art methods on SAM for videos.

We welcome authors of related works to submit pull requests and become contributors to this project.

The first comprehensive SAM survey, A Comprehensive Survey on Segment Anything Model for Vision and Beyond, is available [here].

:fire: Highlights

Last Updated

- 2024.07.31: The first survey on SAM for videos went online.
- 2024.07.29: SAM 2 was released.

Citation

If you find our work useful in your research, please consider citing:

@article{chunhui2024samforvideos,
  title={Segment Anything for Videos: A Systematic Survey},
  author={Zhang, Chunhui and Cui, Yawen and Lin, Weilin and Huang, Guanjie and Rong, Yan and Liu, Li and Shan, Shiguang},
  journal={arXiv},
  year={2024}
}

Contents

Video Segmentation

Video Object Segmentation

:boom:Jiahe Yue, Runchu Zhang, Zhe Zhang, Ruixiang Zhao, Wu Lv, Jie Ma.<br /> "How SAM helps Unsupervised Video Object Segmentation?" IJCNN (2024). [paper] [2024.10]

:boom:VideoSAM: Pinxue Guo, Zixu Zhao, Jianxiong Gao, Chongruo Wu, Tong He, Zheng Zhang, Tianjun Xiao, Wenqiang Zhang.<br /> "VideoSAM: Open-World Video Segmentation." ArXiv (2024). [paper] [2024.10]

:boom:SAM2Long: Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, Jiaqi Wang.<br /> "SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree." ArXiv (2024). [paper] [code] [2024.10]

:boom:VideoSAM: Chika Maduabuchi, Ericmoore Jossou, Matteo Bucci.<br /> "VideoSAM: A Large Vision Foundation Model for High-Speed Video Segmentation." ArXiv (2024). [paper] [code] [2024.10]

:boom:MUG-VOS: Sangbeom Lim, Seongchan Kim, Seungjun An, Seokju Cho, Paul Hongsuck Seo, Seungryong Kim.<br /> "Multi-Granularity Video Object Segmentation." ArXiv (2024). [paper] [code] [2024.12]

:boom:SAMs-CDConcepts-Eval: Xiaoqi Zhao, Youwei Pang, Shijie Chang, Yuan Zhao, Lihe Zhang, Huchuan Lu, Jinsong Ouyang, Georges El Fakhri, Xiaofeng Liu.<br /> "Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes." ArXiv (2024). [paper] [code] [2024.12]

Video Semantic Segmentation

Video Instance Segmentation

Video Panoptic Segmentation

3D Video Segmentation

Audio-Visual Segmentation

Referring Video Object Segmentation

:boom:SAMWISE: Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta.<br /> "SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation." ArXiv (2024). [paper] [code] [2024.11]

:boom:SOLA: Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim.<br /> "Referring Video Object Segmentation via Language-aligned Track Selection." ArXiv (2024). [paper] [code] [2024.12]

Universal Segmentation

Video Detection and Recognition

Video Object Detection

Video Shadow Detection

Video Camouflaged Object Detection

Deepfake Detection

Video Anomaly Detection

Video Salient Object Detection

Event Detection

Action Recognition

Video Activity and Scene Classification

Video Object Tracking

General Object Tracking

:boom:SAMURAI: Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang.<br /> "SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory." ArXiv (2024). [paper] [code] [2024.11]

:boom:GCR: Kuiran Wang, Xuehui Yu, Wenwen Yu, Guorong Li, Xiangyuan Lan, Qixiang Ye, Jianbin Jiao, Zhenjun Han.<br /> "Click; Single Object Tracking; Video Object Segmentation; Real-time Interaction." ArXiv (2024). [paper] [2024.11]

:boom:DAM4SAM: Jovana Videnovic, Alan Lukezic, Matej Kristan.<br /> "A Distractor-Aware Memory for Visual Object Tracking with SAM2." ArXiv (2024). [paper] [code] [2024.11]

:boom:Huanlong Zhang, Xiangbo Yang, Xin Wang, Weiqiang Fu, Bineng Zhong, Jianwei Zhang.<br /> "SAM-Assisted Temporal-Location Enhanced Transformer Segmentation for Object Tracking with Online Motion Inference." Neurocomputing (2024). [paper] [2024.11]

:boom:TABE: Finlay G. C. Hudson, William A. P. Smith.<br /> "Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation." ArXiv (2024). [paper] [2024.11]

:boom:Det-SAM2: Zhiting Wang, Qiangong Zhou, Zongyang Liu.<br /> "Det-SAM2: Technical Report on the Self-Prompting Segmentation Framework Based on Segment Anything Model 2." ArXiv (2024). [paper] [code] [2024.11]

:boom:EfficientTAMs: Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, Raghuraman Krishnamoorthi, Bilge Soran, Vikas Chandra.<br /> "Efficient Track Anything." ArXiv (2024). [paper] [code] [2024.11]

Open-Vocabulary Tracking

Point Tracking

Instruction Tracking

Interactive Tracking and Localization

Domain-Specific Tracking

:boom:TAP: Jia Syuen Lim, Yadan Luo, Zhi Chen, Tianqi Wei, Scott Chapman, Zi Huang.<br /> "Track Any Peppers: Weakly Supervised Sweet Pepper Tracking Using VLMs." ArXiv (2024). [paper] [2024.11]

:boom:Vanhaeverbeke, J.; Decorte, R.; Slembrouck, M.; Hoecke, S.V.; Verstockt, S.<br /> "Point of Interest Recognition and Tracking in Aerial Video during Live Cycling Broadcasts." Applied Sciences (2024). [paper] [2024.12]

Video Editing and Generation

Video Editing

Text Guided Video Editing

Other Modality-guided Video Editing

Domain-Specific Editing

Video Frame Interpolation

3D Video Reconstruction

Video Dataset Annotation Generation

Video Super-Resolution

Text-to-Video Generation

Video Generation with Other Modalities

Others

Video Understanding and Analysis

Video Captioning

Video Dialog

Video Grounding

Optical Flow Estimation

Pose Estimation

Video ReID

Others

:boom:MORA: Andong Deng, Tongjia Chen, Shoubin Yu, Taojiannan Yang, Lincoln Spencer, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, Chen Chen.<br /> "Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level." ArXiv (2024). [paper] [code] [2024.11]

Medical Video Processing

Other Video Tasks

Video Compression

Robotics

Video Game

Video Style Transfer Attack

Semantic Communication

Tool Software

License

This project is released under the MIT license. Please see the LICENSE file for more information.