Segment Anything for Videos: A Systematic Survey
The first survey on SAM for videos: "Segment Anything for Videos: A Systematic Survey." Chunhui Zhang, Yawen Cui, Weilin Lin, Guanjie Huang, Yan Rong, Li Liu, Shiguang Shan. [ArXiv][ChinaXiv][ResearchGate][Project][Chinese commentary]
<p align="justify"> Abstract: The recent wave of foundation models has witnessed tremendous success in computer vision (CV) and beyond, with the segment anything model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (e.g., text-to-mask) tasks, but also in the video domain. Additionally, the latest released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks; a comprehensive and in-depth review in the video domain is notably absent. To address this gap, this work conducts a systematic review on SAM for videos in the era of foundation models. As the first to review the progress of SAM for videos, this work focuses on its applications to various tasks by discussing its recent advances, and innovation opportunities of developing foundation models on broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, comparative results of SAM-based and current state-of-the-art methods on representative benchmarks, as well as insightful analysis are offered. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for video and beyond. </p>
This project will be continuously updated. We expect to include more state-of-the-art work on SAM for videos.
We welcome authors of related works to submit pull requests and become contributors to this project.
The first comprehensive SAM survey, A Comprehensive Survey on Segment Anything Model for Vision and Beyond, is available [here].
:fire: Highlights
- 2024.07.31: The first survey on SAM for videos went online.
- 2024.07.30: SAM 2 was released.
Citation
If you find our work useful in your research, please consider citing:
@article{chunhui2024samforvideos,
title={Segment Anything for Videos: A Systematic Survey},
author={Chunhui Zhang and Yawen Cui and Weilin Lin and Guanjie Huang and Yan Rong and Li Liu and Shiguang Shan},
journal={arXiv},
year={2024}
}
Contents
Video Understanding
Video Object Segmentation
Video Object Tracking
Video Shadow Detection
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
Detect Any Shadow: Segment Anything for Video Shadow Detection | | github | arXiv-2023 |
Deepfake
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization | | github | arXiv-2023 |
Miscellaneous
Audio-Visual Segmentation
Referring Video Object Segmentation
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | | github | arXiv-2023 |
Domain Specific
Medical Videos
Domain Adaptation
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
Learning from SAM: Harnessing a Segmentation Foundation Model for Sim2Real Domain Adaptation through Regularization | | - | arXiv-2023 |
SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation | | github | arXiv-2023 |
Tool Software
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models | | - | arXiv-2023 |
More Directions
Video Generation
Video Synthesis
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model | | - | arXiv-2023 |
DisCo: Disentangled Control for Realistic Human Dance Generation | | github | arXiv-2023 |
Video Super-Resolution
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
Can SAM Boost Video Super-Resolution? | | | arXiv-2023 |
3D Reconstruction
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
SAM3D: Segment Anything in 3D Scenes | | github | arXiv-2023 |
A One Stop 3D Target Reconstruction and multilevel Segmentation Method | | github | arXiv-2023 |
Video Dataset Annotation Generation
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
Scalable Mask Annotation for Video Text Spotting | | github | arXiv-2023 |
Audio-Visual Instance Segmentation | | - | arXiv-2023 |
Learning the What and How of Annotation in Video Object Segmentation | | github | WACV-2023 |
Propagating Semantic Labels in Video Data | | github | arXiv-2023 |
Stable Yaw Estimation of Boats from the Viewpoint of UAVs and USVs | | - | arXiv-2023 |
| | | github | arXiv-2023 |
Video Editing
Generic Video Editing
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts | | github | arXiv-2023 |
Text Guided Video Editing
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
CVPR 2023 Text Guided Video Editing Competition | | github | arXiv-2023 |
Object Removing
Title | arXiv | Github | Pub. & Date |
---|---|---|---|
OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields | | - | arXiv-2023 |
License
This project is released under the MIT license. Please see the LICENSE file for more information.