
Segment Anything for Videos: A Systematic Survey

The first survey on SAM for videos: *Segment Anything for Videos: A Systematic Survey*. Chunhui Zhang, Yawen Cui, Weilin Lin, Guanjie Huang, Yan Rong, Li Liu, Shiguang Shan. [ArXiv] [ChinaXiv] [ResearchGate] [Project] [中文解读]

<p align="justify"> Abstract: The recent wave of foundation models has witnessed tremendous success in computer vision (CV) and beyond, with the segment anything model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (e.g., text-to-mask) tasks, but also in the video domain. Additionally, the recently released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks; a comprehensive and in-depth review of the video domain is notably absent. To address this gap, this work conducts a systematic review of SAM for videos in the era of foundation models. As the first to review the progress of SAM for videos, this work focuses on its applications to various tasks by discussing its recent advances and the innovation opportunities it opens for developing foundation models across broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, comparative results of SAM-based and current state-of-the-art methods on representative benchmarks, as well as insightful analyses, are offered. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for video and beyond. </p>

This project will be continuously updated. We expect to include more state-of-the-art works on SAM for videos.

We welcome authors of related works to submit pull requests and become a contributor to this project.

The first comprehensive SAM survey, *A Comprehensive Survey on Segment Anything Model for Vision and Beyond*, is available [here].

:fire: Highlights

- 2024.07.31: The first survey on SAM for videos went online.
- 2024.07.30: SAM 2 was released.

Citation

If you find our work useful in your research, please consider citing:

@article{chunhui2024samforvideos,
  title={Segment Anything for Videos: A Systematic Survey},
  author={Zhang, Chunhui and Cui, Yawen and Lin, Weilin and Huang, Guanjie and Rong, Yan and Liu, Li and Shan, Shiguang},
  journal={arXiv},
  year={2024}
}

Contents

Video Understanding

Video Object Segmentation

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| SAM 2: Segment Anything in Images and Videos | arXiv | github | arXiv-2024 |
| Segment Anything in High Quality | arXiv | github | NeurIPS-2023 |
| High-Quality Entity Segmentation | arXiv | github | ICCV-2023 |
| Tracking Anything with Decoupled Video Segmentation | arXiv | github | ICCV-2023 |
| DSEC-MOS: Segment Any Moving Object with Moving Ego Vehicle | arXiv | github | arXiv-2023 |
| Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching | arXiv | github | arXiv-2023 |
| Personalize Segment Anything Model with One Shot | arXiv | github | arXiv-2023 |
| UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model | arXiv | - | arXiv-2023 |
| 3rd Place Solution for PVUW2023 VSS Track: A Large Model for Semantic Segmentation on VSPW | arXiv | - | arXiv-2023 |

Video Object Tracking

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Matching Anything by Segmenting Anything | arXiv | github | CVPR-2024 |
| Tracking Anything in High Quality | arXiv | github | arXiv-2023 |
| Tracking Anything with Decoupled Video Segmentation | arXiv | github | ICCV-2023 |
| Segment and Track Anything | arXiv | github | arXiv-2023 |
| Segment Anything Meets Point Tracking | arXiv | github | arXiv-2023 |
| Track Anything: Segment Anything Meets Videos | arXiv | github | arXiv-2023 |
| SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation | arXiv | github | arXiv-2023 |
| Unifying Foundation Models with Quadrotor Control for Visual Tracking Beyond Object Categories | arXiv | - | arXiv-2023 |
| UniQuadric: A SLAM Backend for Unknown Rigid Object 3D Tracking and Light-Weight Modeling | arXiv | - | arXiv-2023 |
| Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models | arXiv | github | arXiv-2023 |
| Follow Anything: Open-set detection, tracking, and following in real-time | arXiv | github | arXiv-2023 |
| SAM for Poultry Science | arXiv | - | arXiv-2023 |
| ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking | arXiv | - | arXiv-2023 |
| CoDeF: Content Deformation Fields for Temporally Consistent Video Processing | arXiv | github | arXiv-2023 |

Video Shadow Detection

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Detect Any Shadow: Segment Anything for Video Shadow Detection | arXiv | github | arXiv-2023 |

Deepfake

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization | arXiv | github | arXiv-2023 |

Miscellaneous

Audio-Visual Segmentation

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation | arXiv | - | arXiv-2023 |
| Leveraging Foundation models for Unsupervised Audio-Visual Segmentation | arXiv | - | arXiv-2023 |
| Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer | arXiv | - | arXiv-2023 |

Referring Video Object Segmentation

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | arXiv | github | arXiv-2023 |

Domain Specific

Medical Videos

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Spatio-Temporal Analysis of Patient-Derived Organoid Videos Using Deep Learning for the Prediction of Drug Efficacy | arXiv | - | ICCV Workshop-2023 |
| SAM Meets Robotic Surgery: An Empirical Study on Generalization, Robustness and Adaptation | arXiv | - | MICCAI MedAGI Workshop-2023 |
| MediViSTA-SAM: Zero-shot Medical Video Analysis with Spatio-temporal SAM Adaptation | arXiv | github | arXiv-2023 |
| SAMSNeRF: Segment Anything Model (SAM) Guides Dynamic Surgical Scene Reconstruction by Neural Radiance Field (NeRF) | arXiv | github | arXiv-2023 |
| SuPerPM: A Large Deformation-Robust Surgical Perception Framework Based on Deep Point Matching Learned from Physical Constrained Simulation Data | arXiv | - | arXiv-2023 |
| SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation | arXiv | github | arXiv-2023 |

Domain Adaptation

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Learning from SAM: Harnessing a Segmentation Foundation Model for Sim2Real Domain Adaptation through Regularization | arXiv | - | arXiv-2023 |
| SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation | arXiv | github | arXiv-2023 |

Tool Software

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models | arXiv | - | arXiv-2023 |

More Directions

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Generative AI-driven Semantic Communication Framework for NextG Wireless Network | arXiv | - | arXiv-2023 |
| Learning from Human Videos for Robotic Manipulation | arXiv | github | arXiv-2023 |
| Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation | arXiv | - | arXiv-2023 |
| Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots | arXiv | github | arXiv-2023 |
| ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts | arXiv | github | arXiv-2023 |
| SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model | arXiv | - | arXiv-2023 |
| Virtual Augmented Reality for Atari Reinforcement Learning | arXiv | - | arXiv-2023 |

Video Generation

Video Synthesis

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model | arXiv | - | arXiv-2023 |
| DisCo: Disentangled Control for Realistic Human Dance Generation | arXiv | github | arXiv-2023 |

Video Super-Resolution

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Can SAM Boost Video Super-Resolution? | arXiv | - | arXiv-2023 |

3D Reconstruction

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| SAM3D: Segment Anything in 3D Scenes | arXiv | github | arXiv-2023 |
| A One Stop 3D Target Reconstruction and multilevel Segmentation Method | arXiv | github | arXiv-2023 |

Video Dataset Annotation Generation

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Scalable Mask Annotation for Video Text Spotting | arXiv | github | arXiv-2023 |
| Audio-Visual Instance Segmentation | arXiv | - | arXiv-2023 |
| Learning the What and How of Annotation in Video Object Segmentation | arXiv | github | WACV-2023 |
| Propagating Semantic Labels in Video Data | arXiv | github | arXiv-2023 |
| Stable Yaw Estimation of Boats from the Viewpoint of UAVs and USVs | arXiv | - | arXiv-2023 |

Video Editing

Generic Video Editing

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts | arXiv | github | arXiv-2023 |

Text Guided Video Editing

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| CVPR 2023 Text Guided Video Editing Competition | arXiv | github | arXiv-2023 |

Object Removal

| Title | arXiv | Github | Pub. & Date |
| ----- | ----- | ------ | ----------- |
| OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields | arXiv | - | arXiv-2023 |

License

This project is released under the MIT license. Please see the LICENSE file for more information.