ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao*
While previous studies have explored integrating reasoning with video segmentation through LLMs, they struggle to model complex scenes effectively, particularly those with multiple objects, rapid motion, heavy occlusions, and extended durations. ViLLa (Video reasoning segmentation with Large Language Model) handles both complex reasoning and referring video segmentation. Our model also performs strongly on various temporal understanding benchmarks.