ViLLa: Video Reasoning Segmentation with Large Language Model

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao*

Previous studies have explored integrating reasoning with video segmentation through LLMs, but they struggled to model complex scenes characterized by multiple objects, rapid motion, heavy occlusions, and extended durations. ViLLa (Video reasoning segmentation with Large Language Model) handles both complex reasoning and referring video segmentation. It also performs well on several temporal understanding benchmarks.

Illustrations of ViLLa.

ViLLa is an effective and efficient LMM capable of segmenting and tracking: (a) multiple objects with rapid motion; (b) objects in crowded scenes; (c) objects in long videos with occlusions.

Visualization Results.

Comparison between ViLLa and VISA.

Experiments

Reasoning video segmentation results for ViLLa and prior related work on the VideoReasonSeg benchmark. "Seg" stands for "Segmentation" and "MC" for "Multiple Choice".