Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Release

Abstract

Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains challenging, especially for long videos. Due to insufficient access to large-scale, high-quality video data and the excessive compression of visual features, current methods are limited in their ability to process long videos effectively. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. To address the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. In particular, on benchmarks specialized for long videos, Kangaroo outperforms some larger models with over 10B parameters and proprietary models.

Highlights

Model

<p align="center"> <img src="assets/model.png" width="70%"> </p>

Quick Start

Installation

  1. Prepare environment
conda create -n kangaroo python=3.9 -y
conda activate kangaroo
pip install -r requirements
  2. Install flash-attn
pip install flash-attn --no-build-isolation
  3. Install NVIDIA Apex following the instructions in the apex repository
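
After the three installation steps above, a quick sanity check can confirm that the packages are visible to the active `kangaroo` environment. The following is a minimal sketch, not part of the repository: the helper name `check_packages` is hypothetical, and the module list assumes the usual import names (`flash_attn` for flash-attn, `apex` for NVIDIA Apex, plus `transformers` and `streamlit`, which later sections rely on).

```python
from importlib.util import find_spec


def check_packages(names):
    """Return {module_name: bool} indicating whether each module can be located."""
    # find_spec returns None when a top-level module is not installed.
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names; adjust the list to match your requirements file.
    expected = ["transformers", "flash_attn", "apex", "streamlit"]
    for name, found in check_packages(expected).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Any module reported as `MISSING` points to the installation step that should be repeated. The check only locates the modules and does not verify that their CUDA extensions were built correctly.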

Multi-round Chat with 🤗 Transformers

See chat.ipynb

Streamlit Deploy

We provide code for users to build a web UI demo. Please use streamlit==1.36.0.

streamlit run streamlit_app.py --server.port PORT
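
`PORT` in the command above is a placeholder for any free TCP port on your machine. If you are unsure which port is available, one option is to ask the operating system for an unused one. The snippet below is a small sketch under that assumption; `find_free_port` is a hypothetical helper, not something shipped with the repository.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    # Print the command to run; the port is released once the socket closes.
    print(f"streamlit run streamlit_app.py --server.port {port}")
```

The port is only guaranteed to be free at the moment of the check, so another process could claim it before Streamlit starts; rerun the helper if the launch fails with an address-in-use error.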

Results

Evaluation Results

<p align="center"> <img src="assets/bench.png" width="90%" style="margin: 40;"> </p>

Results on VideoMME

<p align="center"> <img src="assets/videomme.png" width="80%" style="margin: 40;"> </p>

Results on SeedBench-Video

<p align="center"> <img src="assets/seed.png" width="65%" style="margin: 40;"> </p>

Qualitative Examples

<p align="center"> <img src="assets/demo1.png" width="80%" style="margin-top:100px;"> </p> <br><br> <p align="center"> <img src="assets/demo2.png" width="80%" > </p> <br><br> <p align="center"> <img src="assets/demo3.png" width="80%" > </p>

Citation

If you find this work useful for your research, please cite the related paper using this BibTeX:


@article{kangaroogroup,
	title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
	author={Liu, Jiajun and Wang, Yibing and Ma, Hanghang and Wu, Xiaoping and Ma, Xiaoqi and Wei, Xiaoming and Jiao, Jianbin and Wu, Enhua and Hu, Jie},
	journal={arXiv preprint arXiv:2408.15542},
	year={2024}
}