VideoGPT+ :movie_camera: :speech_balloon:

<p align="center"> <img src="docs/images/videogpt_plus_face.jpeg" alt="videogpt_plus_face" width="200"> </p> <p align="center"> <img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT"> </p>

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Muhammad Maaz, Hanoona Rasheed, Salman Khan and Fahad Khan

Mohamed bin Zayed University of Artificial Intelligence


[Paper](https://arxiv.org/abs/2406.09418) | Video | Dataset | Demo


Papers with Code leaderboards:

- Diverse Video-based Generative Performance Benchmarking (VCGBench-Diverse)
- Video Question Answering on MVBench
- Video-based Generative Performance Benchmarking


:loudspeaker: Latest Updates


VideoGPT+ Overview :bulb:

VideoGPT+ integrates image and video encoders to leverage detailed spatial understanding and global temporal context, respectively. It processes videos in segments using adaptive pooling on features from both encoders, enhancing performance across various video benchmarks.

<p align="center"> <img src="docs/images/block_diagram.png" alt="VideoGPT+ Architectural Overview"> </p>

Contributions :trophy:

<p align="center"> <img src="docs/images/intro_radar_plot.png" alt="Contributions" width="650"> </p>

Video Annotation Pipeline (VCG+ 112K) :open_file_folder:

Video-ChatGPT introduces the VideoInstruct100K dataset, which employs a semi-automatic annotation pipeline to generate 75K instruction-tuning QA pairs. To address the limitations of this annotation process, we present the VCG+ 112K dataset, developed through an improved annotation pipeline. Our approach improves the accuracy and quality of the instruction-tuning pairs through improved keyframe extraction, detailed descriptions from SoTA large multimodal models (LMMs), and a refined instruction-generation strategy.

<p align="center"> <img src="docs/images/vcg120k_block_diagram.png" alt="Contributions"> </p>

VCGBench-Diverse :mag:

Recognizing the limited diversity in existing video conversation benchmarks, we introduce VCGBench-Diverse to comprehensively evaluate the generalization ability of video LMMs. While VCG-Bench provides an extensive evaluation protocol, it is limited to videos from the ActivityNet200 dataset. Our benchmark comprises 877 videos spanning 18 broad video categories and 4,354 QA pairs, ensuring a robust evaluation framework.

<p align="center"> <img src="docs/images/vcgbench_block_diag.png" alt="Contributions"> </p>

Installation :wrench:

We recommend setting up a conda environment for the project:

```shell
conda create --name=videogpt_plus python=3.11
conda activate videogpt_plus

git clone https://github.com/mbzuai-oryx/VideoGPT-plus
cd VideoGPT-plus

pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.41.0

pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"
```

Additionally, install FlashAttention for training:

```shell
pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install
```
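
A quick sanity check that the environment matches the pinned versions above (the flash_attn import only matters if you installed it for training):

```python
# Verify the pinned package versions and CUDA availability.
import torch
import transformers

print("torch:", torch.__version__)                 # expected 2.1.2 (cu118 build)
print("cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)   # expected 4.41.0

try:
    import flash_attn                              # only required for training
    print("flash_attn:", flash_attn.__version__)
except ImportError:
    print("flash_attn not installed (only required for training)")
```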

Quantitative Evaluation 📊

We provide instructions to reproduce VideoGPT+ results on VCGBench, VCGBench-Diverse and MVBench. Please follow the instructions at eval/README.md.

VCGBench Evaluation: Video-based Generative Performance Benchmarking :chart_with_upwards_trend:

<p align="center"> <img src="docs/images/VCGBench_quantitative.png" alt="VCGBench_quantitative" width="1000"> </p>

VCGBench-Diverse Evaluation :bar_chart:

<p align="center"> <img src="docs/images/VCGDiverse_quantitative.png" alt="VCGDiverse_quantitative"> </p>

Zero-Shot Question-Answer Evaluation :question:

<p align="center"> <img src="docs/images/zero_shot_quantitative.png" alt="zero_shot_quantitative"> </p>

MVBench Evaluation :movie_camera:

<p align="center"> <img src="docs/images/MVBench_quantitative.png" alt="MVBench_quantitative"> </p>

Training :train:

We provide scripts for pretraining and finetuning of VideoGPT+. Please follow the instructions at scripts/README.md.


Qualitative Analysis :mag:

A comprehensive evaluation of VideoGPT+ performance across multiple tasks and domains.

<p align="center"> <img src="docs/images/demo_vcg+_main.png" alt="demo_vcg+_main" width="700"> </p>
<p align="center"> <img src="docs/images/demo_vcg+_full_part1.jpg" alt="demo_vcg+_full_part1" width="700"> </p> <p align="center"> <img src="docs/images/demo_vcg+_full_part2.jpg" alt="demo_vcg+_full_part2" width="700"> </p>

Acknowledgements :pray:

Citations 📜

If you're using VideoGPT+ in your research or applications, please cite using this BibTeX:

```bibtex
@article{Maaz2024VideoGPT+,
    title={VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
    journal={arXiv},
    year={2024},
    url={https://arxiv.org/abs/2406.09418}
}

@inproceedings{Maaz2023VideoChatGPT,
    title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},
    year={2024}
}
```

License :scroll:

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

Looking forward to your feedback, contributions, and stars! :star2: Please raise any issues or questions in the repository's issue tracker.


<img src="docs/images/IVAL_logo.png" width="200" height="100"> <img src="docs/images/Oryx_logo.png" width="100" height="100"> <img src="docs/images/MBZUAI_logo.png" width="360" height="85">