# Oryx Video-ChatGPT :movie_camera: :speech_balloon:

<p align="center"> <img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT"> </p>

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [ACL 2024 🔥]

Muhammad Maaz\*, Hanoona Rasheed\*, Salman Khan and Fahad Shahbaz Khan

\* Equally contributing first authors

Mohamed bin Zayed University of Artificial Intelligence


Diverse Video-based Generative Performance Benchmarking (VCGBench-Diverse)


Video-based Generative Performance Benchmarking


Zero-shot Question-Answer Evaluation



Demo | Paper | Demo Clips | Offline Demo | Training | Video Instruction Data | Quantitative Evaluation | Qualitative Analysis


## Online Demo :computer:

:fire::fire: You can try our demo using the provided examples or by uploading your own videos HERE. :fire::fire:

:fire::fire: Or click the image to try the demo! :fire::fire: All the videos shown in the demo are available here.


## Video-ChatGPT Overview :bulb:

Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation.

<p align="center"> <img src="docs/images/Video-ChatGPT.gif" alt="Video-ChatGPT Architectural Overview"> </p>

## Contributions :trophy:

<p align="center"> <img src="docs/images/hightlights_video_chatgpt.png" alt="Contributions"> </p>

## Installation :wrench:

We recommend setting up a conda environment for the project:

```shell
conda create --name=video_chatgpt python=3.10
conda activate video_chatgpt

git clone https://github.com/mbzuai-oryx/Video-ChatGPT.git
cd Video-ChatGPT
pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"
```

Additionally, install FlashAttention for training:

```shell
pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout v1.0.7
python setup.py install
```

## Running Demo Offline :cd:

To run the demo offline, please refer to the instructions in offline_demo.md.


## Training :train:

For training instructions, check out train_video_chatgpt.md.


## Video Instruction Dataset :open_file_folder:

We are releasing a dataset of 100,000 high-quality video instruction pairs that was used to train our Video-ChatGPT model. You can download the dataset from here. More details on our human-assisted and semi-automatic annotation framework for generating the data are available at VideoInstructionDataset.md.
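The released data is a single JSON download of question-answer pairs grounded in videos. A minimal sketch of loading and inspecting it follows; note that the field names used here (`video_id`, `q`, `a`) are assumptions for illustration only, so inspect the downloaded file for the actual schema before relying on them.

```python
import json

# Hypothetical records mimicking the downloaded file; the field names
# ("video_id", "q", "a") are assumptions, not the documented schema.
raw = json.dumps([
    {"video_id": "v_000001", "q": "What is happening in the video?",
     "a": "A person is cooking in a kitchen."},
    {"video_id": "v_000002", "q": "Describe the scene.",
     "a": "Children are playing football in a park."},
])

data = json.loads(raw)
n_videos = len({d["video_id"] for d in data})
print(f"{len(data)} instruction pairs covering {n_videos} videos")
```

In practice you would replace `raw` with the contents of the downloaded JSON file and adapt the keys to whatever the release actually uses.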


## Quantitative Evaluation :bar_chart:

Our paper introduces a new Quantitative Evaluation Framework for Video-based Conversational Models. To explore our benchmarks and understand the framework in greater detail, please visit our dedicated website: https://mbzuai-oryx.github.io/Video-ChatGPT.

For detailed instructions on performing quantitative evaluation, please refer to QuantitativeEvaluation.md.

Video-based Generative Performance Benchmarking and Zero-Shot Question-Answer Evaluation tables are provided for a detailed performance overview.

### Zero-Shot Question-Answer Evaluation

| Model | MSVD-QA Acc. | MSVD-QA Score | MSRVTT-QA Acc. | MSRVTT-QA Score | TGIF-QA Acc. | TGIF-QA Score | ActivityNet-QA Acc. | ActivityNet-QA Score |
|---|---|---|---|---|---|---|---|---|
| FrozenBiLM | 32.2 | -- | 16.8 | -- | 41.0 | -- | 24.7 | -- |
| Video Chat | 56.3 | 2.8 | 45.0 | 2.5 | 34.4 | 2.3 | 26.5 | 2.2 |
| LLaMA Adapter | 54.9 | 3.1 | 43.8 | 2.7 | -- | -- | 34.2 | 2.7 |
| Video LLaMA | 51.6 | 2.5 | 29.6 | 1.8 | -- | -- | 12.4 | 1.1 |
| Video-ChatGPT | 64.9 | 3.3 | 49.3 | 2.8 | 51.4 | 3.0 | 35.2 | 2.7 |
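Each Accuracy/Score pair above is an aggregate over per-question judgments; in the paper's GPT-assisted protocol, the judge emits a yes/no correctness verdict and a numeric quality score per answer. A minimal sketch of that aggregation, using invented judgments purely for illustration:

```python
# Reduce per-question judgments (a correctness verdict plus a quality
# score, as in the paper's GPT-assisted protocol) to an Accuracy/Score
# pair. The judgments below are made up for illustration.
judgments = [
    {"correct": True,  "score": 4},
    {"correct": True,  "score": 3},
    {"correct": False, "score": 1},
    {"correct": True,  "score": 4},
]

accuracy = 100.0 * sum(j["correct"] for j in judgments) / len(judgments)
avg_score = sum(j["score"] for j in judgments) / len(judgments)
print(f"Accuracy: {accuracy:.1f}  Score: {avg_score:.1f}")
```

The actual judging prompts and scripts are described in QuantitativeEvaluation.md.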

### Video-based Generative Performance Benchmarking

| Evaluation Aspect | Video Chat | LLaMA Adapter | Video LLaMA | Video-ChatGPT |
|---|---|---|---|---|
| Correctness of Information | 2.23 | 2.03 | 1.96 | 2.40 |
| Detail Orientation | 2.50 | 2.32 | 2.18 | 2.52 |
| Contextual Understanding | 2.53 | 2.30 | 2.16 | 2.62 |
| Temporal Understanding | 1.94 | 1.98 | 1.82 | 1.98 |
| Consistency | 2.24 | 2.15 | 1.79 | 2.37 |
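Each cell above is the mean of per-sample ratings for that evaluation aspect. The aggregation can be sketched as follows; the ratings here are invented for illustration, and the exact rating scale and prompts are those described in QuantitativeEvaluation.md.

```python
from statistics import mean

# Per-sample ratings for each evaluation aspect; values are made up
# for illustration, not taken from the benchmark.
ratings = {
    "Correctness of Information": [3, 2, 2, 3, 2],
    "Temporal Understanding": [2, 2, 1, 3, 2],
}

# Each benchmark cell is the mean rating over all evaluated samples.
summary = {aspect: round(mean(vals), 2) for aspect, vals in ratings.items()}
print(summary)
```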

## Qualitative Analysis :mag:

A Comprehensive Evaluation of Video-ChatGPT's Performance across Multiple Tasks.

### Video Reasoning Tasks :movie_camera:



### Creative and Generative Tasks :paintbrush:



### Spatial Understanding :globe_with_meridians:



### Video Understanding and Conversational Tasks :speech_balloon:



### Action Recognition :runner:



### Question Answering Tasks :question:



### Temporal Understanding :hourglass_flowing_sand:



## Citation :pray:

If you're using Video-ChatGPT in your research or applications, please cite using this BibTeX:

```bibtex
@inproceedings{Maaz2023VideoChatGPT,
    title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},
    year={2024}
}
```

## License :scroll:

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

Looking forward to your feedback, contributions, and stars! :star2: Please raise any issues or questions here.


<img src="docs/images/IVAL_logo.png" width="200" height="100"> <img src="docs/images/Oryx_logo.png" width="100" height="100"> <img src="docs/images/MBZUAI_logo.png" width="360" height="85">