<h1 align="center">MLVU: Multi-task Long Video Understanding Benchmark</h1> <p align="center"> <a href="https://arxiv.org/abs/2406.04264"> <img alt="Build" src="http://img.shields.io/badge/cs.CV-arXiv%3A2406.04264-B31B1B.svg"> </a> <a href="https://huggingface.co/datasets/MLVU/MVLU"> <img alt="Build" src="https://img.shields.io/badge/🤗 Dataset-MLVU Benchmark (Dev)-yellow"> </a> <a href="https://huggingface.co/datasets/MLVU/MLVU_Test"> <img alt="Build" src="https://img.shields.io/badge/🤗 Dataset-MLVU Benchmark (Test)-yellow"> </a> </p> <p align="center"> <a href="https://mp.weixin.qq.com/s/7gjROX0T1MFApRB0WkDzMg"> <img alt="Build" src="https://img.shields.io/badge/BAAI-red"> </a> <a href="https://mp.weixin.qq.com/s/Z6lU37EhpJHbJHLfUCDMRg"> <img alt="Build" src="https://img.shields.io/badge/PaperWeekly-red"> </a> <a href="https://mp.weixin.qq.com/s/-HUORRvhGVDdfPcKReXsCg"> <img alt="Build" src="https://img.shields.io/badge/量子位-red"> </a> </p>

This repo contains the annotation data and evaluation code for the paper "MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding".

License

Our dataset is under the CC-BY-NC-SA-4.0 license.

:warning: If you need to access and use our dataset, you must understand and agree: This dataset is for research purposes only and cannot be used for any commercial or other purposes. The user assumes all effects arising from any other use and dissemination.

We do not own the copyright of any raw video files. Currently, we provide video access to researchers under the condition of acknowledging the above license. For the video data used, we respect and acknowledge any copyrights of the video authors. Therefore, for the movies, TV series, documentaries, and cartoons used in the dataset, we have reduced the resolution, clipped the length, adjusted dimensions, etc. of the original videos to minimize the impact on the rights of the original works.

If the original authors of the related works still believe that the videos should be removed, please contact mlvubenchmark@gmail.com or directly raise an issue.

Introduction

We introduce MLVU: the first comprehensive benchmark designed for evaluating Multimodal Large Language Models (MLLMs) in Long Video Understanding (LVU) tasks. MLVU is constructed from a wide variety of long videos, with lengths ranging from 3 minutes to 2 hours, and includes nine distinct evaluation tasks. These tasks require MLLMs to leverage both global and local information from videos.

Our evaluation of 20 popular MLLMs, including GPT-4o, reveals significant challenges in LVU: even the top-performing GPT-4o achieves an average score of only 64.6% on multiple-choice tasks. In addition, our empirical results underscore the need for longer context lengths, better image understanding, and stronger LLM backbones. We anticipate that MLVU will serve as a catalyst for the community to further advance MLLMs' capabilities in understanding long videos.

Statistical overview of our MLVU dataset. Left: Distribution of Video Duration; Middle: Distribution of Source Types for Long Videos; Right: Quantification of Each Task Type.

:trophy: Mini-Leaderboard (MLVU Dev Set)

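| Model | Input | Size | M-Avg | G-Avg |
| --- | --- | --- | --- | --- |
| Full mark | - | - | 100 | 10 |
| Oryx-1.5 | 128 frm | 32B | 72.3 | -- |
| Aria | 256 frm | 25B | 70.6 | 5.02 |
| LLaVA-OneVision | 32 frm | 72B | 66.4 | -- |
| Video-XL | 256 frm | 7B | 64.9 | 4.50 |
| GPT-4o | 0.5 fps | - | 64.6 | 5.80 |
| TimeMarker | <=128 frm | 8B | 63.9 | 3.99 |
| Video-CCAM | 96 frm | 14B | 63.1 | 4.01 |
| VideoLLaMA2 | 16 frm | 72B | 61.2 | -- |
| InternVL2 | 16 frm | 76B | 59.9 | -- |
| VILA-1.5 | 14 frm | 40B | 56.7 | 4.31 |
| LongVA | 256 frm | 7B | 56.3 | 4.33 |
| InternVL-1.5 | 16 frm | 26B | 50.4 | 4.02 |
| GPT-4 Turbo | 16 frm | - | 49.2 | 5.35 |
| VideoLLaMA2-Chat | 16 frm | 7B | 48.5 | 3.99 |
| VideoChat2_HD | 16 frm | 7B | 47.9 | 3.99 |
| Video-LLaVA | 8 frm | 7B | 47.3 | 3.84 |
| ShareGPT4Video | 16 frm | 8B | 46.4 | 3.77 |
| VideoChat2-Vicuna | 16 frm | 7B | 44.5 | 3.81 |
| MiniGPT4-Video | 90 frm | 7B | 44.5 | 3.36 |
| Qwen-VL-Max | 16 frm | - | 42.2 | 3.96 |
| VTimeLLM | 100 frm | 7B | 41.9 | 3.94 |
| LLaVA-1.6 | 16 frm | 7B | 39.3 | 3.23 |
| Claude-3-Opus | 16 frm | - | 36.5 | 3.39 |
| MA-LMM | 1000 frm | 7B | 36.4 | 3.46 |
| Video-LLaMA-2 | 16 frm | 13B | 35.5 | 3.78 |
| LLaMA-VID | 1 fps | 7B | 33.2 | 4.22 |
| Video-ChatGPT | 100 frm | 7B | 31.3 | 3.90 |
| TimeChat | 96 frm | 7B | 30.9 | 3.42 |
| VideoChat | 16 frm | 7B | 29.2 | 3.66 |
| Movie-LLM | 1 fps | 7B | 26.1 | 3.94 |
| mPLUG-Owl-V | 16 frm | 7B | 25.9 | 3.84 |
| MovieChat | 2048 frm | 7B | 25.8 | 2.78 |
| Otter-V | 16 frm | 7B | 24.4 | 3.31 |
| Otter-I | 16 frm | 7B | 23.3 | 3.15 |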

:trophy: MLVU-Test Leaderboard

Rows are approximately sorted by M-AVG in descending order. * denotes proprietary models.

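| Model | Input | Size | TR | AR | NQA | ER | PQA | SQA | AO | AC | TQA | M-AVG | SSC | VS | G-Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aria | 256 frm | 25B | 86.8 | 64.1 | 75.0 | 56.6 | 66.0 | 58.3 | 48.6 | 25.0 | 46.5 | 58.5 | -- | -- | -- |
| GPT-4o* | 0.5 fps | -- | 83.7 | 68.8 | 42.9 | 47.8 | 57.1 | 63.6 | 46.2 | 35.0 | 48.7 | 54.9 | 6.80 | 4.94 | 5.87 |
| TimeMarker | 128 frm | 8B | 85.7 | 53.9 | 65.0 | 49.1 | 52.0 | 41.7 | 31.4 | 26.7 | 37.2 | 49.2 | 4.02 | 3.20 | 3.61 |
| LLaVA-OneVision | 32 frm | 72B | 83.5 | 56.4 | 46.7 | 58.4 | 58.0 | 27.8 | 35.7 | 23.3 | 34.9 | 47.2 | 5.09 | 3.75 | 4.42 |
| InternVL2 | 16 frm | 76B | 85.7 | 51.3 | 48.3 | 47.2 | 52.0 | 44.4 | 32.9 | 15.0 | 34.9 | 45.7 | 5.25 | 2.55 | 3.90 |
| VideoLLaMA2 | 16 frm | 72B | 80.2 | 53.8 | 36.7 | 54.7 | 54.0 | 38.9 | 42.9 | 16.7 | 32.6 | 45.6 | 5.09 | 2.80 | 3.95 |
| Video-XL | 256 frm | 7B | 78.0 | 28.2 | 50.0 | 41.5 | 46.0 | 41.6 | 48.6 | 31.7 | 44.2 | 45.5 | 5.02 | 3.40 | 4.21 |
| VILA-1.5 | 14 frm | 40B | 84.7 | 56.4 | 38.3 | 35.8 | 62.0 | 38.8 | 34.3 | 11.7 | 34.9 | 44.2 | 5.11 | 2.53 | 3.82 |
| TS-LLaVA | 50 frm | 34B | 83.5 | 43.6 | 55.0 | 32.1 | 46.0 | 55.6 | 28.6 | 10.0 | 32.6 | 43.0 | -- | -- | -- |
| Video-CCAM | 96 frm | 14B | 79.1 | 38.5 | 45.0 | 52.8 | 56.0 | 33.3 | 24.3 | 26.7 | 30.2 | 42.9 | 4.49 | 2.65 | 3.57 |
| LongVA | 256 frm | 7B | 81.3 | 41.0 | 46.7 | 39.6 | 46.0 | 44.4 | 17.1 | 23.3 | 30.2 | 41.1 | 4.92 | 2.90 | 3.91 |
| InternVL-1.5 | 16 frm | 26B | 80.2 | 51.3 | 40.0 | 24.5 | 42.0 | 30.6 | 14.3 | 13.3 | 39.5 | 37.3 | 5.18 | 2.73 | 3.96 |
| VideoChat2_HD | 16 frm | 7B | 74.7 | 43.6 | 35.0 | 34.0 | 30.0 | 30.6 | 21.4 | 23.3 | 23.3 | 35.1 | 5.14 | 2.83 | 3.99 |
| VideoLLaMA2-Chat | 16 frm | 7B | 76.9 | 35.9 | 26.7 | 34.0 | 40.0 | 27.8 | 17.1 | 15.0 | 20.9 | 32.7 | 5.27 | 2.40 | 3.84 |
| ShareGPT4Video | 16 frm | 8B | 73.6 | 25.6 | 31.7 | 45.3 | 38.0 | 38.9 | 17.1 | 8.3 | 25.6 | 33.8 | 4.72 | 2.53 | 3.63 |
| VideoChat2-Vicuna | 16 frm | 7B | 72.5 | 30.8 | 18.3 | 28.3 | 26.0 | 36.1 | 17.1 | 23.3 | 18.6 | 30.1 | 4.80 | 2.30 | 3.55 |
| Video-LLaVA | 8 frm | 7B | 70.3 | 38.5 | 13.3 | 26.4 | 26.0 | 38.9 | 20.0 | 21.7 | 20.9 | 30.7 | 5.06 | 2.30 | 3.68 |
| LLaVA-1.6 | 16 frm | 7B | 63.7 | 17.9 | 13.3 | 26.4 | 30.0 | 22.2 | 21.4 | 16.7 | 16.3 | 25.3 | 4.20 | 2.00 | 3.10 |
| Claude-3-Opus* | 16 frm | -- | 53.8 | 30.8 | 14.0 | 17.0 | 20.0 | 47.2 | 10.0 | 6.7 | 25.6 | 25.0 | 3.67 | 2.83 | 3.25 |
| VideoChat | 16 frm | 7B | 26.4 | 12.8 | 18.3 | 17.0 | 22.0 | 11.1 | 15.7 | 11.7 | 14.0 | 16.6 | 4.90 | 2.15 | 3.53 |
| Video-ChatGPT | 16 frm | 7B | 17.6 | 17.9 | 28.3 | 32.1 | 22.0 | 27.8 | 17.1 | 13.3 | 11.6 | 20.9 | 5.06 | 2.22 | 3.64 |
| Video-LLaMA-2 | 16 frm | 13B | 52.7 | 12.8 | 13.3 | 17.0 | 12.0 | 19.4 | 15.7 | 8.3 | 18.6 | 18.9 | 4.87 | 2.23 | 3.55 |
| Qwen-VL-Max* | 10 frm | -- | 75.8 | 53.8 | 15.0 | 26.4 | 38.0 | 44.4 | 20.0 | 11.7 | 22.6 | 34.2 | 4.84 | 3.00 | 3.92 |
| MA-LMM | 16 frm | 7B | 44.0 | 23.1 | 13.3 | 30.2 | 14.0 | 27.8 | 18.6 | 13.3 | 14.0 | 22.0 | 4.61 | 3.04 | 3.83 |
| MiniGPT4-Video | 90 frm | 7B | 64.9 | 46.2 | 20.0 | 30.2 | 30.0 | 16.7 | 15.7 | 15.0 | 18.6 | 28.6 | 4.27 | 2.50 | 3.39 |
| Movie-LLM | 1 fps | 7B | 27.5 | 25.6 | 10.0 | 11.3 | 16.0 | 16.7 | 20.0 | 21.7 | 23.3 | 19.1 | 4.93 | 2.10 | 3.52 |
| Otter-I | 16 frm | 7B | 17.6 | 17.9 | 16.7 | 17.0 | 18.0 | 16.7 | 15.7 | 16.7 | 14.0 | 16.7 | 3.90 | 2.03 | 2.97 |
| Otter-V | 16 frm | 7B | 16.5 | 12.8 | 16.7 | 22.6 | 22.0 | 8.3 | 12.9 | 13.3 | 16.3 | 15.7 | 4.20 | 2.18 | 3.19 |
| MovieChat | 2048 frm | 7B | 18.7 | 10.3 | 23.3 | 15.1 | 16.0 | 30.6 | 17.1 | 15.0 | 16.3 | 18.0 | 3.24 | 2.30 | 2.77 |
| mPLUG-Owl-V | 16 frm | 7B | 25.3 | 15.4 | 6.7 | 13.2 | 22.0 | 19.4 | 14.3 | 20.0 | 18.6 | 17.2 | 5.01 | 2.20 | 3.61 |
| LLaMA-VID | 1 fps | 7B | 20.9 | 23.1 | 21.7 | 11.3 | 16.0 | 16.7 | 18.6 | 15.0 | 11.6 | 17.2 | 4.15 | 2.70 | 3.43 |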
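
The two averages in the MLVU-Test table appear to be plain unweighted means: in the GPT-4o row, M-AVG matches the mean of the nine multiple-choice accuracies and G-Avg matches the mean of SSC and VS. The minimal Python sketch below recomputes both from that row. The helper names `m_avg` and `g_avg` are illustrative assumptions and are not taken from the official evaluation code.

```python
from statistics import mean

# Scores copied from the GPT-4o row of the MLVU-Test leaderboard.
multi_choice = {
    "TR": 83.7, "AR": 68.8, "NQA": 42.9, "ER": 47.8, "PQA": 57.1,
    "SQA": 63.6, "AO": 46.2, "AC": 35.0, "TQA": 48.7,
}
generation = {"SSC": 6.80, "VS": 4.94}


def m_avg(task_scores: dict) -> float:
    """Unweighted mean of multiple-choice accuracies, rounded to one decimal."""
    return round(mean(task_scores.values()), 1)


def g_avg(task_scores: dict) -> float:
    """Unweighted mean of generation scores, rounded to two decimals."""
    return round(mean(task_scores.values()), 2)


print(m_avg(multi_choice))  # 54.9, as listed in the table
print(g_avg(generation))    # 5.87, as listed in the table
```

Because each task contributes equally regardless of how many questions it contains, a model's M-AVG can differ from its overall per-question accuracy.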

MLVU Benchmark

Before you access our dataset, we kindly ask you to thoroughly read and understand the license outlined above. If you cannot agree to these terms, we request that you refrain from downloading our video data.

The annotation file is readily accessible here. For the raw videos, you can access them via this <u>🤗 HF Link</u>.

MLVU encompasses nine distinct tasks, which include multiple-choice tasks as well as free-form generation tasks. These tasks are specifically tailored for long-form video understanding, and are classified into three categories: holistic understanding, single detail understanding, and multi-detail understanding. Examples of the tasks are displayed below.

Task Examples of our MLVU.

Evaluation

Please refer to our evaluation and evaluation_test folders for more details.

Hosting and Maintenance

The annotation files will be permanently retained.

If removal of some videos is requested, we will replace them with sets of frames sparsely sampled from those videos at adjusted resolution. Since all questions in MLVU concern only visual content and do not involve audio, this will not significantly affect the validity of MLVU (most existing MLLMs also process videos through frame extraction).

If even retaining the frame sets is not allowed, we will still keep the relevant annotation files and replace the frames with the video's meta-information, or actively seek more reliable, risk-free video sources.
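
To illustrate what sparse frame sampling could look like in practice, here is a minimal Python sketch that selects frame indices either as a fixed number of evenly spaced frames or at a fixed rate, mirroring the "frm" and "fps" input settings in the leaderboards. The function names and the segment-midpoint choice are assumptions made for illustration, not the procedure used by the MLVU authors or by any listed model.

```python
def sample_uniform(total_frames: int, num_frames: int) -> list[int]:
    """Pick num_frames indices, one from the middle of each equal-length segment."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Fewer frames available than requested: keep them all.
        return list(range(total_frames))
    segment = total_frames / num_frames
    return [int((i + 0.5) * segment) for i in range(num_frames)]


def sample_at_fps(total_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Pick indices so that roughly target_fps frames are kept per second."""
    if total_frames <= 0 or video_fps <= 0 or target_fps <= 0:
        return []
    # Never step by less than one frame, even if target_fps exceeds video_fps.
    step = max(video_fps / target_fps, 1.0)
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


print(sample_uniform(100, 4))         # [12, 37, 62, 87]
print(sample_at_fps(300, 30.0, 0.5))  # [0, 60, 120, 180, 240]
```

The resulting indices would then be passed to whichever video decoder is in use; decoding itself is left out so the sketch stays dependency-free.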

Citation

If you find this repository useful, please consider giving it a star :star: and citing our paper:

@article{MLVU,
  title={MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding},
  author={Zhou, Junjie and Shu, Yan and Zhao, Bo and Wu, Boya and Xiao, Shitao and Yang, Xi and Xiong, Yongping and Zhang, Bo and Huang, Tiejun and Liu, Zheng},
  journal={arXiv preprint arXiv:2406.04264},
  year={2024}
}