<p align="center"> <img width="400" src=".github/wordcloud.jpg"> </p> <h1 align="center">E.T. Bench: Towards Open-Ended Event-Level<br/>Video-Language Understanding</h1> <p align="center"> <a href="https://arxiv.org/abs/2409.18111"><img src="https://img.shields.io/badge/arXiv-2409.18111-red"></a> <a href="https://polyu-chenlab.github.io/etbench"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a> <a href="https://huggingface.co/datasets/PolyU-ChenLab/ETBench"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-E.T.%20Bench-blue"></a> <a href="https://huggingface.co/datasets/PolyU-ChenLab/ET-Instruct-164K"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-E.T.%20Instruct%20164K-orange"></a> <a href="/LICENSE"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a> </p> <p align="center"> <a href="https://yeliu.dev/">Ye Liu</a><sup>1,2</sup>, <a href="https://scholar.google.com/citations?user=qtdueToAAAAJ">Zongyang Ma</a><sup>2,3</sup>, <a href="https://scholar.google.com/citations?user=zJvrrusAAAAJ">Zhongang Qi</a><sup>2</sup>, <a href="https://scholar.google.com/citations?user=T-HaQ84AAAAJ">Yang Wu</a><sup>4</sup>, <a href="https://scholar.google.com/citations?user=4oXBp9UAAAAJ">Ying Shan</a><sup>2</sup>, <a href="https://web.comp.polyu.edu.hk/chencw/">Chang Wen Chen</a><sup>1</sup> <p align="center"><sup>1</sup>The Hong Kong Polytechnic University <sup>2</sup>ARC Lab, Tencent PCG<br/><sup>3</sup>Institute of Automation, Chinese Academy of Sciences <sup>4</sup>Tencent AI Lab</p> </p>

E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark) is a comprehensive solution for open-ended event-level video-language understanding. This project consists of the following three contributions:
- E.T. Bench: A large-scale, high-quality benchmark for event-level and time-sensitive video understanding, comprising 7.3K samples across 12 tasks, built on 7K videos (251.4 hours in total) spanning 8 domains.
- E.T. Chat: A multi-modal large language model (MLLM) that specializes in time-sensitive video-conditioned chatting. It reformulates timestamp prediction as a novel embedding matching problem (see the sketch after this list).
- E.T. Instruct 164K: A meticulously collected instruction-tuning dataset tailored for time-sensitive video understanding scenarios.
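
As an illustration of the embedding-matching idea behind E.T. Chat, the sketch below assumes that the model emits a special token whose hidden state is compared against per-frame video features, and that the best-matching frame index is converted back into a timestamp. All names here (`predict_timestamp`, `token_embedding`, `frame_features`, `fps`) are hypothetical and not part of the released code; see the Model page for the actual implementation.

```python
# Hypothetical sketch of embedding matching for timestamp prediction:
# match the hidden state of a special "time" token against per-frame
# features and map the best-matching frame index to seconds.
import torch
import torch.nn.functional as F


def predict_timestamp(token_embedding: torch.Tensor,
                      frame_features: torch.Tensor,
                      fps: float = 1.0) -> float:
    """token_embedding: (D,) hidden state of the special token.
    frame_features: (T, D) per-frame video embeddings sampled at `fps`."""
    # Cosine similarity between the token embedding and every frame feature.
    sims = F.cosine_similarity(token_embedding.unsqueeze(0), frame_features, dim=-1)  # (T,)
    best_frame = int(sims.argmax())
    return best_frame / fps  # frame index -> seconds


# Toy usage with random tensors.
emb, frames = torch.randn(256), torch.randn(64, 256)
print(f"Predicted timestamp: {predict_timestamp(emb, frames):.1f}s")
```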
We focus on 4 essential capabilities for time-sensitive video understanding: referring, grounding, dense captioning, and complex understanding. The examples (categorized by background colors) are as follows.
<p align="center"><img width="750" src=".github/task.jpg"></p>

## 🔥 News
- **2024.09.28** ⭐️ Code, model, and dataset release.
- **2024.09.27** 🎉 E.T. Bench has been accepted to NeurIPS 2024 (Datasets and Benchmarks Track).
## 🏆 Leaderboard
Our online leaderboard is under construction. Stay tuned!
<p align="center"> <img width="750" src=".github/leaderboard.jpg"> </p>

## 🔮 Benchmark
Please refer to the Benchmark page for details about E.T. Bench.
## 🛠️ Model
Please refer to the Model page for training and testing E.T. Chat.
## 📦 Dataset
Please refer to the Dataset page for downloading E.T. Instruct 164K.
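
For reference, the snippet below is one possible way to fetch both dataset repositories directly from the Hugging Face Hub with `huggingface_hub`; the repository IDs come from the badges above, while the local directory names are arbitrary. The Dataset and Benchmark pages remain the authoritative download instructions.

```python
# Download the E.T. Bench and E.T. Instruct 164K dataset repositories
# from the Hugging Face Hub. Local directory names are arbitrary choices.
from huggingface_hub import snapshot_download

# E.T. Bench evaluation data
snapshot_download(
    repo_id="PolyU-ChenLab/ETBench",
    repo_type="dataset",
    local_dir="data/etbench",
)

# E.T. Instruct 164K instruction-tuning data
snapshot_download(
    repo_id="PolyU-ChenLab/ET-Instruct-164K",
    repo_type="dataset",
    local_dir="data/et_instruct_164k",
)
```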
## 📖 Citation
Please kindly cite our paper if you find this project helpful.
```bibtex
@inproceedings{liu2024etbench,
  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2024}
}
```
## 💡 Acknowledgements
This project was built upon the following repositories with many thanks to their authors.