<h1 align="center"> <img src="https://s21.ax1x.com/2024/11/25/pAhrS9s.png" width="220"/> <br> T2Vid: Efficient Video Fine-tuning Scheme for MLLMs </h1>

<p align="center">
   📑 <a href="https://arxiv.org/pdf/2411.19951">Paper</a>   |   🤗 <a href="https://huggingface.co/collections/xjtupanda/t2vid-673f104cdaf4ac3340b15964">Hugging Face</a>
</p>

**TL;DR:** We propose a data augmentation method that synthesizes "video" samples from long-text QA data to enrich the instruction diversity of video data, enabling more efficient training with comparable performance.
## ✨ Highlights
🤔 Main findings: The importance of instruction diversity in video fine-tuning and how to efficiently improve it.
- We observed limited instruction diversity in datasets developed for Video-LLMs, which led to low learning efficiency (<ins>more details and findings are available in our paper</ins>).
- Since long text data is a rich and economical source of instructions, we leveraged it by converting it into a format consistent with video instruction data.
🚀 Train less, achieve more: By mixing in our synthetic data, one can achieve comparable or better performance while using only 15% of the training samples.
| Model | Video-MME | MVBench | TempCompass |
|---|---|---|---|
| MiniCPM-V-2.5-8B<br><sub>zero-shot</sub> | 48.2 | 42.9 | 49.1 |
| MiniCPM-V-2.5-8B<br><sub>200K video data</sub> | 50.8 | 48.0 | 54.7 |
| MiniCPM-V-2.5-8B<br><sub>20K video data + 10K synthetic data</sub> | 53.0 | 48.4 | 56.8 |
| Idefics3-8B<br><sub>zero-shot</sub> | 51.2 | 49.6 | 55.9 |
| Idefics3-8B<br><sub>200K video data</sub> | 53.3 | 50.7 | 62.9 |
| Idefics3-8B<br><sub>20K video data + 10K synthetic data</sub> | 56.3 | 51.6 | 62.3 |
## 🛠️ Quick Setup
- Create a conda virtual environment and install the required packages.

  ```bash
  conda create -n t2vid python=3.9
  conda activate t2vid
  pip install -r requirements.txt
  ```
- Install Flash Attention 2 (for efficient training and inference).

  ```bash
  pip install -U flash-attn --no-build-isolation
  ```
## 💡 Training & Evaluation
The instructions on training and evaluation (including pre-trained weights) are in TRAIN.md and EVAL.md.
## 📖 Misc
For those interested in the implementation details of our paper:

- How to translate text into images? Check t2vid.py (a minimal sketch is given after this list).
- How to visualize the distribution of instructions? (a rough sketch also follows below)
  - Calculate embeddings and perform dimensionality reduction for instructions: calc_inst_embeddings.py.
  - Draw plots: vis-tsne.ipynb.
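
The following is a minimal, illustrative sketch of the text-to-image idea: a long document is split into chunks and each chunk is rendered as an image, so the resulting frame sequence can be paired with the original QA pair as a pseudo-video sample. The function name and rendering parameters below are our own assumptions; t2vid.py contains the actual implementation.

```python
# Illustrative sketch only: chunk a long text and render each chunk as an image.
# The helper name and parameters are hypothetical; see t2vid.py for the real logic.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def text_to_frames(long_text, lines_per_frame=20, chars_per_line=60,
                   width=448, height=448):
    """Wrap the text into fixed-width lines and group them into image 'frames'."""
    font = ImageFont.load_default()
    lines = textwrap.wrap(long_text, width=chars_per_line)
    frames = []
    for i in range(0, len(lines), lines_per_frame):
        img = Image.new("RGB", (width, height), color="white")
        draw = ImageDraw.Draw(img)
        draw.multiline_text((10, 10), "\n".join(lines[i:i + lines_per_frame]),
                            fill="black", font=font)
        frames.append(img)
    return frames

# Usage: pair the rendered frames with the original QA pair to build a
# pseudo-video instruction sample.
frames = text_to_frames(open("long_doc.txt").read())
for idx, frame in enumerate(frames):
    frame.save(f"frame_{idx:03d}.png")
```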
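
Likewise, here is a rough sketch of the embedding and visualization step, assuming sentence-transformers, scikit-learn, and matplotlib are installed and that the instructions sit in a JSON list (the file name and encoder choice are assumptions); calc_inst_embeddings.py and vis-tsne.ipynb contain the actual pipeline.

```python
# Rough sketch: embed instructions and project them to 2-D with t-SNE.
# File name and model choice are assumptions, not the paper's exact setup.
import json
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

with open("instructions.json") as f:   # hypothetical: a JSON list of instruction strings
    instructions = json.load(f)

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder works
embeddings = encoder.encode(instructions, show_progress_bar=True)

# Reduce the high-dimensional embeddings to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=2, alpha=0.5)
plt.title("t-SNE of instruction embeddings")
plt.savefig("inst_tsne.png", dpi=200)
```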
## 🙌 Related Projects
- Video-MME: A comprehensive video benchmark that we mainly use in our study.
- Awesome-MLLM: A project keeping track of new papers and the latest developments in the field of MLLMs.
## 🌻 Acknowledgement
- Great open-source MLLMs and code: MiniCPM-V, Idefics3, InternVL.
- Long text instruction data: LongAlpaca and LongQLoRA.
## 🖋️ Citation
If you find our project useful, please consider citing our paper:
```bibtex
@article{yin2024t2vid,
  title={T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Shen, Yunhang and Ge, Chunjiang and Yang, Yan and Long, Zuwei and Dai, Yuhan and Xu, Tong and Sun, Xing and others},
  journal={arXiv preprint arXiv:2411.19951},
  year={2024}
}
```