<h1 align="center"> <img src="https://s21.ax1x.com/2024/11/25/pAhrS9s.png" width="220"/> <br> T2Vid: Efficient Video Fine-tuning Scheme for MLLMs </h1> <p align="center"> 📑 <a href="https://arxiv.org/pdf/2411.19951">Paper</a> &nbsp;|&nbsp; 🤗 <a href="https://huggingface.co/collections/xjtupanda/t2vid-673f104cdaf4ac3340b15964">Hugging Face</a> </p>

TL;DR: We propose a data augmentation method that synthesizes "video" samples from long-text QA data to enrich the instruction diversity of video training data, enabling more efficient fine-tuning with comparable performance.
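For intuition, here is a minimal, illustrative sketch of the general idea, not the official pipeline: a long text passage is split into chunks, each chunk is rendered as an image "frame", and the resulting frame sequence is paired with the original QA annotation. All function names and parameters below are hypothetical.

```python
# Illustrative sketch only: render chunks of a long text as image "frames"
# and keep the original QA pair, yielding a pseudo-video training sample.
import textwrap
from PIL import Image, ImageDraw

def text_to_frames(text, chars_per_frame=400, size=(448, 448)):
    """Split `text` into chunks and render each chunk as a plain white frame."""
    chunks = [text[i:i + chars_per_frame] for i in range(0, len(text), chars_per_frame)]
    frames = []
    for chunk in chunks:
        img = Image.new("RGB", size, "white")
        wrapped = "\n".join(textwrap.wrap(chunk, width=60))
        ImageDraw.Draw(img).multiline_text((10, 10), wrapped, fill="black")
        frames.append(img)
    return frames

def make_synthetic_sample(passage, question, answer):
    """Package the rendered frames with the original QA annotation."""
    return {"frames": text_to_frames(passage), "question": question, "answer": answer}
```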

✨ Highlights

🤔 Main findings: The importance of instruction diversity in video fine-tuning and how to efficiently improve it.

<p align="center"> <img src="https://s21.ax1x.com/2024/11/25/pAhyPTU.png" width="75%" height="75%"> </p>

🚀 Train less, achieve more: By mixing in our synthetic data, one can achieve comparable or better performance while using only 15% of the training samples (30K vs. the 200K baseline).

| Model | Video-MME | MVBench | TempCompass |
| --- | --- | --- | --- |
| MiniCPM-V-2.5-8B<br><sub>zero-shot</sub> | 48.2 | 42.9 | 49.1 |
| MiniCPM-V-2.5-8B<br><sub>200K video data</sub> | 50.8 | 48.0 | 54.7 |
| MiniCPM-V-2.5-8B<br><sub>20K video data + 10K synthetic data</sub> | 53.0 | 48.4 | 56.8 |
| Idefics3-8B<br><sub>zero-shot</sub> | 51.2 | 49.6 | 55.9 |
| Idefics3-8B<br><sub>200K video data</sub> | 53.3 | 50.7 | 62.9 |
| Idefics3-8B<br><sub>20K video data + 10K synthetic data</sub> | 56.3 | 51.6 | 62.3 |
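For reference, a rough sketch of how such a 20K + 10K mixture could be assembled; the file names and JSON structure below are illustrative, not the repo's actual data format.

```python
# Illustrative sketch: subsample the video instruction data, mix in the
# synthetic samples, and shuffle to build the ~30K-sample training set.
import json
import random

def build_mixture(video_json, synthetic_json, n_video=20_000, n_synth=10_000, seed=42):
    """Return a shuffled list of n_video video samples plus n_synth synthetic samples."""
    rng = random.Random(seed)
    with open(video_json) as f:
        video = json.load(f)
    with open(synthetic_json) as f:
        synth = json.load(f)
    mixture = rng.sample(video, n_video) + rng.sample(synth, n_synth)
    rng.shuffle(mixture)
    return mixture  # ~15% of the 200K-sample baseline

# Example usage (hypothetical file names):
# json.dump(build_mixture("video_200k.json", "synthetic_10k.json"),
#           open("mixture_30k.json", "w"))
```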

🛠️ Quick Setup

1. Create a conda virtual environment and install the required packages.

   ```bash
   conda create -n t2vid python=3.9
   conda activate t2vid
   pip install -r requirements.txt
   ```

2. Install Flash Attention 2 (for efficient training and inference).

   ```bash
   pip install -U flash-attn --no-build-isolation
   ```
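Optionally, you can check that the install succeeded from Python (recent `flash-attn` releases expose a `__version__` attribute):

```python
# Quick sanity check that Flash Attention 2 is importable in the environment.
import flash_attn
print(flash_attn.__version__)
```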

💡 Training & Evaluation

The instructions on training and evaluation (including pre-trained weights) are in TRAIN.md and EVAL.md.

📖 Misc

For those interested in the implementation details of our paper:

🙌 Related Projects

🌻 Acknowledgement

🖋️ Citation

If you find our project useful, please consider citing our paper:

```bibtex
@article{yin2024t2vid,
  title={T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Shen, Yunhang and Ge, Chunjiang and Yang, Yan and Long, Zuwei and Dai, Yuhan and Xu, Tong and Sun, Xing and others},
  journal={arXiv preprint arXiv:2411.19951},
  year={2024}
}
```