<p align="center"> <h2 align="center">MovieLLM: <br> Enhancing Long Video Understanding with AI-Generated Movies</h2> <p align="center"> <a href="https://github.com/Deaddawn"><strong>Zhende Song</strong></a> · <a href="https://github.com/doctorlightt"><strong>Chenchen Wang</strong></a> · <a href="https://github.com/sjmFDU"><strong>Jiamu Sheng</strong></a> · <a href="https://icoz69.github.io/"><strong>Chi Zhang†</strong></a> · <a href="https://scholar.google.com/citations?hl=zh-CN&user=gsLd2ccAAAAJ"><strong>Jiayuan Fan✦</strong></a> · <a href="https://eetchen.github.io/"><strong>Tao Chen</strong></a> <br> ( † Project Leader, ✦ Corresponding Author ) <br> From Fudan University and Tencent PCG <br> <br> <a href="https://arxiv.org/abs/2403.01422"> <img src='https://img.shields.io/badge/arxiv-MovieLLM-b31b1b.svg' alt='Paper PDF'></a> <a href="https://deaddawn.github.io/MovieLLM/"> <img src='https://img.shields.io/badge/Project-Website-green' alt='Project Page'></a> </p> </p>

<image src="docs/fig1.png" />

We propose MovieLLM, a novel framework for creating synthetic, high-quality data for long videos. The framework leverages GPT-4 and text-to-image models to generate detailed scripts and the corresponding visuals.

## Changelog
- [2024.03.03]: Release inference code, evaluation code and model weights.
- [2024.03.13]: Release raw data; check it out here
- [2024.07.02]: All generation code will be released after the work is accepted.
## Summary
This repository hosts the data generation code, training code, and video evaluation code for MovieLLM. It is built on top of LLaMA-VID. We plan to first release the model, inference, and evaluation code, and the rest afterwards.
<b>For a better understanding of our training and evaluation process, we suggest working through the LLaMA-VID code first.</b>
## Contents

- [Install](#install)
- [Model](#model)
- [Preparation](#preparation)
- [Pipeline](#pipeline)
- [Training](#training)
- [Inference](#inference)
- [Evaluation](#evaluation)
- [Results](#results)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)
## Install
Please follow the instructions below to install the required packages. Our training process is mainly based on LLaMA-VID, and our short-video evaluation process is mainly based on the quantitative_evaluation code from Video-ChatGPT.
- Clone this repository
```bash
git clone https://github.com/Deaddawn/MovieLLM-code.git
```
- Clone LLaMA-VID repository
```bash
cd MovieLLM-code
git clone https://github.com/dvlab-research/LLaMA-VID.git
mv eval_movie_qa.py calculate.py LLaMA-VID
mv run_llamavid_movie_answer.py LLaMA-VID/llamavid/serve
```
- Install Package
```bash
conda create -n MovieLLM python=3.10 -y
conda activate MovieLLM
cd LLaMA-VID
pip install -e .
```
- Install additional packages for video training
```bash
pip install ninja
pip install flash-attn --no-build-isolation
```
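Optionally, you can sanity-check the environment afterwards. The snippet below is our own convenience check, not part of the official setup:

```python
# Quick environment sanity check (optional, not part of the official setup).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is missing; rerun: pip install flash-attn --no-build-isolation")
```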
## Model
We provide our baseline model and the model trained on our generated dataset. All models are trained with stage 3 of LLaMA-VID. For more details, please refer to LLaMA-VID-model.
Type | Max Token | Base LLM | Finetuning Data | Finetuning schedule | Download |
---|---|---|---|---|---|
Long video | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + LongVideoQA | full_ft-1e | ckpt |
Long video | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + LongVideoQA+MovieLLMQA | full_ft-1e | ckpt |
## Preparation
This section explains how to set up the data and model environment for LLaMA-VID. Again, we suggest first working through the original LLaMA-VID-preparation; this section is adapted from it with some alterations.
### Dataset
We provide the raw dataset generated by our pipeline as well as the related training data based on LLaMA-VID.
#### Our Raw Data
The data generated by our pipeline consists of key frame images with corresponding QA pairs and dialogues. You can download it here: MovieLLM-Data

<image src="docs/tuning_data_distribution.png" />
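Once downloaded, you can take a quick look at the data with a few lines of Python. Note that the folder and field names below (`example_movie`, `qa.json`, `keyframes`, etc.) are placeholders for illustration; adjust them to the actual layout of the MovieLLM-Data release:

```python
# Minimal sketch for inspecting the raw data; file/field names are assumptions,
# adjust them to match the actual MovieLLM-Data layout.
import json
from pathlib import Path

movie_dir = Path("MovieLLM-Data/example_movie")   # hypothetical per-movie folder

with open(movie_dir / "qa.json") as f:            # assumed QA file name
    qa_pairs = json.load(f)
print(f"{len(qa_pairs)} QA pairs, first one: {qa_pairs[0]}")

keyframes = sorted((movie_dir / "keyframes").glob("*.jpg"))  # assumed image folder
print(f"{len(keyframes)} key frames")
```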
#### Training Data
To run stage-3 training of LLaMA-VID, processed video data and the corresponding QA pairs are needed:
(1) Processed Video Data
We first preprocess the raw data from MovieNet (used in the original LLaMA-VID paper) and the raw data generated by our pipeline.
For the MovieNet data, please first download the long video data from MovieNet and the shot detection results from here. Place the shot detection results under LLaMA-VID-Finetune/movienet/files before preprocessing, then follow the preprocess-instruct to preprocess your data (a rough sketch of the shot-based idea is given below).
For our processed data, please download it here: MovieLLM-feat (coming soon).
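For reference, here is a rough sketch of the shot-based key-frame extraction idea mentioned above. This is not the official preprocessing script (follow the preprocess-instruct for that), and it assumes each shot detection file lists one `start_frame end_frame` pair per line:

```python
# Rough sketch of shot-based key-frame extraction; NOT the official preprocessing.
# Assumes the shot file lists "start_frame end_frame" per line.
import cv2
from pathlib import Path

def extract_shot_keyframes(video_path, shot_file, out_dir):
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    with open(shot_file) as f:
        shots = [tuple(map(int, line.split()[:2])) for line in f if line.strip()]
    for i, (start, end) in enumerate(shots):
        cap.set(cv2.CAP_PROP_POS_FRAMES, (start + end) // 2)  # middle frame of the shot
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(f"{out_dir}/shot_{i:04d}.jpg", frame)
    cap.release()
```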
(2) Corresponding QA Pairs
For the corresponding QA pairs, please download them from here (a quick way to inspect their format is sketched after the table):
Data file name | Size |
---|---|
long_videoqa_base.json | 240MB |
long_videoqa_ours.json | 245MB |
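These files are expected to follow the LLaMA-VID stage-3 instruction-tuning format; the quickest way to confirm the exact fields is to print one entry:

```python
# Print one sample to confirm the field layout before launching training.
import json

with open("data/LLaMA-VID-Finetune/long_videoqa_ours.json") as f:
    samples = json.load(f)

print(len(samples), "training samples")
print(json.dumps(samples[0], indent=2))  # e.g. conversations and video/feature reference
```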
### Pretrained Weights
Please download the pretrained weights from the following links: EVA-ViT-G, QFormer-7b.
### Structure
Please organize the video data, QA pairs, and weights in the following structure:
```
LLaMA-VID
├── llamavid
├── scripts
├── work_dirs
│   ├── llama-vid
│   │   ├── llama-vid-7b-full-224-long-video-MovieLLM
│   │   ├── llama-vid-7b-full-224-long-video-baseline
├── model_zoo
│   ├── LAVIS
│   │   ├── eva_vit_g.pth
│   │   ├── instruct_blip_vicuna7b_trimmed.pth
├── data
│   ├── LLaMA-VID-Finetune
│   │   ├── long_videoqa_base.json
│   │   ├── long_videoqa_ours.json
│   │   ├── movienet
│   │   ├── story_feat
│   ├── LLaMA-VID-Eval
│   │   ├── MSRVTT-QA
│   │   ├── MSVD-QA
│   │   ├── video-chatgpt
```
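Before training or evaluation, you may want to verify the layout with a small helper like the one below (our own convenience script, not part of the repo); run it from the LLaMA-VID root:

```python
# Convenience check that the expected files/folders from the layout above exist.
from pathlib import Path

required = [
    "model_zoo/LAVIS/eva_vit_g.pth",
    "model_zoo/LAVIS/instruct_blip_vicuna7b_trimmed.pth",
    "data/LLaMA-VID-Finetune/long_videoqa_base.json",
    "data/LLaMA-VID-Finetune/long_videoqa_ours.json",
    "data/LLaMA-VID-Finetune/movienet",
    "data/LLaMA-VID-Finetune/story_feat",
]
missing = [p for p in required if not Path(p).exists()]
print("All expected paths found." if not missing else f"Missing: {missing}")
```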
## Pipeline
Coming soon.

<image src="docs/PIPELINE.png" />
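While the generation code is not yet released, the figure above summarizes the idea: GPT-4 writes a movie-level plot broken into scene descriptions, and a text-to-image model renders key frames for those scenes. The sketch below is only our reading of that idea, not the released pipeline; the model names, prompt, and loop structure are assumptions:

```python
# Conceptual sketch of the generation idea (GPT-4 script -> text-to-image key frames).
# This is NOT the official MovieLLM pipeline; model names and prompts are assumptions.
import torch
from openai import OpenAI
from diffusers import StableDiffusionPipeline

client = OpenAI()  # expects OPENAI_API_KEY in the environment
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# 1) Ask GPT-4 for a short movie plot, one visual scene description per line.
plot = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Write a 5-scene movie plot, one concise visual description per line."}],
).choices[0].message.content
scenes = [line.strip() for line in plot.splitlines() if line.strip()]

# 2) Render one key frame per scene with the text-to-image model.
for i, scene in enumerate(scenes):
    t2i(scene).images[0].save(f"keyframe_{i:02d}.png")
```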
## Training
Coming soon.
## Inference
For long-video inference with LLaMA-VID, please follow LLaMA-VID-Long-video-preprocess to process your video. Then run the following for long-video inference (point `--model-path` at the checkpoint you downloaded):
```bash
cd LLaMA-VID
python llamavid/serve/run_llamavid_movie.py \
    --model-path work_dirs/llama-vid/llama-vid-7b-full-224-long-video \
    --video-file <path to your processed video file> \
    --load-4bit
```
## Evaluation
We perform evaluation on both short and long videos.
### Short video
For short video evaluation, please download the evaluation data following Preparation and organize them as in Structure.
#### Results for short video
Model | MSVD-QA (Acc) | MSVD-QA (Score) | MSRVTT-QA (Acc) | MSRVTT-QA (Score) | Correctness | Detail | Context | Temporal | Consistency |
---|---|---|---|---|---|---|---|---|---|
Baseline | 49.3 | 3.169 | 43.5 | 2.865 | 1.94 | 2.431 | 2.701 | 1.585 | 1.699 |
Ours | 56.7 | 3.46 | 51.3 | 3.141 | 2.154 | 2.549 | 2.88 | 1.832 | 1.976 |
For MSVD-QA evaluation:
```bash
bash scripts/video/eval/msvd_eval.sh
```
For MSRVTT-QA evaluation:
```bash
bash scripts/video/eval/msrvtt_eval.sh
```
### Long video
To run long-video evaluation, please first download the corresponding test data and QAs.
Then run the following to generate answers for the two models (our evaluation method compares the two answers against the reference answer):
```bash
python llamavid/serve/run_llamavid_movie_answer.py \
    --model-path <your-model-path> \
    --video-file <test-data-path> \
    --output_path <path-for-saving-answers> \
    --load-4bit \
    --meta_path <QA-path>
```
Note that in the paper we run the above for both the baseline model and the model trained on our data, so you should end up with one answer folder per model.
You should then have three folders, one each for the ground truth, the predictions from model 1, and the predictions from model 2, like the following:
```
res
|-- baseline
|-- ground_truth
|-- ours
```
Then run:
```bash
python eval_movie_qa.py \
    --output_dir ./test/compare_res \
    --api_key <your-api-key> \
    --gt_dir ./res/ground_truth \
    --method_dir ./res/ours \
    --base_dir ./res/baseline
```
Finally, run:
```bash
python calculate.py --path ./test/compare_res
```
#### Results for long video
<image src="docs/long_video_res.png" />Results
### Generation Results
<image src="docs/res1.png" />Comparison Results
<image src="docs/res2.png" />Citation
If you find our work useful, please consider citing:
```bibtex
@misc{song2024moviellm,
    title={MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies},
    author={Zhende Song and Chenchen Wang and Jiamu Sheng and Chi Zhang and Gang Yu and Jiayuan Fan and Tao Chen},
    year={2024},
    eprint={2403.01422},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```
## Acknowledgement
We would like to thank the following repos for their great work:
- Our experiments are conducted based on LLaMA-VID.
- We perform short-video evaluation based on Video-ChatGPT.
- We build our pipeline based on textual-inversion.