Home

Awesome

<img src="src/assets/logo.png" height="120px" align="left">

MovieChat

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang✉️
CVPR 2024.

MovieChat can handle videos with >10K frames on a 24GB graphics card. MovieChat has a 10000× advantage over other methods in terms of the average increase in GPU memory cost per frame (21.3KB/f to ~200MB/f).

<p align="center" width="100%"> <a target="_blank"><img src="src/assets/wave.gif" alt="MovieChat" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a> </p> <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest update.</h5>

:fire: News

PWC
PWC
PWC PWC PWC

✨How to run MovieChat quickly?

We have packaged MovieChat and uploaded it to PyPI. To run MovieChat quickly, you need to install it firstly.

pip install MovieChat

We advise you to install version 0.6.3 for now. Since MovieChat will download checkpoints from Huggingface automatically, if your service doesn't support git clone from <HuggingFace url>, we recommend you to download the checkpoint to your service, and change the respective path in the package, including q_former_model, ckpt_path, and llama_model.

Before you run the following inference code, we hope you can verify the installation of ffprobe via ffprobe -version. This command should return the version of ffprobe if it is correctly installed. Otherwise, you should install it via sudo apt-get install ffmpeg (Ubuntu).

from PIL import Image
import cv2

from MovieChat.processors.video_processor import AlproVideoEvalProcessor
from MovieChat.models.chat_model import Chat
from MovieChat.models.moviechat import MovieChat

device = 'cuda:0'
print('Initializing Chat')
moviechat_model = MovieChat.from_config(device=device).to(device)
vis_processor_cfg = {'name': 'alpro_video_eval', 'n_frms': 8, 'image_size': 224}
frame_processor = AlproVideoEvalProcessor.from_config(vis_processor_cfg)
chat = Chat(moviechat_model, frame_processor, device=device)
print('Initialization Finished')

video_path = "Your video path, end with mp4"
fragment_video_path = "The path to store tmp video clips"
middle_video = False # True->Breakpoint mode, False->Global mode
question = "Your Question"
cur_min = 0 # Change it when Breakpoint mode
cur_sec = 0 # Change it when Breakpoint mode

cap = cv2.VideoCapture(video_path)
cur_fps = cap.get(cv2.CAP_PROP_FPS)
cap.set(cv2.CAP_PROP_POS_FRAMES, cur_fps)
ret, frame = cap.read()
rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
pil_image = Image.fromarray(rgb_frame)
image = chat.image_vis_processor(pil_image).unsqueeze(0).unsqueeze(2).half().to(device)
cur_image = chat.model.encode_image(image)

img_list = []
msg = chat.upload_video_without_audio(
    video_path=video_path, 
    fragment_video_path=fragment_video_path,
    cur_min=cur_min, 
    cur_sec=cur_sec, 
    cur_image=cur_image, 
    img_list=img_list, 
    middle_video=middle_video,
    question=question
)
answer = chat.answer(
    img_list=img_list,
    input_text=question,
    msg = msg,
    num_beams=1,
    temperature=1.0,
    max_new_tokens=300,
    max_length=2000)[0]

print(answer)

Note that if you receive a RuntimeError like "Error reading <filename.mp4>", one solution is to initialize <filename.mp4> with any other video file.

💡 Overview

📣 Demo Video

Alt text

⚡ Comparison Case

<div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;"> Question and answer about a clip from YouTube, which is a tutorial on how to cook steak. The entire instructional process begins with marinating the steak, followed by pan-searing it, preparing side dishes, and ultimately plating the meal. Green ( Red ) highlights the correct (wrong) answer and yellow indicates that the model is hallucinating. </div> <p align="center" width="100%"> <a target="_blank"><img src="src/compare_case.png" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a> </p>

😍 Examples

<div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;"> Question and answer about clips from Zootopia, a cartoon, which tells the story of a determined police officer rabbit named Judy who pairs up with a cunning fox to uncover a conspiracy about missing animals and develop an unexpected friendship. </div> <p align="center" width="100%"> <a target="_blank"><img src="src/example1_00.png" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a> </p> <div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;"> Question and answer about clips from Goblin, which tells the story of Kim Shin, an immortal ”goblin” who needs to find a human bride to end his endless life but instead meets Ji Eun-tak, a girl fated to die who claims to be the ”goblin’s bride,” leading to a romantic tale unfolding bet. </div> <p align="center" width="100%"> <a target="_blank"><img src="src/example2_00.png" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a> </p> <div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;"> Question and answer about clips from Game of Thrones, which tells the epic fantasy tale of power struggles and political intrigue among the Seven Kingdoms, entwined with intricate family relationships, all set against the backdrop of an ancient, mystical threat. </div> <p align="center" width="100%"> <a target="_blank"><img src="src/example3_00.png" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a> </p> <div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;"> Question and answer about clips from YouTube, which contains a compilation of some inspirational movies scenes. This video clip comprises several segments from The Death Crawl, Coach Carter, Rocky Balboa, and We Are Marshall, which vary in duration. </div> <p align="center" width="100%"> <a target="_blank"><img src="src/example4_00.png" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a> </p>

🚀 Benchmark: MovieChat-1K

To better evaluate the performance of MovieChat, we collect a new benchmark for long video understanding tasks, MovieChat-1K, which contains 1K high quality video clips sourced from various movies and TV series with 14K manual annotations.

To the best of our knowledge, a long video understanding dataset has not yet been established. Our work represents the initial step in creating and making it publicly available.We create MovieChat1K, containing 1k long videos and corresponding 1k dense captions, and 13k visual question-answer pairs.For each video, we manually set and provide 1 dense caption for the whole video, 3 question-answering pairs for global mode and 10 question-answering pairs with timestamps for breakpoint mode.

<p align="center" width="100%"> <a target="_blank"><img src="src/benchmark/dataset1.png" style="width: 100%; min-width: 200px; display: block; margin: auto;"></a>

We collect videos from 15 popular categories with varying distribution, including documentary film, detective film, animation film, and so on. Among these, each video comprises multiple alternating scenes, contributing to a diverse and dynamic visual narrative within the context of the collection. Over 90% of the videos exhibit a duration ranging from 10K to 12K frames, while 14.6% of videos extending beyond 12K frames. Only 8.6% of videos have duration less than 10k frames.

Question-answering Pairs

Word Distribution

Note that MovieChat-1K is specifically designed for long video comprehension tasks, the majority of questions are open-ended, with only a quarter classified as multiple-choice questions, marked by initiators such as ‘Do,’ ‘Does,’ ‘Is,’ or ‘Are.’ We also compute the word distributions of our provided question-answer pairs, which includes common objects (people, clothes, etc.), time (day, night, etc.), scenes (indoor, outdoor, etc.), and so on.

<p align="center" width="100%"> <a target="_blank"><img src="src/benchmark/wordcloud.png" style="width: 40%; min-width: 200px; display: block; margin: auto;"></a>

Sentence length distribution

MovieChat1K exhibits diverse lengths of question-answer pairs in the segmented clip level. Despite the distribution of questionanswer pairs varies between the global mode and breakpoint mode, the majority of questions tends to concentrate between 5-15 words in length, while the length of answers generally have fewer than 10 words.

<p align="center" width="100%"> <a target="_blank"><img src="src/benchmark/length.png" style="width: 70%; min-width: 200px; display: block; margin: auto;"></a>

Dense Captions

To facilitate a more detailed understanding of long videos, we provide a dense caption for each video. MovieChat-1K exhibits diverse caption lengths in the segmented clip level. Approximately two-thirds of the clips have captions with 100-149 words, while one-fifth of the clip captions have fewer than 100 words. About 11% of clips have long captions with more than 150 words.

<p align="center" width="100%"> <a target="_blank"><img src="src/benchmark/caption_dis.png" style="width: 40%; min-width: 200px; display: block; margin: auto;"></a>

To analyze the word distribution of our generated captions, we compute their distributions. The resulting word distribution of the captions is presented in Fig. B6, which includes common objects (man, woman, people, girl, etc.), attributes (detective, various, small, white, etc.), locations (inside, behind, south, next, etc.), scenes (room, house, building, office, etc.), actions/events (talk, enter, leave, take, etc.), and more.

<p align="center" width="100%"> <a target="_blank"><img src="src/benchmark/caption_wordcloud.png" style="width: 45%; min-width: 200px; display: block; margin: auto;"></a>

In terms of actionness, MovieChat-1K captions contains nearly the same number of verbs as with the WebVid10M dataset. To evaluate this, we use the NLTK toolkit to analyze the number of verbs in captions, focusing on extracting and tagging all unique verbs. We find a total of 109,485 verbs in the WebVid10M caption dataset, while the MovieChat-1K captions contain 102,988 unique instances of verbs. While these counts may not be entirely accurate due to our simple counting method, we believe they provide a rough indication of the actionness of the two datasets.

<!-- ## Comparison between MovieChat-1K and other benchmarks MovieChat-1K provides a large-scale benchmark for long video understanding, which contains 1K movies, 1K dense captions and 13k question-answer pairs. The comparison between different datasets are shown in Tab. 8. It is evident that MovieChat-1K provides the longest average duration for movie clips. MovieQA exclusively offers question-answer pairs related to movies, while MovieGraphs supplies captions associated with movies. Unlike other datasets, MovieNet encompasses three main types of texts: subtitle, synopsis, and script, excluding question-answer pairs. Additionally, the synopsis category is designed for the entire movie rather than video clips. Consequently, MovieChat-1K is more suitable for studying long video comprehension compared to other datasets. <div align="center"> <table border="1" width="100%"> <tr align="center"> <th>Dataset</th><th>Avg. Duration (min)</th><th>Number of Captions</th><th>Avg. Caption Length</th><th>Number of Question-Answer Pairs</th><th>Avg. Question Length</th><th>Avg. Answer Length</th> </tr> <tr align="center"> <td><a href="https://arxiv.org/abs/1512.02902">MovieQA</a></td><td>3.5</td><td>-</td><td>-</td><td>14.9K</td><td>9.3</td><td>5.1</td> </tr> </tr> <tr align="center"> <td><a href="https://arxiv.org/abs/1712.06761">MovieGraphs</a></td><td>0.73</td><td>15K</td><td>35</td><td>-</td><td>-</td><td>-</td> </tr> <tr align="center"> <td><a href="https://arxiv.org/abs/2007.10937">MovieNet</a></td><td>2.1</td><td>2.5K</td><td>-</td><td>-</td><td>-</td><td>-</td> </tr> <tr align="center"> <td>MovieChat-1K</td><td>9.4</td><td>1K</td><td>121</td><td>13K</td><td>7.8</td><td>2.3</td> </tr> </table> </div> -->

🔐 © Due to the copyright concers and the size limitations of the movies, we plan to release the features of the dataset. Please wait for a few weeks.

🛠️ Install

Environment Preparation

First, ceate a conda environment:

conda env create -f environment.yml
conda activate moviechat

Prerequisites

Before using the repository, make sure you have obtained the following checkpoints:

Pre-trained Language Decoder

python apply_delta.py \
    --base ckpt/LLaMA/7B_hf \
    --target ckpt/Vicuna/7B \
    --delta ckpt/Vicuna/vicuna-7b-delta-v0 \

Pre-trained Visual Encoder for MovieChat

Download Pretrained Weights

🤖 How to Run Demo Locally

Firstly, set the llama_model, llama_proj_model and ckpt in eval_configs/MovieChat.yaml. Then run the script:

python inference.py \
    --cfg-path eval_configs/MovieChat.yaml \
    --gpu-id 0 \
    --num-beams 1 \
    --temperature 1.0 \
    --text-query "What is he doing?" \
    --video-path src/examples/Cooking_cake.mp4 \
    --fragment-video-path src/video_fragment/output.mp4 \
    --cur-min 1 \
    --cur-sec 1 \
    --middle-video 1 \

Note that, if you want to use the global mode (understanding and question-answering for the whole video), remember to change middle-video into 0.

<!-- ## 👍 Main Results ### Short video question-answering We use several widely used open-ended datasets: MSVD-QA, MSRVTT-QA, and ActivityNet-QA for short video question-answering tasks. The evaluation process is under the assistance of LLM with the default hyper-parameter settings. The accuracy and relative scores on a scale of 0 to 5 are reported. Compared to previous methods, MovieChat achieves comparable performance even it is not specifically designed for short video question-answering tasks, <div align="center"> <table border="1" width="100%"> <tr align="center"> <th>Methods</th><th>LLM</th><th>Conversation</th><th>Detail Description</th><th>Complex Reasoning</th><th>All</th> </tr> <tr align="center"> <td><a href="https://huggingface.co/Chat-UniVi/Chat-UniVi">Chat-UniVi-7B</a></td><td><a href="https://huggingface.co/lmsys/vicuna-7b-v1.5">Vicuna-7B</a></td><td><b>84.1</b></td><td>74.2</td><td>93.7</td><td>84.2</td> </tr> </tr> <tr align="center"> <td><a href="https://huggingface.co/Chat-UniVi/Chat-UniVi-13B">Chat-UniVi-13B</a></td><td><a href="https://huggingface.co/lmsys/vicuna-13b-v1.5">Vicuna-13B</a></td><td><b>84.1</b></td><td><b>79.4</b></td><td><b>94.7</b></td><td><b>86.1</b></td> </tr> </table> </div> -->

🤝 Acknowledgement

We are grateful for the following awesome projects our MovieChat arising from:

🔒 Term of Use

Our MovieChat is just a research preview intended for non-commercial use only. You must NOT use our MovieChat for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that will potentially violate these guidelines.

✏️ Citation

If you find MovieChat useful for your your research and applications, please cite using this BibTeX:

@article{song2023moviechat,
  title={MovieChat: From Dense Token to Sparse Memory for Long Video Understanding},
  author={Song, Enxin and Chai, Wenhao and Wang, Guanhong and Zhang, Yucheng and Zhou, Haoyang and Wu, Feiyang and Guo, Xun and Ye, Tian and Lu, Yan and Hwang, Jenq-Neng and others},
  journal={arXiv preprint arXiv:2307.16449},
  year={2023}
}