LVChat
This is the official implementation of our paper LVChat: Facilitating Long Video Comprehension. Our code base is built on the repo Ask-Anything.
Environment Preparation
conda create --name lvchat python=3.11
conda activate lvchat
pip install -r requirements.txt
Datasets
We used the instruction data for training. Specifically, we used the following subsets (please refer to the link here, which includes all the JSON files needed for training):
conversation_videochat1
conversation_videochat2
conversation_videochatgpt
caption_videochat
reasoning_clevrer_qa
reasoning_clevrer_mc
reasoning_next_qa
To replicate our training for Frame Scalable Encoding (FSE), please download the datasets Clevrer, NExT-QA, VideoChatGPT, and WebVid-10M (note that this dataset is no longer available), as well as the JSON files from VideoChat2-IT. Then arrange the datasets in the following structure:
- data
- ANet
- activitynet_train_videos_video_chatgpt
- anno
- video
- caption
- conversation
- reasoning
- clevrer
- internvid-10s (This is the instruction dataset collected by VideoChat2. The videos are from InternVid (https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid). Because the data is very large, you may need to download the videos yourself. For example, in "LLU5X98aozs_648.258.mp4", "LLU5X98aozs" is the YouTube ID and "648.258" is the start time; the video clip duration is 10s. Thanks to Kunchang Li, the author of VideoChat2, for offering the link and instructions.)
- nextqa
- WebVid10M (All the videos of VideoChat v1 data are from here)
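If you fetch the internvid-10s clips yourself, the file-name convention described above can be decoded with a few lines of Python. This is a minimal sketch, not part of this repository: `ClipInfo` and `parse_clip_name` are hypothetical names, and it assumes every clip follows the `<YouTube ID>_<start time>.mp4` pattern with a fixed 10s duration.

```python
from pathlib import Path
from typing import NamedTuple


class ClipInfo(NamedTuple):
    """Where a clip comes from (hypothetical helper type)."""
    youtube_id: str
    start_seconds: float
    duration_seconds: float


def parse_clip_name(filename: str, duration_seconds: float = 10.0) -> ClipInfo:
    """Split '<YouTube ID>_<start time>.mp4' into its parts.

    Assumption: the start time is everything after the LAST underscore,
    since YouTube IDs may themselves contain underscores.
    """
    stem = Path(filename).stem  # drops the '.mp4' extension
    youtube_id, _, start = stem.rpartition("_")
    if not youtube_id:
        raise ValueError(f"unexpected clip name: {filename!r}")
    return ClipInfo(youtube_id, float(start), duration_seconds)


if __name__ == "__main__":
    print(parse_clip_name("LLU5X98aozs_648.258.mp4"))
```

The sketch only recovers the YouTube ID and the clip's start offset; downloading and trimming the video is left to whichever tool you prefer.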
Base model preparation
- Download the VideoBLIP model.
wget -P video_models https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/videochat2/umt_l16_qformer.pth
- Follow here to prepare vicuna-7b-v0 and place it under video_models
Training with Frame Scalable Encoding (FSE)
Download the model videochat2_7b_stage3.pth from here, then put it under the folder video_models. The folder video_models should now have the following structure:
- video_models
- vicuna-7b-v0
l16_25m.pth
umt_l16_qformer.pth
videochat2_7b_stage3.pth
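Before launching training, it can help to confirm that the listing above is complete. The following is a minimal sketch, not part of this repository: `missing_entries` and `EXPECTED_ENTRIES` are hypothetical names, and it assumes all four entries sit directly under video_models.

```python
from pathlib import Path

# Entries taken from the listing above; assumed to sit directly under video_models.
EXPECTED_ENTRIES = (
    "vicuna-7b-v0",
    "l16_25m.pth",
    "umt_l16_qformer.pth",
    "videochat2_7b_stage3.pth",
)


def missing_entries(root: str = "video_models", expected=EXPECTED_ENTRIES) -> list[str]:
    """Return the expected files or folders that are absent under `root`."""
    root_path = Path(root)
    return [name for name in expected if not (root_path / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing under video_models:", ", ".join(missing))
    else:
        print("video_models looks complete.")
```

The check only tests for existence; it does not verify file sizes or checksums.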
For validation, please refer to the Evaluation section below to download MVBench and put the dataset under the folder ./MVBench.
Then run the following script (remember to set the number of GPUs via NUM_GPUS in the file).
sh run_7b_stage4.sh
Evaluation
Download MVBench
Download from Hugging Face and place it under ./MVBench. The file structure under MVBench is:
- assert
- json
- video
.gitattributes
README.md
Prepare street-scene data (required if you want to use the extended MVBench data)
bash download_street_scnene.sh
Prepare LV-Chat Model
Please download the model from LV-Chat. Put the pth file 7b_stage4.pth under the folder video_models.
Evaluate LV-Chat on MVBench
Run the script to test our model; the results will be written to logs:
bash run_mvbench.sh
You can also run the baseline (VideoChat2) using:
bash run_mvbench.sh --config ./configs/config_videochat2.json
Evaluate LV-Chat on Real-world datasets
TACoS
- Download the TACoS dataset from here and place the videos folder under ./TACoS.
- Download the GPT-4 generated summary:
wget -P ./TACoS https://huggingface.co/datasets/Kevin99z/tacos_summary/resolve/main/summary.json
- Evaluate TACoS
bash run_tacos.sh # add --config ./configs/config_videochat2.json to test the baseline
EgoSchema
- Download EgoSchema here and place it under ./EgoSchema.
- Evaluate EgoSchema
bash run_egoschema.sh # add --config ./configs/config_videochat2.json to test the baseline
If you find our paper or code useful, please consider citing our paper.