SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

This project accompanies the research paper,

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models <br> Mingze Xu*, Mingfei Gao*, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan

<p align="center"> <img src="assets/teaser.png" width="600"> </p>

SlowFast-LLaVA is a training-free multimodal large language model (LLM) for video understanding and reasoning. Without fine-tuning on any data, it achieves performance comparable to, or better than, state-of-the-art Video LLMs on a wide range of VideoQA tasks and benchmarks, as shown in the figure above.

Table of contents

  • Getting Started
      • Installation
      • Data Preparation
  • Configuration
  • Inference and Evaluation
  • Demo
  • License
  • Citations

Getting Started

Installation
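
The installation steps were not captured in this copy of the README; the following is only a minimal sketch assuming a standard conda + pip workflow. The repository URL, environment name, Python version, and the presence of a requirements.txt are all assumptions, not taken from the source.

# Hypothetical setup -- consult the repository for the actual steps
git clone https://github.com/apple/ml-slowfast-llava.git   # assumed URL
cd ml-slowfast-llava
conda create -n sf-llava python=3.10 -y                    # assumed name/version
conda activate sf-llava
pip install -r requirements.txt                            # assumed file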

Data Preparation

  1. We prepare the ground-truth question and answer files based on IG-VLM and put them under playground/gt_qa_files.

    • MSVD-QA
      • Download the MSVD_QA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_msvd_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • MSRVTT-QA
      • Download the MSRVTT_QA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_msrvtt_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • TGIF-QA
      • Download the TGIF_FrameQA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_tgif_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • ActivityNet-QA
      • Download the Activitynet_QA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_activitynet_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • NExT-QA
      • Download the NExT_QA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_nextqa_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • EgoSchema
      • Download the EgoSchema.csv from here
      • Reformat the files by running
        python scripts/data/prepare_egoschema_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • IntentQA
      • Download the IntentQA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_intentqa_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • VCGBench
      • Download all files under text_generation_benchmark
      • Reformat the files by running
        python scripts/data/prepare_vcgbench_qa_file.py --qa_folder $TEXT_GENERATION_BENCHMARK
        
  2. Download the raw videos from the official websites.

    • Openset VideoQA

    • Multiple Choice VideoQA

    • Text Generation

      • The videos are based on ActivityNet, so you can reuse the videos already downloaded for Openset VideoQA.
  3. Organize the raw videos under playground/data.

    • To use our data loaders directly without changing any paths, please organize your datasets as follows (a sketch for assembling this layout with symlinks is given after the tree)

      $ ml-slowfast-llava/playground/data
          ├── video_qa
              ├── MSVD_Zero_Shot_QA
                  ├── videos
                      ├── ...
              ├── MSRVTT_Zero_Shot_QA
                  ├── videos
                      ├── all
                          ├── ...
              ├── TGIF_Zero_Shot_QA
                  ├── mp4
                      ├── ...
              ├── Activitynet_Zero_Shot_QA
                  ├── all_test
                      ├── ...
          ├── multiple_choice_qa
              ├── NExTQA
                  ├── video
                      ├── ...
              ├── EgoSchema
                  ├── video
                      ├── ...
              ├── IntentQA
                  ├── video
                      ├── ...

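If it helps, the layout above can be assembled with symlinks once the raw videos are downloaded. A minimal sketch, where $DOWNLOADS and the per-dataset source paths are placeholders rather than paths from the source:

# Hypothetical helper -- source paths are placeholders; only the target
# layout matches the tree above
DATA=ml-slowfast-llava/playground/data
mkdir -p $DATA/video_qa/MSVD_Zero_Shot_QA $DATA/multiple_choice_qa/NExTQA
ln -s $DOWNLOADS/msvd/videos  $DATA/video_qa/MSVD_Zero_Shot_QA/videos
ln -s $DOWNLOADS/nextqa/video $DATA/multiple_choice_qa/NExTQA/video
# ...repeat for MSRVTT, TGIF, ActivityNet, EgoSchema, and IntentQA
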
Configuration

We use YAML configs to control the design choices of SlowFast-LLaVA. Below, the config of SlowFast-LLaVA-7B serves as an example to explain some of the important parameters.
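
The config schema itself is not reproduced in this copy, so the following is only a minimal sketch of what such a YAML file might contain. Apart from CUDA_VISIBLE_DEVICES, which the Inference and Evaluation section below confirms is set in the config, every key name and value here is an illustrative assumption:

# Hypothetical sketch -- all keys except CUDA_VISIBLE_DEVICES are assumed,
# not the project's actual schema
CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"   # GPUs used for inference
MODEL_PATH: path/to/llava-next-7b         # assumed: base image LLaVA weights
NUM_FRAMES: 50                            # assumed: frames sampled per video
DATASET: msvd_qa                          # assumed: benchmark to evaluate
OUTPUT_DIR: outputs/slowfast_llava_7b     # assumed: where predictions are written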

Inference and Evaluation

SlowFast-LLaVA is training-free, so inference and evaluation can be run directly, without any model training.

By default, we use 8 GPUs for model inference. Modify CUDA_VISIBLE_DEVICES in the config file to match your own setup. Please note that inference with SlowFast-LLaVA-34B requires GPUs with at least 80GB of memory.

cd ml-slowfast-llava
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE
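
For example, with a 7B config along the lines of the sketch in the Configuration section (the config file name here is hypothetical):

python run_inference.py --exp_config cfgs/slowfast_llava_7b.yaml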

Output Structures

Demo

We provide a script for running video question-answering on a single video.

cd ml-slowfast-llava
python run_demo.py --video_path $PATH_TO_VIDEO --model_path $PATH_TO_LLAVA_MODEL --question "Describe this video in detail"

License

This project is licensed under the Apple Sample Code License.

Citations

If you use the data, code, or models provided here in a publication, please cite our paper:

@article{xu2024slowfast,
	title={SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models},
	author={Xu, Mingze and Gao, Mingfei and Gan, Zhe and Chen, Hong-You and Lai, Zhengfeng and Gang, Haiming and Kang, Kai and Dehghan, Afshin},
	journal={arXiv preprint arXiv:2407.15841},
	year={2024}
}