TS-LLaVA

The first version of the code has been released.

This is the official implementation for TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

by Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens.

We explore various visual token compression strategies. TS-LLaVA achieves state-of-the-art performance among training-free video LLMs.

Our ModelScope repo is available at: https://www.modelscope.cn/models/tingyuqu/TS-LLaVA

Results

Multiple Choice VideoQA:

PWC

Multitask Benchmarks

Ranked #9 among all video LLMs by average accuracy on multiple-choice questions on the MLVU-test leaderboard.

Open-Ended VideoQA & Video-based Text Generation

PWC

Installation

Building the environment

To create the conda environment, run:

conda env create -n llava --file llava.yml
conda activate llava

Install additional packages (llava & flash-attention)

pip install flash-attn --no-build-isolation
pip install -e ".[train]"

Downloading the checkpoints:

The checkpoints for LLaVA-v1.6 can be found here:

git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b .ckpt/llava-v1.6-vicuna-7b
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-34b .ckpt/llava-v1.6-34b
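
If Git LFS is not set up correctly, a clone can finish with small pointer stubs in place of the real weight files. The sketch below is one way to spot that. The helper name `list_lfs_pointers` is hypothetical and not part of this repository; it relies only on the fact that a Git LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`.

```shell
# Hypothetical helper (not part of the repo): print files in a checkpoint
# directory that are still Git LFS pointer stubs rather than real weights.
list_lfs_pointers() {
    ckpt_dir="$1"
    for f in "$ckpt_dir"/*; do
        # Skip subdirectories and the unexpanded glob of an empty/missing dir.
        [ -f "$f" ] || continue
        # Pointer stubs are small text files that begin with this spec line.
        if head -c 64 "$f" | grep -q '^version https://git-lfs'; then
            echo "$f"
        fi
    done
}
```

Running `list_lfs_pointers .ckpt/llava-v1.6-vicuna-7b` should print nothing when every file has been fetched. If it prints any paths, running `git lfs pull` inside that directory is the usual way to fetch the missing content.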

[Optional] To enable GPT evaluation for open-ended video QA, please do the following:

export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
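
A missing key is easy to overlook until the evaluation step fails. As a minimal sketch, a guard like the following could be sourced before running GPT-based evaluation. The function name `require_openai_key` is an illustrative assumption and does not exist in this repository.

```shell
# Hypothetical helper (not part of the repo): fail early if the key is
# unset or empty, before any long-running evaluation starts.
require_openai_key() {
    if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "OPENAI_API_KEY is not set; GPT evaluation will not work." >&2
        return 1
    fi
    echo "OPENAI_API_KEY is set."
}
```

In a wrapper script, `require_openai_key || exit 1` stops the run before any inference time is spent.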

Dataset preparation

Multiple Choice VideoQA and Open-Ended VideoQA

  1. We prepare the ground-truth question and answer files based on IG-VLM and SF-LLaVA, and put them under playground/gt_qa_files.

    • NExT-QA: Download the NExT_QA.csv from here
    • EgoSchema: Download the EgoSchema.csv from here
    • IntentQA: Download the IntentQA.csv from here

    If you want to run our model for Open-Ended VideoQA and video-based Text Generation, also download the following datasets:

    • MSVD-QA: Download the MSVD_QA.csv from here
    • MSRVTT-QA: Download the MSRVTT_QA.csv from here
    • TGIF-QA: Download the TGIF_FrameQA.csv from here
    • Activitynet-QA: Download the Activitynet_QA.csv from here
    • VCGBench
      • Download all files under text_generation_benchmark
      • Reformat the files by running
        python scripts/data/prepare_vcgbench_qa_file.py --qa_folder $TEXT_GENERATION_BENCHMARK
        
  2. Reformatting the files:

    • After getting the CSV files, reformat them (apart from VCGBench) by running
      python scripts/data/prepare_{DATASET}_file.py --qa_file $PATH_TO_CSV_FILE
      
    • Replace DATASET with the name of the dataset. Check scripts/data to make sure the name is correct.
  3. Download the raw videos from the official websites.

    • Multiple Choice VideoQA

    • Open-Ended VideoQA & video-based Text Generation:

    • Store the videos in a directory of your choice (BASE_VIDEO_DIR), and replace BASE_VIDEO_DIR in the scripts where needed
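
For the reformatting step (item 2), the valid DATASET values can be read off the script file names instead of being guessed. The sketch below assumes only the `scripts/data/prepare_{DATASET}_file.py` naming pattern quoted above. The helper name `list_dataset_names` is hypothetical.

```shell
# Hypothetical helper (not part of the repo): list the DATASET names implied
# by files matching scripts/data/prepare_{DATASET}_file.py.
list_dataset_names() {
    script_dir="${1:-scripts/data}"
    for path in "$script_dir"/prepare_*_file.py; do
        [ -f "$path" ] || continue
        name="${path##*/prepare_}"   # drop the directory and "prepare_" prefix
        echo "${name%_file.py}"      # drop the "_file.py" suffix
    done
}
```

Run from the repository root, `list_dataset_names` prints one name per line. The VCGBench script also matches this pattern and would show up as `vcgbench_qa`, but it takes `--qa_folder` rather than `--qa_file`, as described in item 1.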

Multitask Benchmarks

  1. Download the data:
    • MVBench

      • Download the data from here
      • The official repo can be found here
    • MLVU

      • Download the data from here
      • The official repo can be found here
    • Store the videos in BASE_VIDEO_DIR

Inference and Evaluation

Multiple Choice VideoQA

cd scripts/infer_videos
bash run_qa_{DATASET_NAME}.sh {AGGREGATION_METHOD} {NUM_FRAMES} {NUM_SAMPLED_TOKENS} {PROMPT_VERSION} {IMAGE_ASPECT_RATIO}
Evaluation runs automatically after inference.
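
Because the five arguments are positional, it is easy to pass them in the wrong order. The sketch below names each one and prints the resulting command without running it. The values are copied from the MLVU example later in this README; treating them as the defaults for every dataset is an assumption, and `build_infer_cmd` is a hypothetical helper.

```shell
# Positional arguments, in the order the run_qa_*.sh scripts expect them.
# Values copied from the MLVU example in this README (assumed, not verified,
# to be suitable for other datasets).
AGGREGATION_METHOD=V2
NUM_FRAMES=50
NUM_SAMPLED_TOKENS=2880
PROMPT_VERSION=v4
IMAGE_ASPECT_RATIO=resize

# Hypothetical helper (not part of the repo): print the inference command
# for a given DATASET_NAME without executing it.
build_infer_cmd() {
    dataset_name="$1"
    echo "bash run_qa_${dataset_name}.sh $AGGREGATION_METHOD $NUM_FRAMES $NUM_SAMPLED_TOKENS $PROMPT_VERSION $IMAGE_ASPECT_RATIO"
}
```

For example, `build_infer_cmd mlvu_mcqa` prints `bash run_qa_mlvu_mcqa.sh V2 50 2880 v4 resize`. Once the printed command looks right, it can be run from scripts/infer_videos.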

Multitask Benchmarks

The default values of AGGREGATION_METHOD, NUM_FRAMES, NUM_SAMPLED_TOKENS, PROMPT_VERSION and IMAGE_ASPECT_RATIO are the same as for Multiple Choice VideoQA.

MLVU

cd scripts/infer_videos
bash run_qa_mlvu_mcqa.sh V2 50 2880 v4 resize
Submit the resulting JSON file to the official evaluation server (https://github.com/JUNJIE99/MLVU) for evaluation.

MVBench

cd scripts/infer_videos
bash run_qa_mlvu_mcqa.sh V2 50 2880 v4 resize {INPUT_FORMAT}
Evaluation runs automatically after inference.

Open-Ended VideoQA

The default value for PROMPT_VERSION is v3. The other arguments are the same as for Multiple Choice VideoQA.

Inference

cd scripts/infer_videos
bash run_qa_{DATASET_NAME}.sh V2 50 2880 v3 resize

Evaluation

cd scripts/eval
bash eval_{DATASET_NAME}.sh V2 50 2880 v3 resize {API_KEY}

For VCGBench (Video ChatGPT), the inference and evaluation procedures are similar. See run_gen_qa_{TASK_TYPE}.sh and eval_gen_qa.sh.

Acknowledgement

We extend our gratitude to the following awesome projects: LLaVA, FreeVA, IG-VLM and SF-LLaVA.

Citations

@article{qu2024tsllava,
    title={TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models}, 
    author={Tingyu Qu and Mingxiao Li and Tinne Tuytelaars and Marie-Francine Moens},
    year={2024},
    journal={arXiv preprint arXiv:2411.11066},
}