<img src="assets/run.ico" width=70 height=70/>

VELOCITI: Can Video Language Models Bind Semantic Concepts Through Time?

<p float="left"> <a href="https://arxiv.org/abs/2406.10889v1"> <img src="https://img.shields.io/badge/arXiv-2406.10889-b31b1b.svg" alt="arXiv"/> </a> <a href="https://katha-ai.github.io/projects/velociti/"> <img src="https://img.shields.io/badge/github%20pages-121013?style=for-the-badge&logo=github&logoColor=white" alt="Github Pages"/> </a> </p>

Welcome to VELOCITI! This repository provides code for evaluating models on VELOCITI, along with a Jupyter notebook to visualize all the data presented in the benchmark.

⭐️ For instant visualization of data samples, please visit our Project Page.

Set-Up for Visualizing Data 📊

Setting up the Environment for the CLIP Code and Data Visualiser

Create an environment with the environment manager of your choice, and install the requirements via

cd VELOCITI
# activate your conda or venv environment
pip install -r environments/clip_vis_requirements.txt

The code has been tested with Python 3.10.14.

Setting up Data 💿

pip install gdown
cd VELOCITI

Then, in a Python terminal or script:

import gdown
gdown.download('https://drive.google.com/uc?id=1aKxJL-xv6rS9ChqeLtokKIXBGaRMdD9w', 'velociti_data.zip')

Unzip the contents of the archive and ensure the following directory structure:

.
├── LICENSE.txt
├── data
│   ├── action_adv.json
│   ├── action_bind.json
│   ├── action_mod.json
│   ├── agent_bind.json
│   ├── agent_iden.json
│   ├── control.json
│   ├── coref.json
│   ├── frames  [900 entries]
│   ├── pos_caps.json
│   ├── sequence.json
│   ├── vidsitu_dict.json
│   └── videos
│       ├── velociti_videos_10s  [900 entries]
│       └── velociti_videos_4s  [900 entries]
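If you prefer to unzip and verify programmatically, here is a minimal Python sketch. It assumes the archive was saved as velociti_data.zip in the repository root (as in the gdown snippet above) and that it extracts LICENSE.txt and data/ directly into the current directory, matching the tree shown:

import zipfile
from pathlib import Path

# Extract the downloaded archive into the repository root
with zipfile.ZipFile("velociti_data.zip") as zf:
    zf.extractall(".")

# Sanity-check that the expected benchmark files are present
data_dir = Path("data")
expected = [
    "action_adv.json", "action_bind.json", "action_mod.json",
    "agent_bind.json", "agent_iden.json", "control.json",
    "coref.json", "pos_caps.json", "sequence.json", "vidsitu_dict.json",
]
missing = [name for name in expected if not (data_dir / name).exists()]
print("Missing annotation files:", missing or "none")
print("Frame entries:", len(list((data_dir / "frames").iterdir())))
print("10s videos:", len(list((data_dir / "videos" / "velociti_videos_10s").iterdir())))
print("4s videos:", len(list((data_dir / "videos" / "velociti_videos_4s").iterdir())))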

You are now ready to browse the provided Jupyter notebook!
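If you only want a quick peek at the annotations outside the notebook, a short sketch like the one below works; it makes no assumptions about the JSON schema beyond the file being standard JSON:

import json
from pathlib import Path

# Load one of the benchmark annotation files and inspect its shape
with open(Path("data") / "pos_caps.json") as f:
    pos_caps = json.load(f)

print(f"pos_caps.json is a {type(pos_caps).__name__} with {len(pos_caps)} entries")
if isinstance(pos_caps, dict):
    first_key = next(iter(pos_caps))
    print("Example entry:", first_key, "->", pos_caps[first_key])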

CLIP Model Evaluations

Environment Setup 🌏

Activate the same environment set up above:

conda activate velo
pip install -r environments/requirements.txt

NOTE 🔔

The ViFi-CLIP model has to be downloaded manually and placed in the .hfcache folder (the default cache directory in the code) at the root of this repository. Specifically, this link may be used.

All other models will be downloaded automatically by the script into the .hfcache directory in the root folder.

Note: CLIP-ViP may be slow to download; the script downloads the model and might take some time to do so. The model is also available here, if required.

If you wish to use a different path for the cache files, modify main_eval.py accordingly, and update the model cache locations mentioned above as well.
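As a quick pre-flight check before launching the evaluation, a sketch along these lines can confirm that the cache directory and the manually placed ViFi-CLIP weights are where the code expects them; the filename vifi_clip.pth is only a hypothetical placeholder for whatever the downloaded checkpoint is called:

import sys
from pathlib import Path

# Default cache directory used by the evaluation code (see main_eval.py)
cache_dir = Path(".hfcache")
if not cache_dir.is_dir():
    sys.exit(".hfcache/ not found; create it in the repository root first.")

# The ViFi-CLIP checkpoint must be placed here manually; the name below is a
# placeholder -- keep whatever filename the downloaded checkpoint has.
vifi_ckpt = cache_dir / "vifi_clip.pth"
if vifi_ckpt.exists():
    print("Found ViFi-CLIP checkpoint:", vifi_ckpt)
else:
    print("Warning: no ViFi-CLIP checkpoint at", vifi_ckpt, "- download it manually.")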

After ensuring the above directory structure, simply run

python main_eval.py --num_workers 4 \
                    --all \
                    --output output \
                    --exhaustive_log \
                    --seed 1000

This will download and run the evaluation on all the models. If you want to evaluate a specific model (say, clip_B_32), run:

python main_eval.py --num_workers 4 \
                    --model clip_B_32 \
                    --output output \
                    --exhaustive_log \
                    --seed 1000

The --exhaustive_log flag saves the output for every sample in the benchmark. If that level of logging is not required, simply run the evaluation without it:

python main_eval.py --num_workers 4 \
                    --all \
                    --output output \
                    --seed 1000

Plausibility Estimation 🧪

VERA

To establish that the captions in VELOCITI indeed require the visual modality and cannot simply be solved by a text-only reasoning model, we evaluate VERA, a plausibility estimation model, and provide the corresponding script. The scripts, along with further instructions, are in the folder VERA/.

Video-LLM Evaluations 📽️

We also provide scripts for evaluating the Video-LLMs used in our work. These scripts may need to be updated to follow changes in their original repositories. Although cloning this repository and running inference on these models from here might work, we recommend cloning each model as a separate project and setting up a separate environment per its documentation.

Video LLaVA

Video LLaVA can be evaluated by running the command below with the test name. Scripts are provided in the folder video_llava/.

python video_llava_eval.py --test ivat \
                           --output output \
                           --seed 1000

<hr>

PLLaVA

Please refer to the folder pllava/

mPLUG-Owl-Video and Owl-Con Evaluation

Please check the folder videocon/

Gemini 1.5 Flash

Please check the folder gemini/

BibTeX

If you find our work useful, please cite as below

@article{velociti,
        title={VELOCITI: Can Video-Language Models Bind Semantic Concepts Through Time?},
        author={Saravanan, Darshana and Singh, Darshan and Gupta, Varun and Khan, Zeeshan and Gandhi, Vineet and Tapaswi, Makarand},
        journal={arXiv:2406.10889},
        year={2024}
    }

CC BY-NC-SA 4.0

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
