⏰ Test of Time: Instilling Video-Language Models with a Sense of Time

Code for our CVPR 2023 paper on instilling a sense of time in video-language models.

<a href="https://bpiyush.github.io/testoftime-website/">[Project Page]</a> | <a href="https://arxiv.org/abs/2301.02074">[Arxiv]</a> | <a href="#📚-datasets">[Data]</a> | <a href="https://github.com/bpiyush/TestOfTime">[Code]</a> | <a href="https://www.youtube.com/watch?v=RTRRdrA5H88">[CVPR 2023 Video]</a> | <a href="https://bpiyush.github.io/testoftime-website/media/testoftime-v1.2-3.pdf">[CVPR 2023 Poster]</a>

<img src="media/tact-slow.gif" width="600">

Brief Overview
Updates
Installation & Setup
Datasets
Models
Post-pretraining: TACT
Evaluation: TACT
Evaluation: Downstream Tasks
Citation
Acknowledgements

🔭 Brief Overview

We show that existing video-language models struggle to associate time order in video and language through a controlled experiment on synthetic data.
Based on VideoCLIP, we propose TACT (Temporal Adaptation by Consistent Time-ordering), a method for temporal adaptation using this time order consistency without having to pretrain from scratch.
We demonstrate improved zero-shot generalizability of our temporally adapted models on tasks that require higher time awareness.

📅 Updates

24th March 2023: Code released.
11th June 2024: On our synthetic benchmark, Video-LLAMA achieves an impressive 88.33% accuracy and TimeChat achieves 76.67%. We will continue to add evaluations of more recent LLM-based models on this benchmark. We have also added a benchmark on https://paperswithcode.com/.

🚀 Installation & Setup

Create a conda environment and install packages as described in setup/env.md. We recommend running python setup/check_packages.py to check if all packages are installed correctly.

📚 Datasets

We use a combination of synthetic and real datasets to evaluate our approach. Below, you can find instructions to download and prepare the datasets. Here, we present instructions for our Synthetic dataset and the TEMPO-TL dataset.

For each dataset, we provide a .zip file that contains (i) train-test splits and (ii) S3D features for video (at 1 FPS) that serve as input to the VideoCLIP model. Use the following command to download all datasets:

```
bash setup/download_datasets.sh /path/to/datasets/
```

Pass the path to the folder where you want to store the datasets (e.g., ./all_data/).

Synthetic data

We create simple synthetic video-language pairs by stitching together a pair of events (e.g., "a red circle appears" and "a yellow circle appears") with text description connected by before/after relations. An example is shown here:

<img src="media/synthetic-data-v3.gif" width="500">
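The stitching idea described above can be sketched in a few lines. This is only an illustration and not the code used to build the released dataset: the `Event` class, its `seconds` field, and the `stitch` function are hypothetical names introduced here, and the caption templates are assumed from the before/after description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    caption: str   # e.g. "a red circle appears"
    seconds: int   # clip length; a hypothetical field for this sketch


def stitch(first: Event, second: Event, relation: str) -> dict:
    """Pair a video where `first` precedes `second` with a matching caption."""
    if relation == "before":
        text = f"{first.caption} before {second.caption}"
    elif relation == "after":
        text = f"{second.caption} after {first.caption}"
    else:
        raise ValueError(f"unknown relation: {relation!r}")
    # (caption, start, end) for each event on the stitched timeline.
    timeline = [
        (first.caption, 0, first.seconds),
        (second.caption, first.seconds, first.seconds + second.seconds),
    ]
    return {"text": text, "timeline": timeline}


red = Event("a red circle appears", 3)
yellow = Event("a yellow circle appears", 2)
for relation in ("before", "after"):
    print(stitch(red, yellow, relation)["text"])
```

Both captions describe the same video order (red first, then yellow); only the connecting word and the order of the two clauses in the text change.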

TEMPO-TL dataset

As a real dataset, we consider the TEMPO-TL dataset that similarly stitches together a pair of events in text for clips in the same video.

<img src="media/tempo-data-v1.gif" width="500">

New datasets: To evaluate our approach on other (new) datasets, first generate and save S3D video features. See this for an open-source feature extractor. Then create splits and a dataset object in package/datasets/. Please see package/datasets/tempo.py for reference.
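As a rough starting point for the "create splits" step, the sketch below produces a reproducible train/test split over video IDs. It is an assumption-laden example, not the procedure used for the released splits: `make_splits`, its parameters, and the ID format are hypothetical, and the split file format expected by the dataset classes should be taken from package/datasets/tempo.py.

```python
import random


def make_splits(video_ids, test_fraction=0.2, seed=0):
    """Return a deterministic train/test partition of unique video IDs."""
    ids = sorted(set(video_ids))          # sort first so the result is reproducible
    rng = random.Random(seed)             # local RNG; does not touch global state
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return {"train": ids[n_test:], "test": ids[:n_test]}


# Hypothetical video IDs, for illustration only.
splits = make_splits([f"video_{i:03d}" for i in range(10)])
print(len(splits["train"]), len(splits["test"]))
```

Splitting at the video level (rather than per clip) keeps clips from the same video out of both train and test at once.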

🤖 Models

We base our experiments on the VideoCLIP model from FAIR. Instructions in setup/env.md include download of relevant checkpoints for VideoCLIP.

Checkpoint zoo: Here, we provide checkpoints for TACT adapted VideoCLIP models post-pretrained on (i) TEMPO-TL, (ii) ActivityNet, (iii) Charades, (iv) Charades-Ego.

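| Post-pretraining Dataset | Hyperparameter $\alpha_{\text{same}}$ | Hyperparameter $\alpha_{\text{cross}}$ | Hyperparameter $\beta$ | Download link |
|---|---|---|---|---|
| TEMPO-TL | 1.0 | 1.0 | 1.0 | Link |
| ActivityNet | 1.0 | 1.0 | 0.0 | Link |
| Charades | 1.0 | 1.0 | 0.0 | Link |
| Charades-Ego | 1.0 | 1.0 | 1.0 | Link |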

To download all checkpoints in one go, run:

```
bash setup/download_checkpoints.sh /path/to/checkpoints/
```

Pass the path to the folder where you want to store the checkpoints (e.g., ./all_checkpoints/).

🏋️‍♀️ Post-pretraining: TACT

Post-pretraining on TEMPO-TL dataset
```
python postpretrain.py --dataset tempo --eval_subset temporal_1k --no_wandb --data_root /ssd/pbagad/datasets/ --only_train
```
Replace --data_root with the path to where all your datasets are stored. Make sure to change the entity and project arguments in postpretrain.py to log to your own wandb account.

📊 Evaluation: TACT

Evaluate on `TEMPO` dataset

Pre-trained VideoCLIP

```
python postpretrain.py --dataset tempo --eval_subset temporal_1k --eval_split test --only_eval --no_wandb --data_root /ssd/pbagad/datasets/
```

Replace --data_root with the path to where all your datasets are stored. This should yield about 52% accuracy.

TACT post-pretrained VideoCLIP

```
ckpt=/path/to/tact/checkpoint/trained/on/TEMPO/
# For example, ckpt=./all_checkpoints/tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt
python postpretrain.py --dataset tempo --eval_subset temporal_1k --eval_split test --only_eval --no_wandb --data_root /ssd/pbagad/datasets/ -c $ckpt
```

Replace --data_root with the path to where all your datasets are stored. This should yield about 66% accuracy.

The detailed results on more datasets are provided in the paper and also shown below.

Evaluate on `Synthetic` dataset

TACT post-pretrained (on TEMPO)

```
ckpt=/path/to/tact/checkpoint/trained/on/TEMPO/
# For example, ckpt=./all_checkpoints/tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt
python postpretrain.py --dataset synthetic --eval_subset v2.0 --eval_split test --only_eval --no_wandb --data_root /ssd/pbagad/datasets/ -c $ckpt --gpus 0
```

Replace --data_root with the path to where all your datasets are stored. This should yield about 65% accuracy. Note that since this is a tiny evaluation set, using multiple GPUs will lead to incorrect accuracies because results are aggregated across GPUs.
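To see why aggregation across GPUs can distort accuracy on a tiny evaluation set, consider the generic sketch below. It shows one common failure mode (averaging per-GPU accuracies over uneven shards) with made-up numbers; it is not taken from this repository, and the exact mechanism in postpretrain.py may differ.

```python
def accuracy(correct_flags):
    """Fraction of correct predictions; 1 = correct, 0 = wrong."""
    return sum(correct_flags) / len(correct_flags)


def pooled_accuracy(shards):
    """Accuracy over all samples, as a single-GPU run would compute it."""
    return accuracy([flag for shard in shards for flag in shard])


def mean_of_shard_accuracies(shards):
    """Naive aggregation: average the accuracy reported by each GPU."""
    return sum(accuracy(shard) for shard in shards) / len(shards)


# Five hypothetical samples split unevenly across two GPUs.
shards = [[1, 1, 1, 1], [0]]
pooled = pooled_accuracy(shards)             # 4 correct out of 5
averaged = mean_of_shard_accuracies(shards)  # (1.0 + 0.0) / 2
print(pooled, averaged)
```

The smaller the evaluation set, the larger the gap between the two numbers can be, which is why running on a single GPU is the safer choice here.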

TACT post-pretrained (on Charades-Ego)

```
ckpt=/path/to/tact/checkpoint/trained/on/Charades-Ego/
# For example, ckpt=./all_checkpoints/charadesego-hparams_1.0_1.0_1.0-epoch\=2-step\=3639.ckpt
python postpretrain.py --dataset synthetic --eval_subset v2.0 --eval_split test --only_eval --no_wandb --data_root /ssd/pbagad/datasets/ -c $ckpt --gpus 0
```

Replace --data_root with the path to where all your datasets are stored. This should yield about 85% accuracy.

📊 Evaluation: Downstream Tasks

To illustrate zero-shot performance of our TACT adapted model on a downstream task, we provide code to run the following evaluations.

Video Question Answering on `AGQA`

Here, we evaluate VideoQA on a subset of the AGQA dataset.

An example instance from the AGQA dataset is shown below:

<img src="media/agqa-sample-v2.jpg" width="600">

Note that, to run this, you need the pre-computed S3D features for the AGQA dataset.

Pre-trained VideoCLIP
```
python downstream_zeroshot.py --data_root /ssd/pbagad/datasets/ --dataset agqa --task videoqa --no_save
```
Replace --data_root with the path to where all your datasets are stored. This should yield about 49.9% accuracy.

TACT post-pretrained VideoCLIP

```
ckpt=/path/to/tact/checkpoint/trained/on/TEMPO/
# For example, ckpt=./all_checkpoints/tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt
python downstream_zeroshot.py --data_root /ssd/pbagad/datasets/ --dataset agqa --task videoqa --no_save -c $ckpt
```

Replace --data_root with the path to where all your datasets are stored. This should yield about 57.1% accuracy.

Action Retrieval on `SSv2`

Here, we evaluate Action Retrieval on a subset of the SSv2 dataset.

An example instance from the SSv2 dataset is shown below:

Note that, to run this, you need the pre-computed S3D features for the SSv2 dataset.

Pre-trained VideoCLIP

```
python downstream_zeroshot.py --data_root /ssd/pbagad/datasets/ --dataset ssv2 --task action_retrieval --no_save --split "validation-tmpl-ret-singularity"
```

Replace --data_root with the path to where all your datasets are stored. This should yield about 3.4% mAP (metric_t2v_mAP).

TACT post-pretrained VideoCLIP

```
ckpt=/path/to/tact/checkpoint/trained/on/TEMPO/
# For example, ckpt=./all_checkpoints/tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt
python downstream_zeroshot.py --data_root /ssd/pbagad/datasets/ --dataset ssv2 --task action_retrieval --no_save --split "validation-tmpl-ret-singularity" -c $ckpt
```

Replace --data_root with the path to where all your datasets are stored. This should yield about 4.2% mAP (metric_t2v_mAP).
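For readers unfamiliar with the metric, the sketch below shows the textbook definition of mean average precision for text-to-video retrieval. It is a generic illustration, not the repository's implementation of metric_t2v_mAP; the function names, query strings, and video IDs are made up for the example.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one text query: mean of precision@k at each relevant hit."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, video_id in enumerate(ranked_ids, start=1):
        if video_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Average AP over all text queries."""
    aps = [average_precision(rankings[q], relevance[q]) for q in rankings]
    return sum(aps) / len(aps)


# Videos ranked by similarity for each (hypothetical) action query.
rankings = {
    "pushing something": ["v1", "v2", "v3"],
    "pulling something": ["v3", "v1", "v2"],
}
relevance = {
    "pushing something": {"v1", "v3"},
    "pulling something": {"v2"},
}
print(round(mean_average_precision(rankings, relevance), 4))
```

Here the first query scores (1/1 + 2/3) / 2 and the second scores 1/3, so the mean is 7/12.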

The detailed results on more datasets/tasks are provided in the paper and also shown below.

📖 Citation

If you found our work useful or relevant, please consider citing our paper:

```
@inproceedings{bagad2023testoftime,
      title={{T}est of {T}ime: {I}nstilling {V}ideo-{L}anguage {M}odels with a {S}ense of {T}ime},
      author={Bagad, Piyush and Tapaswi, Makarand and Snoek, Cees G. M.},
      booktitle={CVPR},
      year={2023}
}
```

🙏 Acknowledgements

We acknowledge support from the ELLIS Amsterdam Unit and the AMS Scholarship to Piyush as a Master's student.
We also thank Dr. Dennis Koelma for regular help with compute infrastructure and hosting of data and models, and Dr. Hazel Doughty for useful discussions.
We also acknowledge all relevant prior work, particularly VideoCLIP and TEMPO, for making their code and data publicly available.

Additional Notes

:warning: Infra note: Our code has been run on a single node with 4 GPUs (either NVIDIA RTX A5000 or NVIDIA GeForce 1080). Running it on different infrastructures may cause differences in results. However, the trends and inferences should be similar (e.g., post-pretraining helps with temporal ordering task, etc.).

💡: If you have any issues or suggestions, feel free to open an issue or contact us via email.

Closely Related Work

Please also consider looking at the following related papers:

Wu et al, Audio-Text Models Do Not Yet Leverage Natural Language. Like us, they too check if models capture event ordering, albeit for audio-text models.
Yuksekgonul et al, When and why vision-language models behave like bags-of-words, and what to do about it?, ICLR 2023. They test image-language models for understanding of object properties, relational understanding and order sensitivity.
Hazra et al, EgoTV : Egocentric Task Verification from Natural Language Task Descriptions, ArXiv 2023. They propose a synthetic benchmark of procedural tasks where there is an order between the subtasks, e.g., apple is heated, then, it is cleaned.
Xu et al, Don’t Pour Cereal into Coffee: Differentiable Temporal Logic for Temporal Action Segmentation, NeurIPS 2022. They propose use of temporal logic to apply declarative temporal constraints to the output of deep networks.
Xie et al, Enhance Temporal Relations in Audio Captioning with Sound Event Detection, ArXiv 2023. This paper aims to better capture temporal relationships in caption generation with sound event detection (SED), a task that locates events’ timestamps.
