⏰ Test of Time: Instilling Video-Language Models with a Sense of Time

Code for our CVPR 2023 paper on instilling a sense of time in video-language models.

<p align="center"> <a href="https://bpiyush.github.io/testoftime-website/">[Project Page]</a> &nbsp; | &nbsp; <a href="https://arxiv.org/abs/2301.02074">[Arxiv]</a> &nbsp; | &nbsp; <a href="#📚-datasets">[Data]</a> &nbsp; | &nbsp; <a href="https://github.com/bpiyush/TestOfTime"> [Code]</a> &nbsp; | &nbsp; <a href="https://www.youtube.com/watch?v=RTRRdrA5H88"> [CVPR 2023 Video]</a> &nbsp; | &nbsp; <a href="https://bpiyush.github.io/testoftime-website/media/testoftime-v1.2-3.pdf"> [CVPR 2023 Poster]</a> </p> <!-- <p align="center"> <img src="https://user-images.githubusercontent.com/19412343/225776400-0abb7dad-320f-497f-b578-22efc86a59d5.gif" width="600"> </p> --> <p align="center"> <img src="media/tact-slow.gif" width="600"> </p>

Table of Contents

🔭 Brief Overview

📅 Updates

🚀 Installation & Setup

Create a conda environment and install packages as described in setup/env.md. We recommend running python setup/check_packages.py to check if all packages are installed correctly.
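
If you prefer a quick manual check, a snippet along these lines also works; note that the package list below is an assumption, and setup/check_packages.py remains the authoritative check:

```python
# Minimal sanity check that key dependencies import; the list below is an assumed
# subset of the real requirements described in setup/env.md.
import importlib

required = ["torch", "numpy", "tqdm"]

for name in required:
    try:
        module = importlib.import_module(name)
        print(f"[OK]      {name} {getattr(module, '__version__', '(version unknown)')}")
    except ImportError:
        print(f"[MISSING] {name}")
```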

📚 Datasets

We use a combination of synthetic and real datasets to evaluate our approach. Below, we provide instructions to download and prepare our synthetic dataset and the TEMPO-TL dataset.

For each dataset, we provide a .zip file that contains (i) train-test splits and (ii) S3D video features (extracted at 1 FPS) that serve as input to the VideoCLIP model. Use the following to download all datasets:

bash setup/download_datasets.sh /path/to/datasets/

Pass the path to the folder where you want to store the datasets (e.g., ./all_data/).

Synthetic data

We create simple synthetic video-language pairs by stitching together a pair of events (e.g., "a <span style="color:red">red</span> circle appears" and "a <span style="color:yellow">yellow</span> circle appears"), with a text description that connects the two by a before/after relation. An example is shown here:

<!-- ![Synthetic data](media/synthetic-data-v3.gif) --> <p align="center"> <img src="media/synthetic-data-v3.gif" width="500"> </p>
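
To make the construction concrete, the sketch below stitches two event captions into a temporally ordered description and its time-reversed counterpart; the helper is illustrative and not the generation code used in this repo:

```python
# Illustrative caption stitching for the synthetic data; not the repo's actual generator.

def stitch_captions(event_a: str, event_b: str, relation: str = "before") -> str:
    """Join two event descriptions with a temporal connective."""
    return f"{event_a} {relation} {event_b}"


event_a = "a red circle appears"
event_b = "a yellow circle appears"

# Caption consistent with the video (A happens first, then B) ...
positive = stitch_captions(event_a, event_b)
# ... and its time-reversed counterpart, which can act as a hard negative.
negative = stitch_captions(event_b, event_a)

print(positive)  # a red circle appears before a yellow circle appears
print(negative)  # a yellow circle appears before a red circle appears
```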

TEMPO-TL dataset

As a real dataset, we consider the TEMPO-TL dataset, which similarly stitches together a pair of events in the text for clips from the same video.

<!-- ![TEMPO-TL data](media/tempo-data-v1.gif) --> <p align="center"> <img src="media/tempo-data-v1.gif" width="500"> </p>

New datasets: To evaluate our approach on other (new) datasets, you first need to generate and save S3D video features; see this for an open-source feature extractor. Then, create train-test splits and add a dataset object in package/datasets/ (see package/datasets/tempo.py for reference, and the sketch below for a rough outline).
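
The skeleton below is a hypothetical outline of such a dataset object; the file layout and attribute names are assumptions, and package/datasets/tempo.py is the authoritative reference:

```python
# Hypothetical skeleton for a new dataset; the real interface in package/datasets/ may differ.
import json
import os

import numpy as np


class MyNewDataset:
    """Serves (pre-extracted S3D feature, caption) pairs for one split."""

    def __init__(self, data_root: str, split: str = "train"):
        # Assumed layout: {data_root}/splits/{split}.json and {data_root}/s3d_features/{video_id}.npy
        with open(os.path.join(data_root, "splits", f"{split}.json")) as f:
            self.samples = json.load(f)  # e.g., a list of {"video_id": ..., "caption": ...}
        self.feature_dir = os.path.join(data_root, "s3d_features")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        sample = self.samples[index]
        features = np.load(os.path.join(self.feature_dir, sample["video_id"] + ".npy"))
        return {"video_features": features, "caption": sample["caption"]}
```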

🤖 Models

We base our experiments on the VideoCLIP model from FAIR. The instructions in setup/env.md include downloading the relevant VideoCLIP checkpoints.

Checkpoint zoo: Here, we provide checkpoints for TACT-adapted VideoCLIP models post-pretrained on (i) TEMPO-TL, (ii) ActivityNet, (iii) Charades, and (iv) Charades-Ego.

| Post-pretraining Dataset | $\alpha_{\text{same}}$ | $\alpha_{\text{cross}}$ | $\beta$ | Download link |
| --- | --- | --- | --- | --- |
| TEMPO-TL | 1.0 | 1.0 | 1.0 | Link |
| ActivityNet | 1.0 | 1.0 | 0.0 | Link |
| Charades | 1.0 | 1.0 | 0.0 | Link |
| Charades-Ego | 1.0 | 1.0 | 1.0 | Link |
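
As a rough guide to how these weights enter post-pretraining, the sketch below shows one way per-term losses could be combined; the exact definition of each term (and the precise composition) is given in the paper, so treat the term names here as assumptions:

```python
# Illustrative weighted combination of loss terms only; see the paper for the actual TACT objective.
import torch


def combine_losses(loss_same: torch.Tensor, loss_cross: torch.Tensor, loss_base: torch.Tensor,
                   alpha_same: float = 1.0, alpha_cross: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Weighted sum of (assumed) same-video, cross-video, and base contrastive terms."""
    return alpha_same * loss_same + alpha_cross * loss_cross + beta * loss_base
```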

To download all checkpoints in one go, run:

bash setup/download_checkpoints.sh /path/to/checkpoints/

Pass the path to the folder where you want to store the checkpoints (e.g., ./all_checkpoints/).
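
Assuming the checkpoints load as regular PyTorch files, a quick sanity check after downloading can look like this (the file name below is a placeholder, not the actual checkpoint name):

```python
# Quick inspection of a downloaded checkpoint; the path is a placeholder.
import torch

checkpoint = torch.load("all_checkpoints/tact_tempo.pt", map_location="cpu")
print(type(checkpoint))
if isinstance(checkpoint, dict):
    # List a few top-level keys (e.g., model weights, training state).
    print(list(checkpoint.keys())[:10])
```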

🏋️‍♀️ Post-pretraining: TACT

📊 Evaluation: TACT

Evaluate on TEMPO dataset

The detailed results on more datasets are provided in the paper and also shown below.

<p align="center"> <img src="media/results-tact-v1.png" width="400"> </p>
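
This evaluation (and the synthetic one below) tests temporal ordering: given a video of stitched events, the model should score the correctly ordered caption higher than its time-reversed counterpart. A minimal sketch of that accuracy computation, assuming the caption scores have already been computed with the model:

```python
# Illustrative temporal-ordering accuracy; the repo's evaluation scripts are the reference.
import numpy as np


def temporal_ordering_accuracy(scores: np.ndarray) -> float:
    """
    scores: array of shape (N, 2); column 0 holds the score of the correctly ordered
    caption and column 1 the score of its time-reversed counterpart.
    """
    return float((scores[:, 0] > scores[:, 1]).mean())


# Toy example with 3 videos.
print(temporal_ordering_accuracy(np.array([[0.8, 0.3], [0.6, 0.7], [0.9, 0.2]])))  # ~0.67
```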

Evaluate on Synthetic dataset

📊 Evaluation: Downstream Tasks

To illustrate the zero-shot performance of our TACT-adapted model on downstream tasks, we provide code to run the following evaluations.

Video Question Answering on AGQA

Here, we evaluate VideoQA on a subset of the AGQA dataset.

An example instance from the AGQA dataset is shown below:

<!-- ![AGQA data](media/agqa-sample-v2.jpg) --> <p align="center"> <img src="media/agqa-sample-v2.jpg" width="600"> </p>

Note that, to run this, you need the pre-computed S3D features for the AGQA dataset.
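
Zero-shot VideoQA with a video-text model of this kind can be cast as scoring each candidate answer against the video; the sketch below illustrates the idea with placeholder embedding functions rather than this repo's actual API:

```python
# Illustrative zero-shot QA by answer scoring; embeddings are placeholders, not this repo's API.
import numpy as np


def answer_by_similarity(video_embedding, question, candidate_answers, embed_text):
    """Pick the candidate answer whose (question + answer) text scores highest against the video."""
    scores = [float(video_embedding @ embed_text(f"{question} {answer}")) for answer in candidate_answers]
    return candidate_answers[int(np.argmax(scores))]


# Toy usage with random vectors standing in for a real video-language model.
rng = np.random.default_rng(0)
video_embedding = rng.normal(size=512)
embed_text = lambda text: rng.normal(size=512)
print(answer_by_similarity(video_embedding, "what did the person do after opening the door?",
                           ["sat down", "closed the door"], embed_text))
```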

Action Retrieval on SSv2

Here, we evaluate Action Retrieval on a subset of the SSv2 dataset.

An example instance from the SSv2 dataset is shown below:

<p align="center"> <img src="media/ssv2-example-v1.jpg" width="600"> </p>

Note that, to run this, you need the pre-computed S3D features for the SSv2 dataset.
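
Action retrieval on SSv2 is likewise a ranking problem: embed the action label texts, rank them by similarity to each video, and report recall@k. A minimal sketch of the metric, assuming the video-to-text similarity matrix has already been computed:

```python
# Illustrative recall@k for ranking action texts against videos.
import numpy as np


def recall_at_k(similarity: np.ndarray, ground_truth: np.ndarray, k: int = 5) -> float:
    """
    similarity: (num_videos, num_texts) score matrix.
    ground_truth: (num_videos,) index of the correct text for each video.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    return float((top_k == ground_truth[:, None]).any(axis=1).mean())
```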

The detailed results on more datasets/tasks are provided in the paper and also shown below.

<p align="center"> <img src="media/results-downstream-v1.png" width="800"> </p>

📖 Citation

If you found our work useful or relevant, please consider citing our paper:

@inproceedings{
      bagad2023testoftime,
      title={{T}est of {T}ime: {I}nstilling {V}ideo-{L}anguage {M}odels with a {S}ense of {T}ime},
      author={Bagad, Piyush and Tapaswi, Makarand and Snoek, Cees G. M.},
      booktitle={CVPR},
      year={2023}
}

🙏 Acknowledgements

Additional Notes

:warning: Infra note: Our code has been run on a single node with 4 GPUs (either NVIDIA RTX A5000 or NVIDIA GeForce 1080). Running it on different infrastructure may cause differences in results; however, the trends and conclusions should be similar (e.g., post-pretraining helps with the temporal ordering task).

💡: If you have any issues or suggestions, feel free to open an issue or contact us via email.

Closely Related Work

Please also consider looking at the following related papers: