TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou
:rocket: News
- (Nov 11, 2023)
- Uploaded the 32-frame fine-tuned checkpoint for paragraph-to-video retrieval.
- (Oct 29, 2023)
- Released code for video pre-training, video QA, and video-paragraph retrieval.
- Released the checkpoint of the pre-trained TESTA-base model.
- (Oct 8, 2023)
- Our paper was accepted to EMNLP 2023 (Findings).
Highlights
Main Contributions
- We introduce an efficient method named TESTA (TEmporal-Spatial Token Aggregation) for long-form video understanding. TESTA progressively aggregates similar visual tokens during video encoding, reducing the number of visual tokens by 75% and thus accelerating video encoding (see the sketch after this list).
- Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block.
- Experimental results on five datasets for paragraph-to-video retrieval and long-form VideoQA show that TESTA improves computing efficiency by 1.7x and achieves significant performance gains thanks to its scalability to longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
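At its core, the aggregation resembles ToMe-style bipartite token merging applied inside each video encoder block. The PyTorch snippet below is a simplified, illustrative sketch of that idea; the function name and the alternating split are our own simplifications for exposition, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Reduce a token sequence by r tokens via bipartite similarity matching.

    x: (batch, num_tokens, dim) token features from one encoder block.
    r: number of tokens to merge away.
    Simplified ToMe-style sketch, not the exact TESTA module.
    """
    metric = F.normalize(x, dim=-1)                  # cosine-similarity space
    a, b = metric[:, 0::2], metric[:, 1::2]          # split tokens into two alternating sets
    scores = a @ b.transpose(-1, -2)                 # pairwise similarity between the two sets

    node_max, node_idx = scores.max(dim=-1)          # best match in set B for every token in set A
    src_idx = node_max.argsort(dim=-1, descending=True)[..., :r]  # r most mergeable tokens in A
    dst_idx = node_idx.gather(-1, src_idx)           # their merge targets in B

    xa, xb = x[:, 0::2], x[:, 1::2]
    dim = x.size(-1)
    src = xa.gather(1, src_idx.unsqueeze(-1).expand(-1, -1, dim))
    # fold each selected token into its target by averaging
    xb = xb.scatter_reduce(1, dst_idx.unsqueeze(-1).expand(-1, -1, dim),
                           src, reduce="mean", include_self=True)

    keep = torch.ones(xa.shape[:2], dtype=torch.bool, device=x.device)
    keep = keep.scatter(1, src_idx, torch.zeros_like(src_idx, dtype=torch.bool))  # drop merged tokens
    xa = xa[keep].view(x.size(0), -1, dim)
    return torch.cat([xa, xb], dim=1)                # (batch, num_tokens - r, dim)
```

In the full model, this kind of aggregation is applied in a divided fashion: across frames for temporal aggregation and within each frame for spatial aggregation.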
Currently, the repository contains the code for pre-training a general-purpose video-language model and fine-tuning it on downstream video understanding tasks, including video-paragraph retrieval and VideoQA.
Installation
To install the dependencies, run
# create
conda env create -f environment.yml
# activate
conda activate testa
Data preparation
Please follow the instructions at DATASETS.md to prepare all datasets.
Models
Pre-trained model
Zero-shot performance on paragraph-to-video retrieval:
Model | frames | QuerYD R@1 | DiDeMo R@1 | ActivityNet Caption R@1 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 64.4 | 64.9 | 37.1 | 786 | testa_model_base_pretrain.pth |
Fine-tuned model
QuerYD paragraph-to-video retrieval
Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 77.0 | 90.8 | 92.6 | 420 | testa_model_base_queryd_f32_f1p12.pth |
ActivityNet paragraph-to-video retrieval
Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 51.6 | 79.1 | 88.3 | 420 | testa_model_base_anet_f32_f1p12.pth |
DiDeMo paragraph-to-video retrieval
Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 57.7 | 83.3 | 89.4 | 420 | testa_model_base_didemo_f32_f1p12.pth |
CondensedMovie paragraph-to-video retrieval
Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 21.5 | 42.4 | 50.7 | 420 | testa_model_base_cm_f32_f1p12.pth |
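After downloading a checkpoint from the tables above, a quick sanity check can be done with plain PyTorch before running the full pipeline. This is a minimal sketch: the "model" key is an assumption about how the weights are wrapped, and RUN.md documents the supported loading and evaluation path.

```python
import torch

# load on CPU so the file can be inspected without a GPU
ckpt = torch.load("testa_model_base_pretrain.pth", map_location="cpu")

# assumption: the weights may be nested under a "model" key; otherwise use the loaded dict directly
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

num_params = sum(v.numel() for v in state_dict.values() if torch.is_tensor(v))
print(f"{len(state_dict)} entries, {num_params / 1e6:.1f}M parameters")
```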
Training and Evaluation
Please refer to RUN.md for detailed instructions on training, evaluation, and reproducing the results.
Todo list
- Upload fine-tuned checkpoints
- Add visualization code
- Add demos
Contact
If you have any questions, please feel free to create an issue on this repository.
Citation
If you find this code useful for your research, please consider citing:
@article{Ren2023TESTA,
title={TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding},
author={Shuhuai Ren and Sishuo Chen and Shicheng Li and Xu Sun and Lu Hou},
journal={ArXiv},
year={2023},
volume={abs/2310.19060},
}
Acknowledgement
The codebase relies on resources from BLIP, ToMe, and TimeSformer. We thank the original authors for open-sourcing their work.