<div align="center">γAAAI'2024 π₯γDGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
</div>The official implementation of AAAI24 paper DGL:Dynamic Global-Local Prompt Tuning for Text-Video Retrieval. With only training 0.83 MB parameters, we can surpass fully finetuning/PEFL methods in Text2Video Retrieval.
## Citation
If you find this paper useful, please consider starring this repo and citing our paper:
```bibtex
@inproceedings{yang2024dgl,
  title={DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval},
  author={Yang, Xiangpeng and Zhu, Linchao and Wang, Xiaohan and Yang, Yi},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={6540--6548},
  year={2024}
}
```
## Table of Contents

<!--ts-->
<!--te-->

## Updates
- Oct 14, 2024: Updated the code for QB-Norm and visualization.
- Feb 15, 2024: Released the code of DGL.
## Overview
<div align="justify"> Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) The visual encoder could only encode frame-level features and failed to extract global-level general video information. (2) Equipping the visual and text encoder with separated prompts failed to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video in a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. <p align="center"> <img src="assets/figure1_weight.png" width="450" height="225"/> <img src="assets/motivation_plot.png" width="340" height="225"/> </p>π Method
<div align="center"> <img src="assets/figure2.png" width="800px" /> </div>π Visualization
## Visualization

DGL can extract global information (bottom) and temporal dynamics (top):
<div align="center"> <img src="assets/figure5.png" width="700" height="auto"> </div> <details> <summary><b>More examples for global information and temporal dynamics </b></summary>global information
<div align=center> <img src="assets/example_global_information.jpg" width="800px"> </div>temporal dynamics
<div align=center> <img src="assets/example_temporal_dynamics.jpg" width="800px"> </div> </details>Since the visualization code need to cache the global prompt on frame weights and we provide another code project for visualization, the full code is provided at visualization code
```bash
# unzip the code
# then set the pretrained weight path (model_dir in msrvtt.sh)
python main.py
```
## Quick Start
### Setup

Set up the conda environment:

```bash
conda env create -f environment.yml
```
### Download the CLIP Model
Download the CLIP pre-trained weights and place them in `${HOME}/models/pretrained`.

```bash
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
```
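Optionally, you can sanity-check the download. This is just a sketch, assuming the file was saved to `~/models/pretrained/ViT-B-32.pt`; the OpenAI CLIP checkpoints are TorchScript archives, so they load with `torch.jit.load`:

```python
import os
import torch

# Optional sanity check: OpenAI CLIP checkpoints are TorchScript archives.
ckpt_path = os.path.expanduser("~/models/pretrained/ViT-B-32.pt")
model = torch.jit.load(ckpt_path, map_location="cpu")
print(sum(p.numel() for p in model.parameters()))  # roughly 151M parameters for ViT-B/32
```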
### Download Datasets

**MSR-VTT.** Download the splits and captions from CLIP4Clip:
```bash
wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip
```
Download the videos from Frozen-in-Time:
```bash
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
```
### Prepare Data

Video preprocessing is handled by `preprocess/compress_video.py`:
```bash
python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]
```
This script compresses each video to 3 fps with a width of 224 (or a height of 224). Modify the variables in the script to customize the output.
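For reference, the following standalone sketch performs roughly the same preprocessing. It is not the repo script: the `compress_one` helper is hypothetical and assumes `ffmpeg` is available on your PATH.

```python
import os
import subprocess

def compress_one(input_path: str, output_path: str, fps: int = 3, short_side: int = 224) -> None:
    """Re-encode a video at `fps` frames per second with its shorter side scaled
    to `short_side` pixels, preserving aspect ratio. Requires ffmpeg on PATH."""
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    # If the clip is landscape, fix the height at `short_side` and let the width
    # follow (-2 keeps it divisible by 2); otherwise fix the width instead.
    vf = f"scale='if(gt(iw,ih),-2,{short_side})':'if(gt(iw,ih),{short_side},-2)'"
    subprocess.run(
        ["ffmpeg", "-y", "-i", input_path, "-r", str(fps), "-vf", vf, output_path],
        check=True,
    )

# Example:
# compress_one("raw_videos/video0.mp4", "compressed_videos/video0.mp4")
```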
### Test

#### Model Zoo
Note that, due to hardware differences, the results may differ slightly. We tested the performance on an A100 GPU, obtaining T2V/V2T R@1 of 45.8/43.5 (log), and on an A6000 GPU, obtaining T2V/V2T R@1 of 45.4/44.1 (log).
You can also adapt only the global-local video attention to BLIP: following the implementation of tokenmix, you can reach T2V/V2T R@1 of 48.9/49.0 (log).
<div align=center>

| Checkpoint | CLIP | Shared Latent Space | Google Cloud |
|---|---|---|---|
| MSR-VTT | ViT-B/32 | Transformer | Download |
| MSR-VTT | ViT-B/16 | Transformer | Download |
| VATEX | ViT-B/32 | Linear | Download |
| LSMDC | ViT-B/32 | Linear | Download |
| ActivityNet | ViT-B/32 | Transformer | Download |

</div>
```bash
# Evaluate on MSR-VTT
# set the following before running:
do_train=0
do_eval=1
shared_latent_space=transformer/linear   # choose one: transformer or linear
resume='path of ckpt.best.pth.tar'
bash scripts/msrvtt.sh
```
#### Search for the best performance

Prepare the similarity matrix and the train-test t2v and v2t matrices, then search for your best T2V/V2T R@1:
```bash
# Search for the best performance using QB-Norm
# place the prepared similarity matrices in the folder, i.e.
# msrvtt_vit16_sim_matrix.npy, msrvtt_vit16_train_test_t2v.npy, msrvtt_vit16_train_test_v2t.npy
python search_for_best_r1_with_qb_norm.py
```
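For intuition, QB-Norm re-normalizes test-time similarities using a query bank (here, train-caption-to-test-video similarities). The sketch below is a simplified inverted-softmax variant, not the repo's `search_for_best_r1_with_qb_norm.py`; the `beta` value, the file names (following the pattern above), and the diagonal ground-truth assumption are illustrative.

```python
import numpy as np

def inverted_softmax_norm(test_sims, querybank_sims, beta=20.0):
    """Simplified QB-Norm-style normalization (inverted softmax).

    test_sims:      (num_test_queries, num_gallery) test similarity matrix
    querybank_sims: (num_bank_queries, num_gallery) similarities between a
                    query bank (e.g. training captions) and the test gallery
    """
    # Per-gallery-item normalizer estimated from the query bank.
    normalizer = np.exp(beta * querybank_sims).sum(axis=0, keepdims=True)  # (1, num_gallery)
    return np.exp(beta * test_sims) / normalizer

def recall_at_1(sims):
    # Assumes ground-truth pairs lie on the diagonal (query i matches item i).
    ranks = (-sims).argsort(axis=1)
    return float((ranks[:, 0] == np.arange(sims.shape[0])).mean())

# Hypothetical file names following the pattern mentioned above.
t2v = np.load("msrvtt_vit16_sim_matrix.npy")       # text-to-video similarities
bank = np.load("msrvtt_vit16_train_test_t2v.npy")  # train-caption-to-test-video similarities
print("T2V R@1 before:", recall_at_1(t2v))
print("T2V R@1 after :", recall_at_1(inverted_softmax_norm(t2v, bank)))
```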
### Train
```bash
# set the following before running:
shared_latent_space=transformer/linear   # choose one: transformer or linear
# For DGL-Linear, only 0.83 MB of parameters are trained.

# MSR-VTT
bash scripts/msrvtt.sh
# VATEX
bash scripts/vatex.sh
# LSMDC
bash scripts/lsmdc.sh
# ActivityNet
bash scripts/activitynet.sh
```
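If you want to verify a parameter budget such as the 0.83 MB figure above, a small helper (not part of the repo; assumes float32 storage) can count the trainable parameters of any model:

```python
import torch

def trainable_mb(model: torch.nn.Module) -> float:
    """Size of trainable parameters in MB, assuming float32 (4 bytes each)."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n_params * 4 / (1024 ** 2)

# Example with the illustrative generator from the Method section:
# print(f"{trainable_mb(SharedPromptGenerator()):.2f} MB")
```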
## Acknowledgements
This repo is built upon these previous works.