Home

Awesome

License arXiv

PWC

PWC

PWC

PWC

CenterCLIP

CenterCLIP achieves state-of-the-art text-video retrieval performance and decent computation cost reduction on MSVD, MSRVTT, LSMDC, and ActivityNet through performing multi-segment token clustering on video tokens in the vision transformer of CLIP.

Table of Contents

<!--ts--> <!--te-->

News

Introduction

<div align="justify">

This is the code for the paper <a href="https://arxiv.org/abs/2205.00823"> CenterCLIP: Token Clustering for Efficient Text-Video Retrieval. </a> <br /> In this work, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones. As the frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are later concatenated into a new sequence, while their original spatial-temporal relations are well maintained. We instantiate two clustering algorithms to efficiently find deterministic medoids and iteratively partition groups in high dimensional space. Through this token clustering and center selection procedure, we successfully reduce computation costs by removing redundant visual tokens. This method further enhances segment-level semantic alignment between video and text representations, enforcing the spatio-temporal interactions of tokens from within-segment frames. Our method, coined as CenterCLIP, surpasses existing state-of-the-art by a large margin on typical text-video benchmarks, while reducing the training memory cost by 35% and accelerating the inference speed by 14% at the best case.

</div> <div align=center> <img src="misc/overall.png" style="zoom:100%"/> </div>

Features

We are open to pull requests.

Results

MSVD

Experiments on MSVD need at least 2 RTX 3090 GPUs.

<div align=center> <img src="misc/msvd.png" style="zoom:100%"/> </div>

ActivityNet

Experiments on ActivityNet need at least 8 Tesla V100 32GB GPUs.

<div align=center> <img src="misc/act.png" style="zoom:100%"/> </div>

MSR-VTT

<div align=center> <img src="misc/msrvtt.png" style="zoom:100%"/> </div>

LSMDC

<div align=center> <img src="misc/lsmdc.png" style="zoom:100%"/> </div>

Installation

Please install PyTorch-1.9.0 and Python3.6+. PyTorch-1.6.0+ should work.

We recommend you to use our established PyTorch docker image: zhaosssss/torch_lab:1.9.3.

docker pull zhaosssss/torch_lab:1.9.3

If you have not installed docker, see https://docs.docker.com/.

After you install docker and pull our image, you can cd to script directory and run

./run_docker.sh

to create a running docker container.

NOTE: We map some directories in run_docker.sh, if you do not have these directories, you need to modify the script. By default, run_docker.sh runs container in background and you need run docker exec -it ${DOCKER-ID} bash to do some interactive operations.

If you do not want to use docker, try

pip install -r requirements.txt

However, this is not suggested.

Prepare data

Generally, directories are organized as following:

${HOME}
├── dataset             (save the dataset) 
│   │
│   ├── activitynet           
│   ├── lsmdc        
│   └── msrvtt
│
├── models              
│   │
│   ├── eclip           (save the output checkpoints)
│   └── pretrained      (save the CLIP pre-trained weights)
│
├── github              (save the code)
│   │   
│   └── centerclip        
│       │
│       ├── dataloaders
│       ├── modules
│       ├── scripts          
│       └── preprocess 
...

CLIP urls https://github.com/openai/CLIP/blob/e58d49454c92986a1d2a6a48add2333bbfbeaf51/clip/clip.py#L36.

MSR-VTT

Download the splits and captions from CLIP4clip:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

Download the videos from Frozen️-in-Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

MSVD

Download videos from https://www.cs.utexas.edu/users/ml/clamp/videoDescription/.

Splits can be found in https://github.com/albanie/collaborative-experts/tree/master/misc/datasets/msvd.

Or you can download them from CLIP4clip

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msvd_data.zip

LSMDC

You must obtain permission from MPII to download and use the data https://sites.google.com/site/describingmovies/download.

The videos are large than 2T, you can use preprocess/download_lsmdc.py to achieve online downloading and resizing.

It is also a multi-processes LSMDC downloader. Set only_down=True for only downloading without resizing.

ActivityNet

Download from http://activity-net.org/download.html. Splits can be found in https://github.com/albanie/collaborative-experts/tree/master/misc/datasets/activity-net or in misc/splits/activitynet.

Training

For the meaning of hyper-parameters, run

python params.py --help

Or see the comments in modules/cluster/cluster.py.

LSMDC

See

scripts/lsmdc.sh

I add some experiments in the file, you can choose and run them.

Be careful about the batch_size and your gpu numbers. Generally, batch_size x #GPUs = 128 as I use 128 as the total batch size. batch_size in the scripts means single gpu batch size.

MSVD

scripts/msvd.sh

MSR-VTT

scripts/msrvtt.sh

ActivityNet

scripts/activitynet.sh

Monitoring the training process through tensorboard

tensorboard --logdir=your_logdir --port=your_port

# or run scripts/tensorboard.sh

Checkpoints

Checkpoints trained on Tesla V100 GPUs are not available now. We provide some checkpoints trained on 2 RTX 3090 GPUs for you to play around with. Results of checkpoints on LSMDC are the same as the paper's data. Checkpoints on MSR-VTT and MSVD come from middle stages of our work. They have comparable performance with the paper's results (CenterCLIP, ViT-B/32).

Third-party reproduction and checkpoints are warmly welcomed.

Each zip file contains 4 types of files

Checkpoint IDDatasetT2V R@1V2T R@1URL
eclip_new_abla_lsmdc_04lsmdc21.921.1zip file
eclip_new_abla_lsmdc_09lsmdc21.721.4zip file
eclip_new_abla_lsmdc_22lsmdc21.620.6zip file
eclip_new_abla_lsmdc_23lsmdc21.419.5zip file
eclip_msrvtt_62msrvtt (7k) / 1k-A44.141.9zip file
eclip_msrvtt_63msrvtt (7k) / 1k-A44.243.2zip file
eclip_msrvtt_80msrvtt (7k) / 1k-A43.942.6zip file
eclip_msvd_22msvd47.561.4zip file

Set

# train or eval
do_train=0
do_eval=1

in the training scripts to get the evaluation results of these checkpoints.

Corresponding settings are ready in the bash scripts.

Citations

@inproceedings{2022_centerclip,
  author    = {Shuai Zhao and Linchao Zhu and Xiaohan Wang and Yi Yang},
  title     = {CenterCLIP: Token Clustering for Efficient Text-Video Retrieval},
  booktitle = {{SIGIR} '22: The 45th International {ACM} {SIGIR} Conference on Research
			   and Development in Information Retrieval, July 11–15, 2022, Madrid, Spain},
  year      = {2022},
}

Licenses

This project is under the CC-BY-NC 4.0 license. See LICENSE for details..

Acknowledgements

<!--ts--> <!--te-->