<div align="center">

【CVPR'2023 Highlight🔥&TPAMI】Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

</div>

This repository contains the implementation of the CVPR 2023 Highlight (Top 10%) paper Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning.

In this paper, we model video and text as players in a multivariate cooperative game. This formulation lets us handle the uncertainty of fine-grained semantic interaction, where the interacting units have diverse granularity, combine flexibly, and interact with vague intensity.
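
To make the game-theoretic framing concrete, here is a minimal sketch of the standard Banzhaf interaction index for a pair of players, computed by brute force over all coalitions of the remaining players. It is not the code in HBI/models/banzhaf.py; the player names and the toy value function are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def banzhaf_interaction(players, value, i, j):
    """Standard Banzhaf interaction index of the pair (i, j).

    Averages v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    over every subset S of the remaining players.
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    count = 0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += (value(s | {i, j}) - value(s | {i})
                      - value(s | {j}) + value(s))
            count += 1
    return total / count  # count equals 2 ** (n - 2)


# Hypothetical toy game: a coalition pays off only when the frame
# "frame_dog" and the word "dog" are both present.
PLAYERS = ["frame_dog", "frame_sky", "dog"]


def toy_value(coalition):
    return 1.0 if {"frame_dog", "dog"} <= coalition else 0.0


print(banzhaf_interaction(PLAYERS, toy_value, "frame_dog", "dog"))  # 1.0
print(banzhaf_interaction(PLAYERS, toy_value, "frame_sky", "dog"))  # 0.0
```

A positive value means the two players contribute more together than apart. The brute-force loop grows exponentially with the number of players, so it is only practical for toy-sized games.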

📌 Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@inproceedings{jin2023video,
  title={Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning},
  author={Jin, Peng and Huang, Jinfa and Xiong, Pengfei and Tian, Shangxuan and Liu, Chang and Ji, Xiangyang and Yuan, Li and Chen, Jie},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2472--2482},
  year={2023}
}
<details open><summary>💡 I also have other text-video retrieval projects that may interest you ✨. </summary><p>

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model<br> Accepted by ICCV 2023 | [DiffusionRet Code]<br> Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations<br> Accepted by NeurIPS 2022 | [EMCL Code]<br> Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David Clifton, Jie Chen

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment<br> Accepted by IJCAI 2023 | [DiCoSA Code]<br> Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

</p></details>

📣 Updates

⚡ Demo

<div align="center">

https://user-images.githubusercontent.com/53246557/221760113-4a523e7e-d743-4dff-9f16-357ab0be0d5b.mp4

</div>

😍 Visualization

Example 1

<div align=center> <img src="static/images/Visualization_1.png" width="800px"> </div> <details> <summary><b>More examples</b></summary>

Example 2

<div align=center> <img src="static/images/Visualization_2.png" width="800px"> </div>

Example 3

<div align=center> <img src="static/images/Visualization_3.png" width="800px"> </div>

Example 4

<div align=center> <img src="static/images/Visualization_4.png" width="800px"> </div>

Example 5

<div align=center> <img src="static/images/Visualization_5.png" width="800px"> </div>

Example 6

<div align=center> <img src="static/images/Visualization_6.png" width="800px"> </div>

Example 7

<div align=center> <img src="static/images/Visualization_0.png" width="800px"> </div> </details>

🚀 Quick Start

Setup

Setup code environment

conda create -n HBI python=3.9
conda activate HBI
pip install -r requirements.txt
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
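
After installation, a quick check like the following can confirm that the pinned versions were actually picked up. This is a sketch using only the standard library; the helper name `check_versions` is ours and is not part of this repo.

```python
from importlib import metadata

# Versions pinned in the setup commands above.
EXPECTED = {"torch": "1.8.1+cu102", "torchvision": "0.9.1+cu102"}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (found, found == wanted)
    return report


for package, (found, ok) in check_versions(EXPECTED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{package}: installed={found} expected={EXPECTED[package]} [{status}]")
```

Run it inside the activated HBI environment; a MISMATCH line usually means another torch build was already installed.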

Download CLIP Model

cd HBI/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
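
The long directory name in each CLIP URL has the form of a SHA-256 digest, and we assume here that it is the digest of the checkpoint file. Under that assumption, the sketch below checks a downloaded file for corruption. The local file location is also an assumption; adjust it to where you ran wget.

```python
import hashlib
from pathlib import Path

URL = ("https://openaipublic.azureedge.net/clip/models/"
       "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/"
       "ViT-B-32.pt")


def expected_sha256(url):
    """Take the second-to-last path component of the URL as the digest."""
    return url.split("/")[-2]


def file_sha256(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large checkpoint is not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


checkpoint = Path("ViT-B-32.pt")  # assumed to be in the current directory
if checkpoint.exists():
    matches = file_sha256(checkpoint) == expected_sha256(URL)
    print("checksum ok" if matches else "checksum mismatch, re-download")
else:
    print(f"{checkpoint} not found")
```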

Download Datasets

<div align=center>
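
| Datasets | Google Cloud | Baidu Yun | Peking University Yun |
|:--------:|:------------:|:---------:|:---------------------:|
| MSR-VTT | Download | Download | Download |
| MSVD | Download | Download | Download |
| ActivityNet | TODO | Download | Download |
| DiDeMo | TODO | Download | Download |
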
</div>

Train the Banzhaf Interaction Estimator

Train the estimator on the labels generated by BanzhafInteraction in HBI/models/banzhaf.py.

The training code is provided in banzhaf_estimator.py. We also provide our trained weights; if you want to apply the estimator to other tasks, you can initialize a new estimator with these weights.

We tested Estimator_1e-2_epoch6, which reaches an R@1 of 48.2 (log) on the MSR-VTT dataset. If you want better performance, you can train the estimator with a smaller learning rate and for more epochs.
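
For readers unfamiliar with the metric, R@K is the percentage of text queries whose matching video appears among the top K retrieved videos. The sketch below computes it from a text-to-video similarity matrix. It assumes that text i matches video i (one caption per test video) and is not the evaluation code of this repo; the matrix values are made up.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text i and video j; the correct
    video for text i is assumed to be video i (diagonal ground truth).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of videos scoring strictly higher than the true one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Hypothetical 3x3 similarity matrix (rows: texts, columns: videos).
sim = [
    [0.9, 0.1, 0.3],  # text 0: true video ranked first
    [0.2, 0.4, 0.8],  # text 1: true video ranked second
    [0.1, 0.2, 0.7],  # text 2: true video ranked first
]
print(recall_at_k(sim, 1))
print(recall_at_k(sim, 5))
```

In this toy matrix two of the three queries rank their video first, so R@1 is two thirds of 100, while every query succeeds at K = 5.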

<div align=center>
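
| Models | Google Cloud | Baidu Yun | Peking University Yun | log |
|:------:|:------------:|:---------:|:---------------------:|:---:|
| Estimator_1e-2_epoch1 | Download | Download | Download | log |
| Estimator_1e-2_epoch2 | Download | Download | Download | log |
| Estimator_1e-2_epoch3 | Download | Download | Download | log |
| Estimator_1e-2_epoch4 | Download | Download | Download | log |
| Estimator_1e-2_epoch5 | Download | Download | Download | log |
| Estimator_1e-2_epoch6 | Download | Download | Download | log |
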
</div>
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=4 \
banzhaf_estimator.py \
--do_train 1 \
--workers 8 \
--n_display 1 \
--epochs 10 \
--lr 1e-2 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 24 \
--max_frames 12 \
--video_framerate 1 \
--output_dir ${OUTPUT_PATH} 

Text-video Retrieval

<div align=center>
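
| Checkpoint | Google Cloud | Baidu Yun | Peking University Yun |
|:----------:|:------------:|:---------:|:---------------------:|
| MSR-VTT | Download | Download | Download |
| ActivityNet | Download | Download | Download |
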
</div>

Eval on MSR-VTT

CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_eval 1 \
--workers 8 \
--n_display 50 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 24 \
--max_frames 12 \
--video_framerate 1 \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH} 
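
The launch commands in this README all pair CUDA_VISIBLE_DEVICES with a matching --nproc_per_node. If you script your runs, a small helper can derive one from the other so they never disagree. The helper below is a hypothetical convenience wrapper, not part of this repo; it only assembles the command and prints it.

```python
import os
import sys


def build_launch_command(devices, script, options, master_port=2502):
    """Return (env, argv) for a torch.distributed.launch run.

    devices is a comma-separated GPU list such as "0,1"; the number of
    processes per node is derived from it so the two cannot disagree.
    """
    gpu_ids = [d.strip() for d in devices.split(",") if d.strip()]
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(gpu_ids))
    argv = [
        sys.executable, "-m", "torch.distributed.launch",
        "--master_port", str(master_port),
        f"--nproc_per_node={len(gpu_ids)}",
        script,
    ]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return env, argv


env, argv = build_launch_command(
    "0,1",
    "main_retrieval.py",
    {"do_eval": 1, "workers": 8, "batch_size_val": 128, "datatype": "msrvtt"},
)
print(" ".join(argv))
# To run it: import subprocess; subprocess.run(argv, env=env, check=True)
```

Only a few of the flags are shown; add the path arguments (annotation, video, checkpoint and output paths) to the options dictionary in the same way.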

Train on MSR-VTT

CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 24 \
--max_frames 12 \
--video_framerate 1 \
--estimator ${ESTIMATOR_PATH} \
--output_dir ${OUTPUT_PATH} \
--kl 2 \
--skl 1

Eval on ActivityNet Captions

CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_eval 1 \
--workers 8 \
--n_display 50 \
--batch_size_val 128 \
--anno_path ${DATA_PATH}/ActivityNet \
--video_path ${DATA_PATH}/ActivityNet/Activity_Videos \
--datatype activity \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH} 

Train on ActivityNet Captions

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=8 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 10 \
--epochs 10 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path ${DATA_PATH}/ActivityNet \
--video_path ${DATA_PATH}/ActivityNet/Activity_Videos \
--datatype activity \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--estimator ${ESTIMATOR_PATH} \
--output_dir ${OUTPUT_PATH} \
--kl 2 \
--skl 1

Video-question Answering

<div align=center>

| Checkpoint | Google Cloud | Baidu Yun | Peking University Yun |
|:----------:|:------------:|:---------:|:---------------------:|
| MSR-VTT-QA | Download | Download | Download |

</div>

Eval on MSR-VTT-QA

CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_vqa.py \
--do_eval \
--num_thread_reader=8 \
--train_csv data/MSR-VTT/qa/train.jsonl \
--val_csv data/MSR-VTT/qa/test.jsonl \
--data_path data/MSR-VTT/qa/train_ans2label.json \
--features_path ${DATA_PATH}/MSRVTT_Videos \
--max_words 32 \
--max_frames 12 \
--batch_size_val 16 \
--datatype msrvtt \
--expand_msrvtt_sentences  \
--feature_framerate 1 \
--freeze_layer_num 0  \
--slice_framepos 2 \
--loose_type \
--linear_patch 2d \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}
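
The --max_frames and --slice_framepos flags control how many frames of each video are kept. The sketch below illustrates one plausible reading, assuming the flag values follow the CLIP4Clip convention (0 keeps the head, 1 keeps the tail, 2 samples uniformly). The function name is ours, and the rounding may differ from what this repo actually does.

```python
def sample_frame_indices(num_frames, max_frames, slice_framepos=2):
    """Pick at most max_frames frame indices from a clip of num_frames.

    slice_framepos: 0 keeps the head, 1 keeps the tail, 2 samples uniformly
    (assumed semantics, see the note above).
    """
    if num_frames <= max_frames:
        return list(range(num_frames))
    if slice_framepos == 0:
        return list(range(max_frames))
    if slice_framepos == 1:
        return list(range(num_frames - max_frames, num_frames))
    if max_frames == 1:
        return [0]
    # Uniform: evenly spaced positions from the first to the last frame.
    step = (num_frames - 1) / (max_frames - 1)
    return [int(round(i * step)) for i in range(max_frames)]


# A 30-second clip decoded at 1 frame per second, capped at 12 frames.
print(sample_frame_indices(30, 12, slice_framepos=2))
```

With uniform sampling the first and last frames are always kept, which preserves coverage of the whole clip when it is longer than the frame budget.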

Train on MSR-VTT-QA

CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_vqa.py \
--do_train \
--num_thread_reader=8 \
--epochs=5 \
--batch_size=32 \
--n_display=50 \
--train_csv data/MSR-VTT/qa/train.jsonl \
--val_csv data/MSR-VTT/qa/test.jsonl \
--data_path data/MSR-VTT/qa/train_ans2label.json \
--features_path ${DATA_PATH}/MSRVTT_Videos \
--lr 1e-4 \
--max_words 32 \
--max_frames 12 \
--batch_size_val 16 \
--datatype msrvtt \
--expand_msrvtt_sentences  \
--feature_framerate 1 \
--coef_lr 1e-3 \
--freeze_layer_num 0  \
--slice_framepos 2 \
--loose_type \
--linear_patch 2d \
--estimator ${ESTIMATOR_PATH} \
--output_dir ${OUTPUT_PATH} \
--kl 2 \
--skl 1

🎗️ Acknowledgments

Our code is based on EMCL, CLIP, CLIP4Clip and DRL. We sincerely appreciate their contributions.