# Cross-Modal-Adapter
<p align="center"> <img src='imgs/figure1.png' align="center" width="650px"> </p>

This repository will be the official PyTorch implementation for Cross-Modal Adapter.
Title: Cross-Modal Adapter for Text-Video Retrieval
Authors: Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, Gao Huang (Corresponding Author)
Institute: Tsinghua University, BNRist, and Beijing Institute of Technology
Publication: arXiv preprint (arXiv:2211.09623)
Contact: jhj20 at mails dot tsinghua dot edu dot cn
## Overview
In this paper, we present a novel Cross-Modal Adapter for parameter-efficient fine-tuning. Although surprisingly simple, our approach has three notable benefits: (1) it reduces the number of fine-tuned parameters by 99.6% and alleviates overfitting, (2) it saves approximately 30% of training time, and (3) it keeps all pre-trained parameters fixed, so the same pre-trained model can be shared across datasets.
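To make the idea concrete, here is a minimal PyTorch sketch of adapter-style parameter-efficient fine-tuning: a small bottleneck module with a residual connection is trained while the pre-trained backbone stays frozen. The module name, bottleneck width, and placement are illustrative assumptions for this sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter sketch: down-project, nonlinearity, up-project,
    plus a residual connection (dimensions are illustrative assumptions)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapter starts as an identity
        # function and training begins from the pre-trained behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


# Freeze a (stand-in) pre-trained backbone; only the adapter is trainable.
backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False

adapter = Adapter(512)

x = torch.randn(2, 16, 512)          # (batch, tokens, dim)
out = adapter(backbone(x))           # adapter applied after the frozen layer
trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
```

Because only the small adapter receives gradients, the optimizer state and the fine-tuned checkpoint are tiny compared to the full backbone, which is what enables sharing one pre-trained model across datasets.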
<p align="center"> <img src='imgs/figure2.png' align="center" width="800px"> </p>

## Results
1. Text-to-video and video-to-text retrieval results on MSR-VTT.
<p align="center"> <img src='imgs/msrvtt.png' align="center" width="800px"> </p>

2. Text-to-video and video-to-text retrieval results on MSVD, VATEX, DiDeMo, and ActivityNet.
<p align="center"> <img src='imgs/other_four.png' align="center" width="800px"> </p>

3. Training efficiency.
<p align="center"> <img src='imgs/efficiency_8gpu.png' align="center" width="800px"> </p>

4. Visualizations.
<p align="center"> <img src='imgs/visualization.png' align="center" width="800px"> </p>

## Acknowledgment
Our implementation is mainly based on the following codebases. We sincerely thank the authors for their wonderful work.
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.
- hyperformer: Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks.