
<div align="center"> <h1> Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning </h1> <h5 align="center">

</h5> </div>

This repository is the official implementation of Side4Video, which significantly reduces the training memory cost for action recognition and text-video retrieval tasks.

📰 News

Feb 28, 2024. We release our code for Action Recognition and Text-Video Retrieval.
Nov 28, 2023. We release our paper on arXiv.

🗺️ Overview

<div align=center> <img width="795" alt="image" src="imgs/Side4Video.png"> </div>

🚀 Training and Testing

For training and testing our model, please refer to the Recognition and Retrieval folders.

📊 Results

<div align=center> <img width="800" alt="image" src="imgs/memory.png"> </div>

Our best model achieves an accuracy of 67.3% and 74.6% on Something-Something V1 and V2, respectively, 88.6% on Kinetics-400, and a Recall@1 of 52.3% on MSR-VTT, 56.1% on MSVD, and 68.8% on VATEX.
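
For readers unfamiliar with the retrieval metric, here is a minimal sketch of how Recall@K is commonly computed for text-video retrieval. It is not taken from this repository's evaluation code: the function name `recall_at_k`, the toy similarity matrix, and the assumption that query `i` matches video `i` are all illustrative.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose matching video ranks in the top k.

    similarity[i][j] is a score between text query i and video j.
    Assumption: the ground-truth match for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the ground-truth video = number of videos scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(toy, k=1))
print(recall_at_k(toy, k=2))
```

Note that counting only strictly higher scores treats ties in favor of the ground-truth video; a real evaluation script may break ties differently.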

🖇️ Citation

If you find this repository useful, please star🌟 this repo and cite🖇️ our paper.

@article{yao2023side4video,
  title={Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning},
  author={Yao, Huanjin and Wu, Wenhao and Li, Zhiheng},
  journal={arXiv preprint arXiv:2311.15769},
  year={2023}
}

👍 Acknowledgment

Our implementation is mainly based on the following codebases. We are sincerely grateful for their work.

Text4Vis: Revisiting Classifier: Transferring Vision-Language Models for Video Recognition.
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.

📧 Contact

If you have any questions about this repository, please file an issue or contact Huanjin Yao or Wenhao Wu.