Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, [Yingya Zhang], Changxin Gao, [Deli Zhao], Nong Sang <br/> In ICCV, 2023. [Paper].

<br/> <div align="center"> <img src="framework.jpg" /> </div> <br/>

Latest

[2023-09] Code and models are available!

This repo is built on top of the TAdaConv repo.

Installation

Requirements:

Optional requirements

Model Zoo

| Dataset | Architecture | Pre-training | #Frames | Acc@1 | Acc@5 | Checkpoint | Config |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SSV2 | ViT-B/16 | CLIP | 8 | 68.7 | 91.1 | [google drive] | vit-b16-8+16f |
| SSV2 | ViT-B/16 | CLIP | 16 | 70.2 | 92.0 | [google drive] | vit-b16-16+32f |
| SSV2 | ViT-B/16 | CLIP | 32 | 70.9 | 92.1 | [google drive] | vit-b16-32+64f |
| SSV2 | ViT-L/14 | CLIP | 32 | 73.1 | 93.2 | [google drive] | vit-l14-32+64f |
| K400 | ViT-B/16 | CLIP | 8 | 83.6 | 96.3 | [google drive] | vit-b16-8+16f |
| K400 | ViT-B/16 | CLIP | 16 | 84.4 | 96.7 | [google drive] | vit-b16-16+32f |
| K400 | ViT-B/16 | CLIP | 32 | 85.0 | 97.0 | [google drive] | vit-b16-32+64f |
| K400 | ViT-L/14 | CLIP | 32 | 88.0 | 97.9 | [google drive] | vit-l14-32+64f |
| K400 | ViT-L/14 | CLIP + K710 | 32 | 89.6 | 98.4 | [google drive] | vit-l14-32+64f |

Running instructions

You can find some pre-trained models in the Model Zoo.

For detailed explanations on the approach itself, please refer to the paper.

For an example run, set DATA_ROOT_DIR and ANNO_DIR in configs/projects/dist/vit_base_16_ssv2.yaml and OUTPUT_DIR in configs/projects/dist/ssv2-cn/vit-b16-8+16f_e001.yaml, then run the fine-tuning command:

```
python runs/run.py --cfg configs/projects/dist/ssv2-cn/vit-b16-8+16f_e001.yaml
```
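The exact config schema is defined by the repo itself; as a hypothetical sketch (the paths below are placeholders, and any field other than DATA_ROOT_DIR, ANNO_DIR, and OUTPUT_DIR is an assumption), the edits look like:

```yaml
# In configs/projects/dist/vit_base_16_ssv2.yaml:
DATA_ROOT_DIR: /path/to/ssv2/videos        # root directory containing the video files
ANNO_DIR: /path/to/ssv2/annotations        # directory containing the annotation lists

# In configs/projects/dist/ssv2-cn/vit-b16-8+16f_e001.yaml:
OUTPUT_DIR: /path/to/output                # where checkpoints and logs are written
```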

We fine-tune with 8 NVIDIA V100 GPUs, with a batch size of 32 video clips per GPU.
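If you train with a different number of GPUs, keep the effective global batch size in mind when adjusting hyperparameters. A minimal back-of-envelope check for the setup above:

```python
# Global batch size implied by the fine-tuning setup described above:
# 8 GPUs, each holding 32 video clips per step.
NUM_GPUS = 8
CLIPS_PER_GPU = 32

global_batch_size = NUM_GPUS * CLIPS_PER_GPU
print(global_batch_size)  # 256
```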

Citing DiST

If you find DiST useful for your research, please consider citing the paper as follows:

```
@inproceedings{qing2023dist,
  title={Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning},
  author={Qing, Zhiwu and Zhang, Shiwei and Huang, Ziyuan and Zhang, Yingya and Gao, Changxin and Zhao, Deli and Sang, Nong},
  booktitle={ICCV},
  year={2023}
}
```