# Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang <br/> In ICCV, 2023. [Paper].
<br/> <div align="center"> <img src="framework.jpg" /> </div> <br/>

## Latest
[2023-09] Code and models are available!
This repo is built as a modification of the TAdaConv repo.
## Installation
Requirements:
- Python>=3.6
- torch>=1.5
- torchvision (version corresponding with torch)
- simplejson==3.11.1
- decord>=0.6.0
- pyyaml
- einops
- oss2
- psutil
- tqdm
- pandas
Optional requirements:
- fvcore (for flops calculation)
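If you are setting up from scratch, a minimal install sketch (package names and version pins are taken from the list above; choose the torch/torchvision pair that matches your CUDA setup):

```bash
# Core requirements from the list above; torchvision must match the torch version.
pip install "torch>=1.5" torchvision simplejson==3.11.1 "decord>=0.6.0" \
    pyyaml einops oss2 psutil tqdm pandas

# Optional: FLOPs calculation.
pip install fvcore
```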
## Model Zoo
Dataset | Architecture | Pre-training | #Frames | Acc@1 (%) | Acc@5 (%) | Checkpoint | Config |
---|---|---|---|---|---|---|---|
SSV2 | ViT-B/16 | CLIP | 8 | 68.7 | 91.1 | [google drive] | vit-b16-8+16f |
SSV2 | ViT-B/16 | CLIP | 16 | 70.2 | 92.0 | [google drive] | vit-b16-16+32f |
SSV2 | ViT-B/16 | CLIP | 32 | 70.9 | 92.1 | [google drive] | vit-b16-32+64f |
SSV2 | ViT-L/14 | CLIP | 32 | 73.1 | 93.2 | [google drive] | vit-l14-32+64f |
K400 | ViT-B/16 | CLIP | 8 | 83.6 | 96.3 | [google drive] | vit-b16-8+16f |
K400 | ViT-B/16 | CLIP | 16 | 84.4 | 96.7 | [google drive] | vit-b16-16+32f |
K400 | ViT-B/16 | CLIP | 32 | 85.0 | 97.0 | [google drive] | vit-b16-32+64f |
K400 | ViT-L/14 | CLIP | 32 | 88.0 | 97.9 | [google drive] | vit-l14-32+64f |
K400 | ViT-L/14 | CLIP + K710 | 32 | 89.6 | 98.4 | [google drive] | vit-l14-32+64f |
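
To sanity-check a downloaded checkpoint before fine-tuning or evaluation, here is a minimal sketch using plain PyTorch (the file name and the `model_state` key are assumptions; the actual checkpoint layout may differ):

```python
import torch

# Load a downloaded checkpoint on CPU; the file name is a placeholder.
ckpt = torch.load("vit-b16-8+16f.pth", map_location="cpu")

# Some checkpoints wrap the weights under a key such as "model_state";
# fall back to treating the file as a raw state dict (an assumption).
state_dict = ckpt.get("model_state", ckpt)

# Print a few parameter names and shapes to verify the download is intact.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```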
## Running instructions
You can find the pre-trained models in the Model Zoo above.
For detailed explanations on the approach itself, please refer to the paper.
For an example run, set `DATA_ROOT_DIR` and `ANNO_DIR` in `configs/projects/dist/vit_base_16_ssv2.yaml`, and `OUTPUT_DIR` in `configs/projects/dist/ssv2-cn/vit-b16-8+16f_e001.yaml`, then run the command for fine-tuning:
```bash
python runs/run.py --cfg configs/projects/dist/ssv2-cn/vit-b16-8+16f_e001.yaml
```
We fine-tune with 8 NVIDIA V100 GPUs and 32 video clips per GPU, for an effective batch size of 256.
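Putting the steps together, a sketch of the whole workflow (all local paths below are placeholders for illustration):

```bash
# 1) Edit configs/projects/dist/vit_base_16_ssv2.yaml and point the data keys
#    at your local copies (paths below are hypothetical):
#      DATA_ROOT_DIR: /data/ssv2/videos
#      ANNO_DIR:      /data/ssv2/annotations
#
# 2) Edit configs/projects/dist/ssv2-cn/vit-b16-8+16f_e001.yaml and set the
#    output location (hypothetical path):
#      OUTPUT_DIR: /experiments/dist/vit-b16-8+16f
#
# 3) Launch fine-tuning.
python runs/run.py --cfg configs/projects/dist/ssv2-cn/vit-b16-8+16f_e001.yaml
```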
## Citing DiST
If you find DiST useful for your research, please consider citing the paper as follows:
```bibtex
@inproceedings{qing2023dist,
  title={Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning},
  author={Qing, Zhiwu and Zhang, Shiwei and Huang, Ziyuan and Zhang, Yingya and Gao, Changxin and Zhao, Deli and Sang, Nong},
  booktitle={ICCV},
  year={2023}
}
```