Dynamic-Vision-Transformer (NeurIPS 2021)

This repo contains the official MindSpore code for the Dynamic Vision Transformer (DVT).

Introduction

<p align="center"> <img src="https://github.com/blackfeather-wang/Dynamic-Vision-Transformer/blob/main/figures/examples.png" width= "400"> </p>

We develop a Dynamic Vision Transformer (DVT) to automatically configure a proper number of tokens for each individual image, leading to a significant improvement in computational efficiency, both theoretically and empirically.

<p align="center"> <img src="https://github.com/blackfeather-wang/Dynamic-Vision-Transformer/blob/main/figures/overview.png" width= "810"> </p>
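Concretely, DVT is a cascade of vision transformers that represent each image with an increasing number of tokens (the configs below use a 12-layer backbone with 49 and then 196 tokens). At test time the stages run in order, and inference stops as soon as a stage is sufficiently confident. Below is a minimal sketch of this early-exit loop; `stage(image, reuse=cache)` and `threshold` are illustrative names, not this repo's actual API.

```python
# Hypothetical sketch of DVT's early-exit inference (not this repo's API).
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def dvt_predict(image, stages, threshold=0.9):
    """Run a cascade of ViT stages with increasing token counts
    (e.g. 49 tokens, then 196 tokens).

    Each stage may reuse the previous stage's features / attention maps;
    inference stops once the softmax confidence clears the threshold.
    """
    cache = None  # features and attention maps reused by the next stage
    probs = None
    for stage in stages:
        logits, cache = stage(image, reuse=cache)  # assumed interface
        probs = softmax(logits)
        if probs.max() >= threshold:  # confident enough: exit early
            break
    return int(probs.argmax())
```

"Easy" images thus exit after the cheap 49-token stage, and only harder images pay for the full 196-token stage, which is where the computational savings come from.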

Training

Execute the scripts from the `src` directory. Each run creates the directory `../results/{DATETIME}__{EXPERIMENT_NAME}` and places its results there. In the experiment names below, `w_f`/`n_f` and `w_r`/`n_r` select training with or without feature reuse and relation reuse (see the sketch after the commands).

```bash
bash scripts/train_ascend.sh {0-7} EXPERIMENT_NAME --config=CONFIG_PATH --device {Ascend (default)|GPU} [TRAIN.PY_ARGUMENTS]
```

```bash
# training with feature reuse and relation reuse
bash scripts/train_ascend.sh 0-7 deit_dvt_12_49_196_w_f_w_r_adamw_originhead_dataaug_mixup --config=configs/local/vit_dvt/deit_dvt_12_49_196_w_f_w_r_adamw_originhead_dataaug_mixup.yml.j2

# training with feature reuse, without relation reuse
bash scripts/train_ascend.sh 0-7 deit_dvt_12_49_196_w_f_n_r_adamw_originhead_dataaug_mixup --config=configs/local/vit_dvt/deit_dvt_12_49_196_w_f_n_r_adamw_originhead_dataaug_mixup.yml.j2

# training without feature reuse, with relation reuse
bash scripts/train_ascend.sh 0-7 deit_dvt_12_49_196_n_f_w_r_adamw_originhead_dataaug_mixup --config=configs/local/vit_dvt/deit_dvt_12_49_196_n_f_w_r_adamw_originhead_dataaug_mixup.yml.j2

# training without feature reuse and without relation reuse
bash scripts/train_ascend.sh 0-7 deit_dvt_12_49_196_n_f_n_r_adamw_originhead_dataaug_mixup --config=configs/local/vit_dvt/deit_dvt_12_49_196_n_f_n_r_adamw_originhead_dataaug_mixup.yml.j2

# inference with feature reuse and relation reuse
bash scripts/inference_ascend.sh 0 deit_dvt_12_49_196_w_f_w_r_adamw_originhead_dataaug_mixup_inference --config=configs/local/vit_dvt/deit_dvt_12_49_196_w_f_w_r_adamw_originhead_dataaug_mixup_inference.yml.j2
```
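For reference: feature reuse feeds an upstream (fewer-token) stage's final token features into the downstream stage as extra context, while relation reuse upsamples the upstream attention maps so the downstream stage can build on them instead of recomputing attention from scratch. Below is a rough numpy sketch of the upsampling step at the heart of relation reuse; it ignores the class token, uses plain addition in place of the paper's learned integration, and all names are illustrative rather than this repo's API.

```python
import numpy as np

def upsample_relation(attn, h=7, H=14):
    """Nearest-neighbour upsample an (h*h, h*h) attention map from the
    upstream 7x7-token stage onto the downstream 14x14-token grid."""
    r = H // h
    a = attn.reshape(h, h, h, h)               # (q_row, q_col, k_row, k_col)
    a = a.repeat(r, axis=0).repeat(r, axis=1)  # upsample the query grid
    a = a.repeat(r, axis=2).repeat(r, axis=3)  # upsample the key grid
    return a.reshape(H * H, H * H)

# The downstream stage then treats the reused relation as a prior, roughly:
#   attn_logits = q @ k.T / scale + upsample_relation(upstream_attn_logits)
upstream_attn = np.random.randn(49, 49)
print(upsample_relation(upstream_attn).shape)  # (196, 196)
```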

Results

| model | FLOPs (G) | acc. (%) |
| --- | --- | --- |
| deit-s/16 | 4.608 | 78.67 |
| deit-s/32 | 1.145 | 72.116 |
| vit-b/16 | 17.58 | 79.1 |
| vit-b/32 | 4.41 | 73.972 |

<p align="center"> <img src="deit_dvt_vs_vit_inference.png" width= "500"> </p> <p align="center"> <img src="https://github.com/blackfeather-wang/Dynamic-Vision-Transformer/raw/main/figures/result_visual.png" width= "700"> </p>

Requirements

The code is implemented in MindSpore and targets Ascend devices by default (GPU is also supported via the `--device` flag above).

Citation

If you find this work valuable or use our code in your own research, please consider citing us with the following BibTeX entry:

```
@inproceedings{wang2021not,
    title = {Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition},
    author = {Wang, Yulin and Huang, Rui and Song, Shiji and Huang, Zeyi and Huang, Gao},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    year = {2021}
}
```

Contact

This is the MindSpore implementation of DVT. If you have any questions, please feel free to contact Yulin Wang (wang-yl19@mails.tsinghua.edu.cn) or Guanfu Chen (guanfuchen@zju.edu.cn).