
As-ViT: Auto-scaling Vision Transformers without Training [PDF]


MIT licensed

Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou

In ICLR 2022.

Note: This code base implements topology search (Sec. 3.3) and scaling (Sec. 3.4) in PyTorch. Our training code is based on TensorFlow and Keras on TPU and will be released soon.

Overview

We present As-ViT, a framework that unifies automatic architecture design and scaling for vision transformers (ViTs) using a training-free strategy.

Highlights:

<p align="center"> <img src="images/github_teaser.png" alt="teaser" width="1000"/></br> <span align="center"><b>Left</b>: Length Distortion shows a strong correlation with ViT's accuracy. <b>Middle</b>: Auto scaling rule of As-ViT. <b>Right</b>: Progressive re-tokenization for efficient ViT training.</span> </p>

Prerequisites

This repository has been tested on an NVIDIA V100 GPU. Configurations may need to be adjusted for other platforms.

Installation

git clone https://github.com/VITA-Group/AsViT.git
cd AsViT
pip install -r requirements.txt

1. Seed As-ViT Topology Search

CUDA_VISIBLE_DEVICES=0 python ./search/reinforce.py --save_dir ./output/REINFORCE-imagenet --data_path /path/to/imagenet

This job returns a seed topology. For example, our searched seed topology is 8,2,3|4,1,2|4,1,4|4,1,6|32, which can be read as follows:

<table><thead><tr><th colspan="3">Stage1</th><th colspan="3">Stage2</th><th colspan="3">Stage3</th><th colspan="3">Stage4</th><th rowspan="2">Head</th></tr><tr><th>Kernel K1</th><th>Split S1</th><th>Expansion E1</th><th>Kernel K2</th><th>Split S2</th><th>Expansion E2</th><th>Kernel K3</th><th>Split S3</th><th>Expansion E3</th><th>Kernel K4</th><th>Split S4</th><th>Expansion E4</th></tr></thead><tbody><tr><td>8</td><td>2</td><td>3</td><td>4</td><td>1</td><td>2</td><td>4</td><td>1</td><td>4</td><td>4</td><td>1</td><td>6</td><td>32</td></tr></tbody></table>
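The seed topology string is `|`-separated per stage, with a trailing head dimension. A minimal sketch of decoding it (`parse_seed_topology` is a hypothetical helper for illustration, not part of this repo):

```python
def parse_seed_topology(arch: str):
    """Decode a seed topology string like '8,2,3|4,1,2|4,1,4|4,1,6|32'.

    Each '|'-separated group before the last one holds one stage's
    (kernel K, split S, expansion E); the final field is the head dim.
    """
    *stage_strs, head = arch.split("|")
    stages = []
    for s in stage_strs:
        k, split, e = (int(x) for x in s.split(","))
        stages.append({"kernel": k, "split": split, "expansion": e})
    return stages, int(head)


stages, head = parse_seed_topology("8,2,3|4,1,2|4,1,4|4,1,6|32")
# stages[0] -> {"kernel": 8, "split": 2, "expansion": 3}; head -> 32
```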

2. Scaling

CUDA_VISIBLE_DEVICES=0 python ./search/grow.py --save_dir ./output/GROW-imagenet \
--arch "[arch]" --data_path /path/to/imagenet

Here [arch] is the seed topology output by step 1 above. This job returns a series of topologies. For example, our largest topology (As-ViT Large) is 8,2,3,5|4,1,2,2|4,1,4,5|4,1,6,2|32,180, which can be read as follows:

<table><thead><tr><th colspan="4">Stage1</th><th colspan="4">Stage2</th><th colspan="4">Stage3</th><th colspan="4">Stage4</th><th rowspan="2">Head</th><th rowspan="2">Initial Hidden Size</th></tr><tr><th>Kernel K1</th><th>Split S1</th><th>Expansion E1</th><th>Layers L1</th><th>Kernel K2</th><th>Split S2</th><th>Expansion E2</th><th>Layers L2</th><th>Kernel K3</th><th>Split S3</th><th>Expansion E3</th><th>Layers L3</th><th>Kernel K4</th><th>Split S4</th><th>Expansion E4</th><th>Layers L4</th></tr></thead><tbody><tr><td>8</td><td>2</td><td>3</td><td>5</td><td>4</td><td>1</td><td>2</td><td>2</td><td>4</td><td>1</td><td>4</td><td>5</td><td>4</td><td>1</td><td>6</td><td>2</td><td>32</td><td>180</td></tr></tbody></table>
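The scaled topology string extends the seed format: each stage gains a layer count, and the tail carries both the head dim and the initial hidden size. A minimal sketch of decoding it (`parse_scaled_topology` is a hypothetical helper for illustration, not part of this repo):

```python
def parse_scaled_topology(arch: str):
    """Decode a scaled topology string like
    '8,2,3,5|4,1,2,2|4,1,4,5|4,1,6,2|32,180'.

    Each stage group holds (kernel K, split S, expansion E, layers L);
    the final field holds (head dim, initial hidden size).
    """
    *stage_strs, tail = arch.split("|")
    head, hidden = (int(x) for x in tail.split(","))
    stages = []
    for s in stage_strs:
        k, split, e, layers = (int(x) for x in s.split(","))
        stages.append({"kernel": k, "split": split,
                       "expansion": e, "layers": layers})
    return stages, head, hidden


stages, head, hidden = parse_scaled_topology(
    "8,2,3,5|4,1,2,2|4,1,4,5|4,1,6,2|32,180")
# stages[0]["layers"] -> 5; head -> 32; hidden -> 180
```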

3. Evaluation

The TensorFlow and Keras code for training on TPU will be released soon.

Citation

@inproceedings{chen2021asvit,
  title={Auto-scaling Vision Transformers without Training},
  author={Chen, Wuyang and Huang, Wei and Du, Xianzhi and Song, Xiaodan and Wang, Zhangyang and Zhou, Denny},
  booktitle={International Conference on Learning Representations},
  year={2022}
}