Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment

This is the official implementation of Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment (IJCAI-ECAI 2022, short paper).

The paper is available at IJCAI-ECAI 2022 (main paper only) and arXiv (main paper and appendix).

Installation

Prerequisite

Hardware

Software

timm library

pip install timm==0.4.9
pip install git+https://github.com/rwightman/pytorch-image-models@more_datasets # 0.5.0

Other libraries

pip install -r requirements.txt

Dataset

Download each dataset and unzip it into the following directories.

./datasets/imagenet2012/train
./datasets/imagenet2012/val
./datasets/imagenet2012/val_c
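The directory layout above can be prepared ahead of time; a minimal sketch, assuming the paths from this README (the actual image files must still be downloaded and unzipped into them):

```shell
# Create the expected dataset layout (paths taken from this README).
mkdir -p ./datasets/imagenet2012/train
mkdir -p ./datasets/imagenet2012/val
mkdir -p ./datasets/imagenet2012/val_c
# List the prepared directories.
ls ./datasets/imagenet2012
```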

Quick start

(1) Argument Setting

model={'ViT-B_16', 'ViT-L_16', 'ViT_AugReg-B_16', 'ViT_AugReg-L_16', 'resnet50', 'resnet101', 'mlpmixer_B16', 'mlpmixer_L16', 'DeiT-B', 'DeiT-S', 'Beit-B16_224', 'Beit-L16_224'}
method={'cfa', 't3a', 'shot-im', 'tent', 'pl', 'source'}

(2) Fine-Tuning (Skip)

Our method does not alter the training phase, i.e., it does not require retraining models from scratch. Therefore, if a fine-tuned model is available, the fine-tuning phase can be skipped. In this implementation, we use models that are already fine-tuned on the ImageNet-2012 dataset.
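Because adaptation happens only at test time, the core idea, aligning class-conditional feature statistics computed on the source dataset with those estimated on the target data under pseudo-labels, can be sketched as follows. This is a conceptual illustration, not this repository's implementation; the paper's actual objective may include additional statistics, and `class_conditional_means` / `cfa_loss` are hypothetical names:

```python
import numpy as np

def class_conditional_means(features, labels, num_classes):
    # Per-class mean feature vector, computed once on the source dataset.
    return np.stack(
        [features[labels == c].mean(axis=0) for c in range(num_classes)]
    )

def cfa_loss(target_features, pseudo_labels, source_means):
    # Squared distance between the class-conditional means estimated on the
    # target batch (via pseudo-labels) and the stored source statistics.
    loss = 0.0
    for c in range(source_means.shape[0]):
        mask = pseudo_labels == c
        if mask.any():
            diff = target_features[mask].mean(axis=0) - source_means[c]
            loss += float(np.sum(diff ** 2))
    return loss
```

In the real method this loss would be minimized with respect to (a subset of) the model's parameters during test-time adaptation; here it is shown only on raw feature arrays.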

(3) Calculation of distribution statistics on source dataset

python main.py --calc_statistics_flag --model=${model} --method=${method}

(4) Test-Time Adaptation (TTA) on target dataset

python main.py --tta_flag --model=${model} --method=${method}
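Combining steps (3) and (4) for one configuration might look like the following. The commands are printed via `echo` so the snippet is self-contained; an actual run additionally requires the datasets prepared above:

```shell
# Hypothetical configuration (any values from the lists in step (1) work).
model=ViT-B_16
method=cfa
# Print the two commands run in sequence: source statistics, then TTA.
echo "python main.py --calc_statistics_flag --model=${model} --method=${method}"
echo "python main.py --tta_flag --model=${model} --method=${method}"
```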

Expected results

Top-1 error rate (%) on ImageNet-C with severity level 5. ViT-B_16 is used as the backbone network.

| | mean | gauss_noise | shot_noise | impulse_noise | defocus_blur | glass_blur | motion_blur | zoom_blur | snow | frost | fog | brightness | contrast | elastic_trans | pixelate | jpeg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | 61.9 | 77.7 | 75.1 | 77.0 | 66.9 | 69.1 | 58.5 | 62.8 | 60.9 | 57.6 | 62.9 | 31.6 | 88.9 | 51.9 | 45.3 | 42.9 |
| CFA | 43.9 | 56.3 | 54.3 | 55.4 | 48.5 | 47.1 | 44.3 | 44.4 | 44.8 | 44.8 | 41.1 | 25.7 | 54.2 | 33.3 | 30.5 | 33.5 |

Citation

@inproceedings{kojima2022robustvit,
  title     = {Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment},
  author    = {Kojima, Takeshi and Matsuo, Yutaka and Iwasawa, Yusuke},
  booktitle = {Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, {IJCAI-22}},
  pages     = {1009--1016},
  year      = {2022},
  month     = {7},
  url       = {https://doi.org/10.24963/ijcai.2022/141},
}

Contact