Out-of-distribution Generalization Investigation on Vision Transformers

This repository contains PyTorch evaluation code for the CVPR 2022 paper Delving Deep into the Generalization of Vision Transformers under Distribution Shifts.

Taxonomy of Distribution Shifts

<p align="middle"> <img src="https://github.com/Phoenix1153/ViT_OOD_generalization/raw/main/img/taxonomy.png" width="60%"> <p>

Illustration of our taxonomy of distribution shifts. The taxonomy is built upon which semantic concepts are modified from the original image, dividing distribution shifts into four cases: background shifts, corruption shifts, texture shifts, and style shifts. <img src="http://latex.codecogs.com/gif.latex?{\color{Red} \checkmark}" /> denotes the vision cues that remain unmodified under a given type of distribution shift. Please refer to the literature for details.

Dataset

We build OOD-Net, a collection of datasets covering four types of distribution shift together with their in-distribution counterparts, for a comprehensive investigation into out-of-distribution generalization. The download link is available here. A sketch of how such data might be loaded for evaluation follows the table below.

| Dataset | Shift Type |
| --- | --- |
| ImageNet-9 | Background Shift |
| ImageNet-C | Corruption Shift |
| Cue Conflict Stimuli | Texture Shift |
| Stylized-ImageNet | Texture Shift |
| ImageNet-R | Style Shift |
| DomainNet | Style Shift |
<!--
Please also cite these references when exploiting the collection.
- [1] Kai Yuanqing Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition. In International Conference on Learning Representations, 2021.
- [2] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019.
- [3] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
- [4] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In International Conference on Computer Vision, pages 8320-8329, 2021.
- [5] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406-1415, 2019.
-->
<!--
- **Background Shifts.** [ImageNet-9](https://github.com/MadryLab/backgrounds_challenge/releases) is adopted for background shifts. ImageNet-9 is a variety of 9-class datasets with different foreground-background recombination plans, which helps disentangle the impacts of foreground and background signals on classification. In our case, we use the four varieties of generated backgrounds with the foreground unchanged, namely 'Only-FG', 'Mixed-Same', 'Mixed-Rand' and 'Mixed-Next'. The 'Original' data set is used to represent in-distribution data.
- **Corruption Shifts.** [ImageNet-C](https://zenodo.org/record/2235448#.YMwT5JMzalY) is used to examine generalization ability under corruption shifts. ImageNet-C includes 15 types of algorithmically generated corruptions, grouped into 4 categories: 'noise', 'blur', 'weather', and 'digital'. Each corruption type has five levels of severity, resulting in 75 distinct corruptions.
- **Texture Shifts.** [Cue Conflict Stimuli](https://github.com/rgeirhos/texture-vs-shape/tree/master/stimuli/style-transfer-preprocessed-512) and [Stylized-ImageNet](https://github.com/rgeirhos/Stylized-ImageNet) are used to investigate generalization under texture shifts. Utilizing style transfer, Geirhos et al. generated the Cue Conflict Stimuli benchmark with conflicting shape and texture information, that is, the image texture is replaced by that of another class while the other object semantics are preserved. In this case, we report the shape and texture accuracy of classifiers separately for analysis. Meanwhile, Stylized-ImageNet, also produced by Geirhos et al., replaces textures with the styles of randomly selected paintings through AdaIN style transfer.
- **Style Shifts.** [ImageNet-R](https://github.com/hendrycks/imagenet-r) and [DomainNet](http://ai.bu.edu/M3SDA/) are used for the case of style shifts. ImageNet-R contains 30000 images with various artistic renditions of 200 classes of the original ImageNet validation data set. The renditions in ImageNet-R are real-world, naturally occurring variations, such as paintings or embroidery, with textures and local image statistics that differ from those of ImageNet images. DomainNet is a recent benchmark dataset for large-scale domain adaptation that consists of 345 classes and 6 domains. As the labels of some domains are very noisy, we follow the 7 distribution-shift scenarios in Saito et al. with 4 domains (Real, Clipart, Painting, Sketch) picked.
-->
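Most of these benchmarks are distributed as class-per-folder image directories, so a plain torchvision `ImageFolder` is usually enough to build an evaluation loader. The snippet below is a minimal sketch: the directory path and the 384-pixel preprocessing mirror the evaluation command further below, but they are assumptions for illustration, not the exact pipeline of `main.py`.

```python
# Minimal sketch: build an evaluation DataLoader for one OOD split.
# The path and preprocessing are illustrative assumptions, not the exact
# pipeline used by main.py.
import torch
from torchvision import datasets, transforms

eval_transform = transforms.Compose([
    transforms.Resize(384),                 # match the 384-pixel input of deit_small_b16_384
    transforms.CenterCrop(384),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Assumed directory layout: one sub-folder per class, following ImageFolder's convention.
dataset = datasets.ImageFolder("data/images/DomainNet/sketch/test", transform=eval_transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=False, num_workers=4)
```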

Generalization-Enhanced Vision Transformers

<p align="middle"> <img src="https://github.com/Phoenix1153/ViT_OOD_generalization/raw/main/img/new_DANN-1.png" width="40%"> <img src="https://github.com/Phoenix1153/ViT_OOD_generalization/raw/main/img/new_MME-1.png" width="45%"> <p> <p align="middle"> <img src="https://github.com/Phoenix1153/ViT_OOD_generalization/raw/main/img/new_SSL-1.png" width="90%"> <p>

A framework overview of the three designed generalization-enhanced ViTs. All networks use a Vision Transformer <img src="http://latex.codecogs.com/gif.latex?F" /> as the feature encoder and a label prediction head <img src="http://latex.codecogs.com/gif.latex?C" />. Under this setting, the models take labeled source examples and unlabeled target examples as input. Top left: T-ADV encourages the network to learn domain-invariant representations by introducing a domain classifier <img src="http://latex.codecogs.com/gif.latex?D" /> for domain adversarial training. Top right: T-MME leverages a minimax process on the conditional entropy of target data to reduce the distribution gap while learning discriminative features for the task. The network uses a cosine-similarity-based classifier architecture <img src="http://latex.codecogs.com/gif.latex?C" /> to produce class prototypes. Bottom: T-SSL is an end-to-end prototype-based self-supervised learning framework. The architecture uses two memory banks <img src="http://latex.codecogs.com/gif.latex?V^s" /> and <img src="http://latex.codecogs.com/gif.latex?V^t" /> to calculate cluster centroids. A cosine classifier <img src="http://latex.codecogs.com/gif.latex?C" /> is used for classification in this framework.
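To make the two recurring building blocks above concrete, here is a minimal sketch of a gradient-reversal function (the standard mechanism behind the domain adversarial training used by T-ADV) and a cosine-similarity classifier producing class prototypes (as used by T-MME and T-SSL). Feature dimensions, the temperature, and the usage at the bottom are illustrative assumptions, not the released implementation.

```python
# Sketches of two building blocks referenced above; hyper-parameters are
# illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, gradient reversal in the backward pass
    (the standard trick behind domain-adversarial training, as in T-ADV)."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class CosineClassifier(nn.Module):
    """Cosine-similarity classifier C: logits are scaled cosine similarities
    between L2-normalized features and L2-normalized class prototypes."""

    def __init__(self, feat_dim, num_classes, temperature=0.05):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.temperature = temperature

    def forward(self, features):
        features = F.normalize(features, dim=-1)
        prototypes = F.normalize(self.prototypes, dim=-1)
        return features @ prototypes.t() / self.temperature


# Usage sketch: features from the ViT encoder F go through the cosine classifier
# for label prediction, and through GradReverse into a small domain classifier D.
feats = torch.randn(8, 384)                     # e.g. a DeiT-Small feature dimension
logits = CosineClassifier(384, 345)(feats)      # 345 DomainNet classes
reversed_feats = GradReverse.apply(feats, 1.0)  # would be fed to a domain classifier D
```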

Run Our Code

Environment Installation

```
conda create -n vit python=3.6
conda activate vit
conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.0 -c pytorch
```

Before Running

```
conda activate vit
export PYTHONPATH=$PYTHONPATH:.
```

Evaluation

```
CUDA_VISIBLE_DEVICES=0 python main.py \
--model deit_small_b16_384 \
--num-classes 345 \
--checkpoint data/checkpoints/deit_small_b16_384_baseline_real.pth.tar \
--meta-file data/metas/DomainNet/sketch_test.jsonl \
--root-dir data/images/DomainNet/sketch/test
```
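If you want to inspect results outside of `main.py`, a top-1 accuracy loop over a loaded checkpoint would look roughly like the sketch below. The `model` and `loader` objects are assumed to be built elsewhere (for example, from the checkpoint above and the `ImageFolder` loader sketched in the Dataset section); this is not the repository's actual evaluation code.

```python
# Rough sketch of a top-1 accuracy evaluation loop (not the repository's main.py).
import torch

@torch.no_grad()
def evaluate_top1(model, loader, device="cuda"):
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)   # predicted class per image
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total            # top-1 accuracy in percent
```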

Experimental Results

DomainNet

DeiT_small_b16_384

confusion matrix for the baseline model

|          | clipart | painting | real  | sketch |
| -------- | ------- | -------- | ----- | ------ |
| clipart  | 80.25   | 33.75    | 55.26 | 43.43  |
| painting | 36.89   | 75.32    | 52.08 | 31.14  |
| real     | 50.59   | 45.81    | 84.78 | 39.31  |
| sketch   | 52.16   | 35.27    | 48.19 | 71.92  |
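As a reading aid, the snippet below summarizes the table into per-domain in-domain accuracy and average cross-domain accuracy. It assumes rows index the training domain and columns the evaluation domain, which is an interpretation of the table rather than something the repository states explicitly.

```python
# Summarize the matrix above: diagonal = in-domain accuracy,
# off-diagonal mean = average cross-domain accuracy.
# Row = training domain, column = evaluation domain (assumed interpretation).
domains = ["clipart", "painting", "real", "sketch"]
acc = [
    [80.25, 33.75, 55.26, 43.43],
    [36.89, 75.32, 52.08, 31.14],
    [50.59, 45.81, 84.78, 39.31],
    [52.16, 35.27, 48.19, 71.92],
]

for i, src in enumerate(domains):
    cross = [acc[i][j] for j in range(len(domains)) if j != i]
    print(f"{src:>8s}: in-domain {acc[i][i]:.2f}, avg cross-domain {sum(cross) / len(cross):.2f}")
```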

The models used above can be found here.

Remarks

Citation

If you find these investigations useful in your research, please consider citing:

```
@article{zhang2021delving,
  title={Delving deep into the generalization of vision transformers under distribution shifts},
  author={Zhang, Chongzhi and Zhang, Mingyuan and Zhang, Shanghang and Jin, Daisheng and Zhou, Qiang and Cai, Zhongang and Zhao, Haiyu and Yi, Shuai and Liu, Xianglong and Liu, Ziwei},
  journal={arXiv preprint arXiv:2106.07617},
  year={2021}
}
```