Self-Distilled Vision Transformer for Domain Generalization (ACCV'22 -- Oral)
Maryam Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan, and Fahad Shahbaz Khan
Abstract: In the recent past, several domain generalization (DG) methods have been proposed, showing encouraging performance; however, almost all of them build on convolutional neural networks (CNNs). There is little to no progress on studying the DG performance of vision transformers (ViTs), which are challenging the supremacy of CNNs on standard benchmarks, often built on the i.i.d. assumption. This renders the real-world deployment of ViTs doubtful. In this paper, we attempt to explore ViTs towards addressing the DG problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios and the main culprit is overfitting to source domains. Inspired by the modular architecture of ViTs, we propose a simple DG approach for ViTs, coined self-distillation for ViTs. It reduces overfitting to source domains by easing the learning of the input-output mapping problem through curating non-zero entropy supervisory signals for intermediate transformer blocks. Further, it does not introduce any new parameters and can be seamlessly plugged into the modular composition of different ViTs. We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets. Moreover, we report favorable performance against recent state-of-the-art DG methods. Our code, along with pre-trained models, is made publicly available.
State-of-the-Art Vision Transformers for Domain Generalization
Accuracy (%) at input resolution 224 on three ViT backbones:

Dataset | CvT-21 | DeiT-Small | T2T-ViT-14 |
---|---|---|---|
PACS | 88.9 ± 0.5 | 86.7 ± 0.2 | 87.8 ± 0.6 |
VLCS | 81.9 ± 0.4 | 81.6 ± 0.1 | 81.2 ± 0.3 |
OfficeHome | 77.0 ± 0.2 | 72.5 ± 0.3 | 75.5 ± 0.2 |
TerraIncognita | 51.4 ± 0.7 | 44.9 ± 0.4 | 50.5 ± 0.6 |
DomainNet | 52.0 ± 0.0 | 47.4 ± 0.1 | 50.2 ± 0.1 |
Citation
If you find our work useful, please consider giving a star :star: and a citation.
@InProceedings{Sultana_2022_ACCV,
author = {Sultana, Maryam and Naseer, Muzammal and Khan, Muhammad Haris and Khan, Salman and Khan, Fahad Shahbaz},
title = {Self-Distilled Vision Transformer for Domain Generalization},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
month = {December},
year = {2022},
pages = {3068-3085}
}
Contents
- Highlights
- Installation
- Datasets
- Training Self-Distilled Vision Transformer
- Pretrained Models
- Evaluating for Domain Generalization
- Attention Visualizations
Highlights
- Inspired by the modular architecture of ViTs, we propose a light-weight plug-and-play DG approach for ViTs, namely self-distillation for ViT (SDViT). It explicitly encourages the model towards learning generalizable, comprehensive features.
- We show that improving the intermediate blocks, which are essentially multiple feature pathways, through soft supervision from the final classifier facilitates the model towards learning cross-domain generalizable features. Our approach naturally fits into the modular and compositional architecture of different ViTs and does not introduce any new parameters. As such, it adds only a minimal training overhead over the baseline. A minimal sketch of this idea is shown below.
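The following is a minimal, self-contained PyTorch sketch of the self-distillation idea, not the repository's implementation: the class token from every transformer block is kept, the shared classifier head also scores the token of a randomly chosen intermediate block, and a softened KL term distills the final prediction (a non-zero entropy target) into that block. All names and the loss weighting (`ToyViT`, `sd_loss`, `temperature`, `alpha`) are illustrative assumptions.

```python
# Minimal PyTorch sketch of ViT self-distillation (illustrative only; names and
# the exact loss weighting are assumptions, not the repository's implementation).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyViT(nn.Module):
    """Tiny ViT-like encoder: patch embedding, a stack of transformer blocks,
    a class token, and a single shared classifier head (no extra parameters)."""
    def __init__(self, img_size=32, patch=8, dim=64, depth=6, heads=4, num_classes=7):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=dim * 4,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)  # shared head reused by all blocks

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos_embed
        cls_per_block = []
        for blk in self.blocks:
            z = blk(z)
            cls_per_block.append(self.norm(z)[:, 0])  # class token after each block
        return cls_per_block

def sd_loss(model, x, y, temperature=3.0, alpha=0.1):
    """Cross-entropy on the final block plus a soft KL term that distills the
    final prediction into one randomly sampled intermediate block."""
    cls_per_block = model(x)
    final_logits = model.head(cls_per_block[-1])
    ce = F.cross_entropy(final_logits, y)

    k = random.randrange(len(cls_per_block) - 1)   # pick a random intermediate block
    inter_logits = model.head(cls_per_block[k])    # reuse the same classifier head
    teacher = F.softmax(final_logits.detach() / temperature, dim=-1)
    student = F.log_softmax(inter_logits / temperature, dim=-1)
    kl = F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2
    return ce + alpha * kl

# usage: loss = sd_loss(ToyViT(), torch.randn(4, 3, 32, 32), torch.randint(0, 7, (4,)))
```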
Installation
To create the conda environment, run the following command in your terminal:
conda env create -n ViT_DGbed --file ViT_DGbed.yml
Activate the conda environment:
conda activate ViT_DGbed
Datasets
python3 -m domainbed.scripts.download \
--data_dir=./domainbed/data --dataset pacs
Note: to download other datasets, replace --dataset pacs with the corresponding name (e.g., vlcs, office_home, terra_incognita, domainnet).
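To fetch all benchmarks in one go, a small Python loop over the same script works; the dataset names below simply mirror the note above and are otherwise an assumption about what the script accepts.

```python
# Convenience sketch: download every benchmark by looping over the command above.
# Assumes the same working directory and the dataset names listed in the note.
import subprocess

for name in ["pacs", "vlcs", "office_home", "terra_incognita", "domainnet"]:
    subprocess.run(
        ["python3", "-m", "domainbed.scripts.download",
         "--data_dir=./domainbed/data", "--dataset", name],
        check=True,
    )
```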
Model selection criteria
We computed results using the following model selection method (see the sketch below):
- `IIDAccuracySelectionMethod`: a random subset from the input data of the training source domains.
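Concretely, this criterion picks the model whose validation accuracy on held-out splits of the training source domains is highest; the sketch below illustrates the rule with a hypothetical record format (DomainBed's actual result files are structured differently).

```python
# Sketch of IID-accuracy model selection (record fields are hypothetical
# placeholders, not DomainBed's actual result format).
def select_best(records):
    """Pick the checkpoint with the highest average validation accuracy,
    measured on random held-out subsets of the *source* domains only."""
    def iid_score(record):
        accs = record["source_val_accs"]  # e.g. {"art_painting": 0.91, "cartoon": 0.88}
        return sum(accs.values()) / len(accs)
    return max(records, key=iid_score)

# usage: best = select_best(checkpoint_records); report best["target_test_acc"]
```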
Training Self-Distilled Vision Transformer
- Step 1: Download the ImageNet pre-trained models, such as CvT-21 and T2T-ViT-14
- Step 2: Place the models in the path ./domainbed/pretrained_models/Model_name/
- Step 3: Run the following commands:
Launching a sweep on ViT Baselines:
./Baseline_sweep.sh
Launching a sweep on SDViT Model:
./Grid_Search_sweep.sh
Note: For all of the above commands, change --dataset PACS to train on other datasets such as OfficeHome, VLCS, TerraIncognita, and DomainNet, and change the backbone to CVTSmall or T2T14.
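For a quick single-run sanity check before launching a full sweep, DomainBed's standard training entry point can be called directly. The sketch below assumes this repository keeps the upstream `domainbed.scripts.train` interface and its `ERM` algorithm name; the SDViT-specific algorithm and backbone options are configured through the sweep scripts above, so treat these flags as illustrative.

```python
# Illustrative single-run launch via DomainBed's standard train script
# (assumes this fork keeps the upstream interface; flags may differ here).
import subprocess

subprocess.run(
    ["python3", "-m", "domainbed.scripts.train",
     "--data_dir=./domainbed/data",
     "--dataset", "PACS",
     "--algorithm", "ERM",       # upstream baseline; the SDViT algorithm name is repo-specific
     "--test_envs", "0",         # hold out the first domain as the target
     "--output_dir=./train_output"],
    check=True,
)
```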
Pretrained Models
Pretrained ViT models:
Dataset | Baseline (ERM-ViT) | Ours (ERM-SDViT) |
---|---|---|
PACS | Link | Link |
VLCS | Link | Link |
OfficeHome | Link | Link |
TerraIncognita | Link | Link |
DomainNet | Link | Link |
Evaluating for Domain Generalization
To view the results using our pre-trained models:
- Step 1: Download the pre-trained models using the links in the table above and place them dataset-wise under the folder `Results`
- Step 2: Run the following command to get outputs
python -m domainbed.scripts.collect_results \
--input_dir=/Results/Dataset/Model/Backbone/ --get_recursively True
Note: Replace Dataset, Model, and Backbone in the path with the desired names (e.g., Results/PACS/ERM-ViT/DeiT-Small/) to view results for the various models. The Test-Time Classifier Adjuster (T3A) is exploited in our proposed method as a complementary approach; for details, please refer to the following instructions: T3A
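To collect results for several dataset/model/backbone combinations at once, the command above can be looped over from Python; the folder names below mirror the example path in the note and the table above, and are otherwise assumptions about how the downloaded results are organized.

```python
# Convenience sketch: run collect_results over multiple result folders.
# Folder names mirror the README's example path and are assumptions otherwise.
import subprocess

runs = [
    ("PACS", "ERM-ViT", "DeiT-Small"),
    ("PACS", "ERM-SDViT", "DeiT-Small"),
]
for dataset, model, backbone in runs:
    subprocess.run(
        ["python", "-m", "domainbed.scripts.collect_results",
         f"--input_dir=/Results/{dataset}/{model}/{backbone}/",
         "--get_recursively", "True"],
        check=True,
    )
```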
Results:
- Accuracy on three backbone networks using the PACS dataset.
- Accuracy on three backbone networks using five benchmark datasets, in comparison with DG SOTA.
Attention Visualizations
<p align="center"> <img src="https://github.com/maryam089/SDViT/blob/main/Figures/PACS_git.png" > </p>

Comparison of attention maps between the baseline ERM-ViT and our proposed ERM-SDViT (backbone: DeiT-Small) on four target domains of the PACS dataset.

<p align="center"> <img src="https://github.com/maryam089/SDViT/blob/main/Figures/Attentions_VLCS_OH.png" > </p>

Comparison of attention maps between the baseline ERM-ViT and our proposed ERM-SDViT (backbone: DeiT-Small) on four target domains of the VLCS and OfficeHome datasets.

<p align="center"> <img src="https://github.com/maryam089/SDViT/blob/main/Figures/Attentions_DomainNet.png" > </p>

Comparison of attention maps between the baseline ERM-ViT and our proposed ERM-SDViT (backbone: DeiT-Small) on six target domains of the DomainNet dataset.

Acknowledgment
The code is built on top of DomainBed, a PyTorch suite containing benchmark datasets and algorithms for domain generalization, as introduced in In Search of Lost Domain Generalization. The ViT code is based on the T2T, CVT, and DeiT repositories and the TIMM library. We thank the authors for releasing their code.
License
This source code is released under the MIT license, included here.