Self-Distilled Vision Transformer for Domain Generalization (ACCV'22 -- Oral)

Maryam Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan, and Fahad Shahbaz Khan

Paper | arXiv | Poster | Slides | Video

Abstract: In the recent past, several domain generalization (DG) methods have been proposed, showing encouraging performance; however, almost all of them build on convolutional neural networks (CNNs). There has been little to no progress in studying the DG performance of vision transformers (ViTs), which are challenging the supremacy of CNNs on standard benchmarks that are often built on the i.i.d. assumption. This renders the real-world deployment of ViTs doubtful. In this paper, we attempt to explore ViTs towards addressing the DG problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios, and the main culprit is overfitting to source domains. Inspired by the modular architecture of ViTs, we propose a simple DG approach for ViTs, coined self-distillation for ViTs. It reduces overfitting to source domains by easing the learning of the input-output mapping problem through curating non-zero entropy supervisory signals for intermediate transformer blocks. Further, it does not introduce any new parameters and can be seamlessly plugged into the modular composition of different ViTs. We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets. Moreover, we report favorable performance against recent state-of-the-art DG methods. Our code, along with pre-trained models, is made publicly available.

State-of-the-Art Vision Transformers for Domain Generalization

PACS | VLCS | OfficeHome | TerraIncognita | DomainNet

Citation

If you find our work useful, please consider giving a star :star: and a citation.

@InProceedings{Sultana_2022_ACCV,
    author    = {Sultana, Maryam and Naseer, Muzammal and Khan, Muhammad Haris and Khan, Salman and Khan, Fahad Shahbaz},
    title     = {Self-Distilled Vision Transformer for Domain Generalization},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {3068-3085}
}

Contents

  1. Highlights
  2. Installation
  3. Datasets
  4. Training Self-Distilled Vision Transformer
  5. Pretrained Models
  6. Evaluating for Domain Generalization
  7. Attention Visualizations

Highlights

<p align="center"> <img src="https://github.com/maryam089/SDViT/blob/main/Figures/blockwise_accuracy_git.png" > </p> In the figure above, we plot the block-wise accuracy of the baseline (ERM-ViT) and our method (ERM-SDViT). Random sub-model distillation improves the accuracy of all blocks, and the improvement is particularly pronounced for the earlier blocks. Besides the later blocks, it also encourages the earlier blocks to learn representations that are both transferable and discriminative. Since these earlier blocks provide multiple discriminative feature pathways, we believe they better steer the overall model towards capturing the semantics of the object class.
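The distillation objective described above routes the class token of a randomly sampled intermediate block through the final classifier and matches its prediction to the soft (non-zero entropy) prediction of the full model. Below is a minimal PyTorch-style sketch of this idea, not the repository's exact implementation: it assumes a timm/DeiT-style ViT exposing patch_embed, cls_token, pos_embed, blocks, norm, and head, and the temperature and loss-weight values are illustrative.

import random

import torch
import torch.nn.functional as F

def self_distillation_loss(vit, images, labels, temperature=3.0, lam=0.1):
    """Sketch of the self-distillation objective for ViTs.

    A randomly chosen intermediate block's class token is passed through the
    final norm and head to form a sub-model; its prediction is pulled towards
    the full model's softened prediction via a KL term, on top of the usual
    cross-entropy (ERM) loss on the final output.
    """
    # Patch embedding + class token + position embedding (timm/DeiT-style).
    x = vit.patch_embed(images)
    cls = vit.cls_token.expand(x.shape[0], -1, -1)
    x = torch.cat((cls, x), dim=1) + vit.pos_embed

    # Run all transformer blocks, keeping each block's class token.
    cls_tokens = []
    for blk in vit.blocks:
        x = blk(x)
        cls_tokens.append(x[:, 0])

    # Final prediction and the standard ERM (cross-entropy) term.
    logits_final = vit.head(vit.norm(x)[:, 0])
    ce = F.cross_entropy(logits_final, labels)

    # Random sub-model: reuse the final norm and head on an earlier block.
    idx = random.randrange(len(vit.blocks) - 1)   # exclude the last block
    logits_sub = vit.head(vit.norm(cls_tokens[idx]))

    # Distill the full model's soft, non-zero-entropy output into the
    # sub-model (stop-gradient on the teacher side is one design choice).
    kd = F.kl_div(
        F.log_softmax(logits_sub / temperature, dim=-1),
        F.softmax(logits_final.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return ce + lam * kd

Because the final norm and head are simply reused for the sub-model, no new parameters are introduced, which is consistent with the plug-and-play claim in the abstract.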

Installation

To create the conda environment, run the following command in your terminal:

conda env create -n ViT_DGbed --file ViT_DGbed.yml

Activate the conda environment:

conda activate ViT_DGbed

Datasets

python3 -m domainbed.scripts.download \
       --data_dir=./domainbed/data --dataset pacs

Note: to download other datasets, replace --dataset pacs with the desired dataset name (e.g., vlcs, office_home, terra_incognita, domainnet). A small convenience loop for fetching all of them is sketched below.
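If you want all five benchmarks in one go, the following sketch simply loops the same download command shown above over the dataset names; it adds nothing beyond the per-dataset command.

import subprocess

# Dataset identifiers accepted by the download script, as listed in the note above.
DATASETS = ["pacs", "vlcs", "office_home", "terra_incognita", "domainnet"]

for name in DATASETS:
    # Same command as above, run once per dataset.
    subprocess.run(
        [
            "python3", "-m", "domainbed.scripts.download",
            "--data_dir=./domainbed/data",
            "--dataset", name,
        ],
        check=True,
    )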

Model selection criteria

We computed results using the following model selection criterion:

Training Self-Distilled Vision Transformer

Launching a sweep on ViT Baselines:

./Baseline_sweep.sh

Launching a sweep on SDViT Model:

./Grid_Search_sweep.sh

Note: For all of the above commands, change --dataset PACS to train on other datasets such as OfficeHome, VLCS, TerraIncognita, and DomainNet, and change the backbone to CVTSmall or T2T14.

Pretrained Models

Pretrained ViT models:

| Dataset | Baseline (ERM-ViT) | Ours (ERM-SDViT) |
| --- | --- | --- |
| PACS | Link | Link |
| VLCS | Link | Link |
| OfficeHome | Link | Link |
| TerraIncognita | Link | Link |
| DomainNet | Link | Link |

Evaluating for Domain Generalization

To view the results using our pre-trained models:

python -m domainbed.scripts.collect_results \
       --input_dir=/Results/Dataset/Model/Backbone/ --get_recursively True

Note: Replace the path components with the desired dataset, model, and backbone names (e.g., Results/PACS/ERM-ViT/DeiT-Small/ and so on) to view results for various models. The Test-Time Classifier Adjuster (T3A) is used in our proposed method as a complementary approach; for details, please refer to the T3A instructions. A simplified sketch of the T3A idea is given below.
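For readers unfamiliar with T3A, the following is a heavily simplified sketch of the idea only, not the authors' implementation (the filter size and the seeding/normalization details are illustrative assumptions): at test time, the linear classifier is replaced by class prototypes built from low-entropy, pseudo-labeled test features.

import torch
import torch.nn.functional as F

class T3ASketch:
    """Simplified sketch of the Test-Time Classifier Adjuster (T3A) idea.

    Per class, keep a small support set of pseudo-labeled test features with
    the lowest prediction entropy, and classify new features by similarity to
    the support-set centroids.
    """

    def __init__(self, classifier_weight, filter_size=20):
        # classifier_weight: (num_classes, feat_dim) rows of the trained linear head.
        self.filter_size = filter_size
        # Seed each class's support set with its normalized weight row; the seed
        # gets zero entropy here so it is never filtered out (a simplification).
        self.supports = [[F.normalize(w, dim=0)] for w in classifier_weight]
        self.entropies = [[classifier_weight.new_zeros(())] for _ in classifier_weight]

    @torch.no_grad()
    def __call__(self, feats):
        feats = F.normalize(feats, dim=1)                     # (batch, feat_dim)
        protos = torch.stack([torch.stack(s).mean(0) for s in self.supports])
        logits = feats @ protos.t()                           # similarity to class centroids

        # Grow the support sets with pseudo-labeled test features and keep
        # only the lowest-entropy entries per class.
        probs = logits.softmax(dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        for f, e, c in zip(feats, entropy, probs.argmax(dim=1).tolist()):
            self.supports[c].append(f)
            self.entropies[c].append(e)
            keep = torch.stack(self.entropies[c]).argsort()[: self.filter_size].tolist()
            self.supports[c] = [self.supports[c][i] for i in keep]
            self.entropies[c] = [self.entropies[c][i] for i in keep]

        return logits

In this repository, T3A is applied at evaluation time on top of the trained models; for the actual procedure and hyperparameters, follow the T3A instructions linked above.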

Results:

  1. Accuracy on three Backbone Networks using the PACS dataset.
  2. Accuracy on three Backbone Networks using five benchmark datasets, in comparison with DG SOTA.

Attention Visualizations

<p align="center"> <img src="https://github.com/maryam089/SDViT/blob/main/Figures/PACS_git.png" > </p> Comparison of attention maps between the baseline ERM-ViT and our proposed ERM-SDViT (backbone: DeiT-Small) on four target domains of the PACS dataset. <p align="center"> <img src="https://github.com/maryam089/SDViT/blob/main/Figures/Attentions_VLCS_OH.png" > </p> Comparison of attention maps between the baseline ERM-ViT and our proposed ERM-SDViT (backbone: DeiT-Small) on four target domains of the VLCS and OfficeHome datasets. <p align="center"> <img src="https://github.com/maryam089/SDViT/blob/main/Figures/Attentions_DomainNet.png" > </p> Comparison of attention maps between the baseline ERM-ViT and our proposed ERM-SDViT (backbone: DeiT-Small) on six target domains of the DomainNet dataset.

Acknowledgment

The code is built on top of DomainBed: a PyTorch suite containing benchmark datasets and algorithms for domain generalization, introduced in In Search of Lost Domain Generalization. The ViT code is based on the T2T, CVT, and DeiT repositories and the TIMM library. We thank the authors for releasing their code.

License

This source code is released under the MIT license, included here.