Semi-ViT: Semi-supervised Vision Transformers at Scale

This is a PyTorch implementation of the paper Semi-ViT, a state-of-the-art semi-supervised learning method for vision transformers.

If you use the code/models/results of this repository, please cite:

@inproceedings{cai2022semi,
  author  = {Zhaowei Cai and Avinash Ravichandran and Paolo Favaro and Manchen Wang and Davide Modolo and Rahul Bhotika and Zhuowen Tu and Stefano Soatto},
  title   = {Semi-supervised Vision Transformers at Scale},
  booktitle = {NeurIPS},
  year  = {2022}
}

Install

First, install PyTorch and torchvision. We have tested with version 1.7.1, but newer versions should also work.

$ conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch

Also install other dependencies, e.g.,

$ pip install timm==0.4.5
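
To quickly verify the environment, a check like the following can be used:

import torch
import torchvision
import timm

# Versions this repository was tested with (newer ones should also work).
print(torch.__version__)          # expected: 1.7.1
print(torchvision.__version__)    # expected: 0.8.2
print(timm.__version__)           # expected: 0.4.5
print(torch.cuda.is_available())  # True if CUDA is set up correctly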

Data Preparation

Assume the ImageNet folder is ~/data/imagenet/. Prepare the ImageNet dataset following the official PyTorch ImageNet training code, using the standard folder structure expected by torchvision's datasets.ImageFolder. Please also download the ImageNet index files for the semi-supervised learning experiments. The file structure should look like:

$ tree data
imagenet
├── train
│   ├── class1
│   │   └── *.jpeg
│   ├── class2
│   │   └── *.jpeg
│   └── ...
├── val
│   ├── class1
│   │   └── *.jpeg
│   ├── class2
│   │   └── *.jpeg
│   └── ...
└── indexes
    └── *_index.csv
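
For illustration, a labeled subset can be built from an index file roughly as follows. This is only a sketch: the file name train_10p_index.csv and the assumption that each CSV row holds the integer index of one labeled training sample are placeholders; check the downloaded index files for the actual naming and layout.

import csv
import os

from torch.utils.data import Subset
from torchvision import datasets, transforms

root = os.path.expanduser("~/data/imagenet")

# Standard ImageFolder over the full training set.
train_set = datasets.ImageFolder(
    os.path.join(root, "train"),
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

# Hypothetical index file: one integer sample index per row (an assumption).
with open(os.path.join(root, "indexes", "train_10p_index.csv")) as f:
    labeled_indices = [int(row[0]) for row in csv.reader(f)]

labeled_set = Subset(train_set, labeled_indices)
print(len(labeled_set), "labeled images")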

Please also download the MAE self-pretrained weights and move them to the pretrain_weights folder.
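
For reference, loading such a checkpoint into a timm ViT might look like the sketch below. The file name and the 'model' checkpoint key follow the public MAE release but should be treated as assumptions; strict=False is used because the MAE checkpoint carries no classification head, which is newly initialized for finetuning.

import timm
import torch

# Load the MAE self-pretrained backbone weights (file name is an assumption).
ckpt = torch.load("pretrain_weights/mae_pretrain_vit_base.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # public MAE checkpoints store weights under 'model'

model = timm.create_model("vit_base_patch16_224", num_classes=1000)
# strict=False: head.weight/head.bias are missing from the checkpoint and
# stay randomly initialized, as expected before finetuning.
msg = model.load_state_dict(state_dict, strict=False)
print(msg.missing_keys)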

Supervised Finetuning

The supervised finetuning instructions are in FINETUNE.md.

Semi-supervised Finetuning

The semi-supervised finetuning instructions are in SEMIVIT.md.
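
At a high level, the semi-supervised stage follows an EMA-teacher pseudo-labeling scheme: a teacher model, maintained as an exponential moving average of the student, pseudo-labels unlabeled images, and only confident predictions contribute to the loss. The toy step below only sketches that general idea and is not this repository's implementation; the names, the 0.7 threshold, and the 0.999 momentum are illustrative assumptions. See SEMIVIT.md for the actual recipe.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student's.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def semi_step(student, teacher, optimizer, x_l, y_l, x_u, threshold=0.7):
    # Supervised loss on the labeled batch.
    loss = F.cross_entropy(student(x_l), y_l)

    # Teacher pseudo-labels the unlabeled batch; keep confident predictions only.
    with torch.no_grad():
        probs = F.softmax(teacher(x_u), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = conf >= threshold
    if mask.any():
        loss = loss + F.cross_entropy(student(x_u[mask]), pseudo[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()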

Results

If the model is self-pretrained, the results should be close to the following top-1 accuracies on ImageNet with 1%, 10%, or 100% of the labels (with some minor variance):

model      method    1% labels  10% labels  100% labels
ViT-Base   Finetune  57.4       73.7        83.7
ViT-Base   Semi-ViT  71.0       79.7        -
ViT-Large  Finetune  67.1       79.2        86.0
ViT-Large  Semi-ViT  77.3       83.3        -
ViT-Huge   Finetune  71.5       81.4        86.9
ViT-Huge   Semi-ViT  80.0       84.3        -

If the model is not self-pretrained, the results should be close to the following top-1 accuracies with 10% of the ImageNet labels (with some minor variance):

model           method    10% labels
ViT-Small       Finetune  56.2
ViT-Small       Semi-ViT  70.9
ViT-Base        Finetune  57.0
ViT-Base        Semi-ViT  73.5
ConvNeXT-Tiny   Finetune  61.2
ConvNeXT-Tiny   Semi-ViT  74.1
ConvNeXT-Small  Finetune  64.1
ConvNeXT-Small  Semi-ViT  75.1

License

This project is under the Apache-2.0 license. See LICENSE for details.