Semi-ViT: Semi-supervised Vision Transformers at Scale

This is a PyTorch implementation of the paper Semi-ViT, a state-of-the-art semi-supervised learning method for vision transformers.

If you use the code/models/results of this repository, please cite:

@inproceedings{cai2022semi,
  author  = {Zhaowei Cai and Avinash Ravichandran and Paolo Favaro and Manchen Wang and Davide Modolo and Rahul Bhotika and Zhuowen Tu and Stefano Soatto},
  title   = {Semi-supervised Vision Transformers at Scale},
  booktitle = {NeurIPS},
  year  = {2022}
}

Install

First, install PyTorch and torchvision. We have tested with version 1.7.1, but newer versions should also work.

$ conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch

Also install other dependencies, e.g.,

$ pip install timm==0.4.5
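
To quickly verify the environment, a check like the following can be used:

import torch
import torchvision
import timm

# Versions this repository was tested with (newer ones should also work).
print(torch.__version__)          # expected: 1.7.1
print(torchvision.__version__)    # expected: 0.8.2
print(timm.__version__)           # expected: 0.4.5
print(torch.cuda.is_available())  # True if CUDA is set up correctly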

Data Preparation

Assume the ImageNet folder is ~/data/imagenet/. Prepare the ImageNet dataset following the official PyTorch ImageNet training code, using the standard folder structure expected by torchvision's datasets.ImageFolder. Please also download the ImageNet index files for the semi-supervised learning experiments. The file structure should look like:

$ tree data
imagenet
├── train
│   ├── class1
│   │   └── *.jpeg
│   ├── class2
│   │   └── *.jpeg
│   └── ...
├── val
│   ├── class1
│   │   └── *.jpeg
│   ├── class2
│   │   └── *.jpeg
│   └── ...
└── indexes
    └── *_index.csv
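
For illustration, a labeled subset can be built from an index file roughly as follows. This is only a sketch: the file name train_10p_index.csv and the assumption that each CSV row holds the integer index of one labeled training sample are placeholders; check the downloaded index files for the actual naming and layout.

import csv
import os

from torch.utils.data import Subset
from torchvision import datasets, transforms

root = os.path.expanduser("~/data/imagenet")

# Standard ImageFolder over the full training set.
train_set = datasets.ImageFolder(
    os.path.join(root, "train"),
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

# Hypothetical index file: one integer sample index per row (an assumption).
with open(os.path.join(root, "indexes", "train_10p_index.csv")) as f:
    labeled_indices = [int(row[0]) for row in csv.reader(f)]

labeled_set = Subset(train_set, labeled_indices)
print(len(labeled_set), "labeled images")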

Please also download the MAE self-pretrained weights and move them to the pretrain_weights folder.
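
For reference, loading such a checkpoint into a timm ViT might look like the sketch below. The file name and the 'model' checkpoint key follow the public MAE release but should be treated as assumptions; strict=False is used because the MAE checkpoint carries no classification head, which is newly initialized for finetuning.

import timm
import torch

# Load the MAE self-pretrained backbone weights (file name is an assumption).
ckpt = torch.load("pretrain_weights/mae_pretrain_vit_base.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # public MAE checkpoints store weights under 'model'

model = timm.create_model("vit_base_patch16_224", num_classes=1000)
# strict=False: head.weight/head.bias are missing from the checkpoint and
# stay randomly initialized, as expected before finetuning.
msg = model.load_state_dict(state_dict, strict=False)
print(msg.missing_keys)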

Supervised Finetuning

The supervised finetuning instructions are in FINETUNE.md.

Semi-supervised Finetuning

The semi-supervised finetuning instructions are in SEMIVIT.md.
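
At a high level, the semi-supervised stage follows an EMA-teacher pseudo-labeling scheme: a teacher model, maintained as an exponential moving average of the student, pseudo-labels unlabeled images, and only confident predictions contribute to the loss. The toy step below only sketches that general idea and is not this repository's implementation; the names, the 0.7 threshold, and the 0.999 momentum are illustrative assumptions. See SEMIVIT.md for the actual recipe.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student's.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def semi_step(student, teacher, optimizer, x_l, y_l, x_u, threshold=0.7):
    # Supervised loss on the labeled batch.
    loss = F.cross_entropy(student(x_l), y_l)

    # Teacher pseudo-labels the unlabeled batch; keep confident predictions only.
    with torch.no_grad():
        probs = F.softmax(teacher(x_u), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = conf >= threshold
    if mask.any():
        loss = loss + F.cross_entropy(student(x_u[mask]), pseudo[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()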

Results

If the model is self-pretrained, the results should be close to the following top-1 accuracies on ImageNet with 1%, 10%, or 100% of the labels (with some minor variance):

model      method    1% labels  10% labels  100% labels
ViT-Base   Finetune  57.4       73.7        83.7
ViT-Base   Semi-ViT  71.0       79.7        -
ViT-Large  Finetune  67.1       79.2        86.0
ViT-Large  Semi-ViT  77.3       83.3        -
ViT-Huge   Finetune  71.5       81.4        86.9
ViT-Huge   Semi-ViT  80.0       84.3        -

If the model is not self-pretrained, the results should be close to the following top-1 accuracies with 10% of the ImageNet labels (with some minor variance):

model           method    10% labels
ViT-Small       Finetune  56.2
ViT-Small       Semi-ViT  70.9
ViT-Base        Finetune  57.0
ViT-Base        Semi-ViT  73.5
ConvNeXT-Tiny   Finetune  61.2
ConvNeXT-Tiny   Semi-ViT  74.1
ConvNeXT-Small  Finetune  64.1
ConvNeXT-Small  Semi-ViT  75.1

License

This project is under the Apache-2.0 license. See LICENSE for details.