

Exploring Efficient Few-shot Adaptation for Vision Transformers

This is the official PyTorch implementation of eTT. Our paper is available at link.

Data Preparation

This repo adopts the same data structure as TSA. We simply quote the original data preparation instructions here and thank the authors of TSA for their contribution.

Training & Inference

To run this code, you need a DINO-pretrained network weight. We recommend re-running the original DINO with the meta-train set of ImageNet as training data. To do this, clone the DINO repo, copy all files in pretrain_code_snippet into the DINO folder, and run the training script. In detail:

git clone https://github.com/facebookresearch/dino.git

cp -rf pretrain_code_snippet/* dino/

python -m torch.distributed.launch --nproc_per_node=8 main_dino_metadataset.py --arch {ARCH} --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

This is the main setting in our paper. Technically, you can also use the 1000-class DINO weight provided in the original DINO repo for the experiments.
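The pretraining steps above can be collected into one script. The following is only a minimal sketch, not part of this repo: the variable names, the `run` helper, and the `DRY_RUN` switch are hypothetical, `vit_small` is assumed to be the `--arch` value for ViT-small (as in the original DINO), and the `cd dino` step assumes the training script is launched from inside the DINO folder. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed defaults; override them from the environment.
ARCH="${ARCH:-vit_small}"                          # assumption: ViT-small arch name
DATA_PATH="${DATA_PATH:-/path/to/imagenet/train}"  # placeholder path from the README
OUTPUT_DIR="${OUTPUT_DIR:-/path/to/saving_dir}"    # placeholder path from the README
NPROC="${NPROC:-8}"                                # processes (GPUs) per node
DRY_RUN="${DRY_RUN:-1}"                            # 1 = print commands only

# Hypothetical helper: print a command, and execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

run git clone https://github.com/facebookresearch/dino.git
run cp -rf pretrain_code_snippet/* dino/
run cd dino   # assumption: the script is launched from the DINO folder
run python -m torch.distributed.launch --nproc_per_node="$NPROC" \
  main_dino_metadataset.py --arch "$ARCH" \
  --data_path "$DATA_PATH" --output_dir "$OUTPUT_DIR"
```

Set `DRY_RUN=0` to actually execute the commands once the printed ones look right for your setup.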

After getting the pretrained weight, you can run meta-testing as follows:

python test_extractor_pa_vit_prefix.py --data.test ilsvrc_2012 omniglot aircraft cu_birds dtd quickdraw fungi vgg_flower traffic_sign mscoco --model.ckpt {WEIGHT PATH}
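If you launch meta-testing often, it may help to keep the dataset list in one place and check the checkpoint before starting. The sketch below is a hypothetical wrapper, not part of this repo: only the script name and the two flags come from the command above, while `CKPT`, `DATASETS`, `CMD`, and `DRY_RUN` are illustrative names. It assumes `--data.test` can be given a shorter list if you edit the array, which we have not verified here.

```shell
#!/usr/bin/env bash
set -euo pipefail

CKPT="${CKPT:-/path/to/weight}"   # placeholder; point this at your pretrained weight
DRY_RUN="${DRY_RUN:-1}"           # 1 = print the command only

# The ten test datasets used in the command above.
DATASETS=(ilsvrc_2012 omniglot aircraft cu_birds dtd quickdraw fungi vgg_flower traffic_sign mscoco)

# Build the command as an array so every argument stays a separate word.
CMD=(python test_extractor_pa_vit_prefix.py --data.test "${DATASETS[@]}" --model.ckpt "$CKPT")

echo "+ ${CMD[*]}"
if [ "$DRY_RUN" != "1" ]; then
  # Fail early if the checkpoint file is missing.
  [ -f "$CKPT" ] || { echo "checkpoint not found: $CKPT" >&2; exit 1; }
  "${CMD[@]}"
fi
```

Run it with `DRY_RUN=0 CKPT=/your/weight` to execute the printed command.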

The code adopts ViT-small as the default backbone; the structure can be modified according to your requirements.


Our code is modified from TSA.