

Diversity-Aware Meta Visual Prompting (CVPR 2023)

This repository provides the official PyTorch implementation of the following conference paper:

Diversity-Aware Meta Visual Prompting (CVPR 2023) <br> Qidong Huang<sup>1</sup>, Xiaoyi Dong<sup>1</sup>, Dongdong Chen<sup>2</sup>, Weiming Zhang<sup>1</sup>, Feifei Wang<sup>1</sup>, Gang Hua<sup>3</sup>, Nenghai Yu<sup>1</sup> <br> <sup>1</sup>University of Science and Technology of China, <sup>2</sup>Microsoft Cloud AI, <sup>3</sup>Wormpex AI Research <br>

Environment Setup

This code is tested with Python 3.8, PyTorch 1.11, and CUDA 11.3. To set up a conda environment with the required dependencies, run:

conda env create -f environment.yaml
conda activate dam_vp

Dataset Preparation

The Fine-Grained Visual Classification (FGVC) datasets can be downloaded from the VPT repo. The Fru92 and Veg200 datasets can be downloaded from VegFru. All other datasets are available through torchvision. The VTAB-1k benchmark can be prepared with the following script:

python data_utils/vtab_prep.py

For more tips on downloading VTAB-1k, please refer to VTAB_SETUP.md.

The overall directory structure should be:

│   ├──FGVC/
│   │   ├──CUB_200_2011/
│   │   ├──OxfordFlower/
│   │   ├──Stanford-cars/
│   │   ├──Stanford-dogs/
│   │   ├──nabirds/
│   ├──VTAB/
│   │   ├──.......
│   ├──finegrained_dataset/
│   │   ├──vegfru-dataset/
│   ├──torchvision_dataset/
│   │   ├──.......
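
Before launching any training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of this repository: the helper name `missing_dataset_dirs` is hypothetical, and it assumes the tree sits directly under the path later passed as `--base_dir`. Entries elided with `.......` in the tree are not checked.

```python
from pathlib import Path

# Sub-directories copied from the tree above. Elided entries (".......")
# are intentionally left out because their names are not listed.
EXPECTED_DIRS = [
    "FGVC/CUB_200_2011",
    "FGVC/OxfordFlower",
    "FGVC/Stanford-cars",
    "FGVC/Stanford-dogs",
    "FGVC/nabirds",
    "VTAB",
    "finegrained_dataset/vegfru-dataset",
    "torchvision_dataset",
]


def missing_dataset_dirs(base_dir):
    """Return the expected sub-directories that do not exist under base_dir."""
    base = Path(base_dir)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
```

Calling `missing_dataset_dirs("/your/path/to/dataset/")` returns an empty list when every listed directory exists.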

Pre-trained Model Preparation

The pre-trained vision models we use are detailed in Table 8 of our paper. Their checkpoints can be downloaded here:

<table><tbody>
<!-- START TABLE -->
<!-- TABLE HEADER -->
<tr>
<th valign="bottom">Backbone</th>
<th valign="bottom">Pre-trained Objective</th>
<th valign="bottom">Pre-trained Dataset</th>
<th valign="bottom">Download</th>
<th valign="bottom">md5sum</th>
</tr>
<!-- TABLE BODY -->
<tr><td align="left">ViT-B/16</td> <td align="center">Supervised</td> <td align="center">ImageNet-1k</td> <td align="center"><a href="https://drive.google.com/file/d/1_cunej-ZSB58ngtOW62mh0GxOFoQvnjY/view?usp=sharing">Download</a></td> <td align="center"><tt>-</tt></td> </tr>
<tr><td align="left">ViT-B/16</td> <td align="center">Supervised</td> <td align="center">ImageNet-22k</td> <td align="center"><a href="https://drive.google.com/file/d/1zvIqdml4KVArPuWspoHKU7a6e0uAunF8/view?usp=sharing">Download</a></td> <td align="center"><tt>-</tt></td> </tr>
<tr><td align="left">ViT-B/16</td> <td align="center">CLIP</td> <td align="center">400M Web Data</td> <td align="center"><a href="https://openai.com/research/clip">Download</a></td> <td align="center"><tt>-</tt></td> </tr>
<tr><td align="left">Swin-B</td> <td align="center">Supervised</td> <td align="center">ImageNet-22k</td> <td align="center"><a href="https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224_22k.pth">Download</a></td> <td align="center"><tt>bf9cc1</tt></td> </tr>
<tr><td align="left">ViT-B/16</td> <td align="center">MoCo v3</td> <td align="center">ImageNet-1k</td> <td align="center"><a href="https://drive.google.com/file/d/1w_7CVKKlRq_VT-M6-aYFu1UlrjMxgXGA/view?usp=sharing">Download</a></td> <td align="center"><tt>-</tt></td> </tr>
<tr><td align="left">ResNet-50</td> <td align="center">Supervised</td> <td align="center">ImageNet-1k</td> <td align="center"><a href="https://pytorch.org/vision/stable/models.html">Download</a></td> <td align="center"><tt>-</tt></td> </tr>
</tbody></table>
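
The `md5sum` column can be used to sanity-check a download. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that a short value such as `bf9cc1` is the leading characters of the file's MD5 hex digest, which the table does not state explicitly.

```python
import hashlib


def md5_hexdigest(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_prefix(path, expected_prefix):
    """Return True if the file's MD5 digest starts with expected_prefix.

    Assumption: the table lists only a prefix of the full digest.
    """
    return md5_hexdigest(path).startswith(expected_prefix.lower())
```

For the Swin-B checkpoint, this would be called as `matches_md5_prefix("swin_base_patch4_window7_224_22k.pth", "bf9cc1")`, with the path adjusted to wherever the file was saved.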

Meta Prompt Initialization

The trained meta prompts are available here. You can download these checkpoints directly and store them in ./meta-training/checkpoints/. Alternatively, you can run the meta training of visual prompts yourself with the following commands.

cd meta-training/
# if prompting on vit-b-1k
python main_hf.py --base_dir /your/path/to/dataset/ --pretrained_model vit-b-1k --meta_lr 0.5 --update_lr 0.5 --update_step 4 --meta_step_size 0.5 --test_dataset oxford-flowers
# if prompting on clip-vit-b
python main_clip.py --base_dir /your/path/to/dataset/  --pretrained_model clip-vit-b --meta_lr 1.0 --update_lr 1.0 --update_step 4 --meta_step_size 0.5
cd meta-training/
# if prompting on vit-b-22k
python main_ht.py --base_dir /your/path/to/dataset/ --pretrained_model vit-b-22k --meta_lr 1.0 --update_lr 1.0 --update_step 4 --meta_step_size 0.5 --weight_decay 1e-4  --test_dataset oxford-flowers
# if prompting on swin-b-22k
python main_ht.py --base_dir /your/path/to/dataset/ --pretrained_model swin-b-22k --meta_lr 0.5 --update_lr 0.5 --update_step 4 --meta_step_size 0.5 --weight_decay 1e-4
# if prompting on moco-v3-b-1k
python main_ht.py --base_dir /your/path/to/dataset/ --pretrained_model moco-v3-b-1k --meta_lr 0.5 --update_lr 0.5 --update_step 4 --meta_step_size 0.5 --weight_decay 1e-4
# if prompting on resnet50-1k
python main_ht.py --base_dir /your/path/to/dataset/ --pretrained_model resnet50-1k --meta_lr 0.5 --update_lr 0.5 --update_step 4 --meta_step_size 0.5 --weight_decay 1e-4
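
The per-backbone settings in the commands above can also be collected in one place. The following is a rough sketch rather than repository code: the names `META_TRAIN_CONFIGS`, `COMMON_FLAGS`, and `meta_training_command` are hypothetical, while the scripts and flag values are copied from the commands above. Flag order differs from those commands, on the assumption that the scripts parse named flags in any order.

```python
# Flags shared by every command above.
COMMON_FLAGS = {"update_step": "4", "meta_step_size": "0.5"}

# Script and backbone-specific flags, keyed by the --pretrained_model value.
META_TRAIN_CONFIGS = {
    "vit-b-1k": ("main_hf.py", {"meta_lr": "0.5", "update_lr": "0.5",
                                "test_dataset": "oxford-flowers"}),
    "clip-vit-b": ("main_clip.py", {"meta_lr": "1.0", "update_lr": "1.0"}),
    "vit-b-22k": ("main_ht.py", {"meta_lr": "1.0", "update_lr": "1.0",
                                 "weight_decay": "1e-4",
                                 "test_dataset": "oxford-flowers"}),
    "swin-b-22k": ("main_ht.py", {"meta_lr": "0.5", "update_lr": "0.5",
                                  "weight_decay": "1e-4"}),
    "moco-v3-b-1k": ("main_ht.py", {"meta_lr": "0.5", "update_lr": "0.5",
                                    "weight_decay": "1e-4"}),
    "resnet50-1k": ("main_ht.py", {"meta_lr": "0.5", "update_lr": "0.5",
                                   "weight_decay": "1e-4"}),
}


def meta_training_command(model, base_dir):
    """Build the argument list for meta training the given backbone."""
    script, extra = META_TRAIN_CONFIGS[model]
    flags = {"base_dir": base_dir, "pretrained_model": model,
             **COMMON_FLAGS, **extra}
    argv = ["python", script]
    for name, value in flags.items():
        argv += [f"--{name}", value]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, cwd="meta-training", check=True)` from the repository root.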

Diversity-Aware Prompting

With the meta-trained visual prompts, we can adapt pre-trained vision models to unseen vision datasets. The hyperparameter configurations can be found in Table 13 and Table 14 of our paper.

cd task_adapting/
# if prompting on vit-b-1k
python main.py --base_dir /your/path/to/dataset/ --pretrained_model vit-b-1k --adapt_method prompt_wo_head --test_dataset /select/one/dataset/ --epochs 50 --lr /learning/rate/ --weight_decay /weight/decay/rate/ --checkpoint_dir ../meta-training/checkpoints/vit-b-1k-wo-head.pth
# if prompting on clip-vit-b
python main_clip.py --base_dir /your/path/to/dataset/ --pretrained_model clip-vit-b --adapt_method prompt_wo_head --test_dataset /select/one/dataset/ --epochs 50 --lr /learning/rate/ --weight_decay /weight/decay/rate/ --checkpoint_dir ../meta-training/checkpoints/clip-vit-b-wo-head.pth
cd task_adapting/
# if prompting on vit-b-22k
python main.py --base_dir /your/path/to/dataset/ --pretrained_model vit-b-22k --adapt_method ours_with_head --test_dataset /select/one/dataset/ --epochs 50 --lr /learning/rate/ --weight_decay /weight/decay/rate/ --checkpoint_dir ../meta-training/checkpoints/vit-b-22k-w-head.pth
# if prompting on swin-b-22k
python main.py --base_dir /your/path/to/dataset/ --pretrained_model swin-b-22k --adapt_method ours_with_head --test_dataset /select/one/dataset/ --epochs 50 --lr /learning/rate/ --weight_decay /weight/decay/rate/ --checkpoint_dir ../meta-training/checkpoints/swin-b-22k-w-head.pth
# if prompting on moco-v3-b-1k
python main.py --base_dir /your/path/to/dataset/ --pretrained_model moco-v3-b-1k --adapt_method ours_with_head --test_dataset /select/one/dataset/ --epochs 50 --lr /learning/rate/ --weight_decay /weight/decay/rate/ --checkpoint_dir ../meta-training/checkpoints/moco-v3-b-1k-w-head.pth
# if prompting on resnet50-1k
python main.py --base_dir /your/path/to/dataset/ --pretrained_model resnet50-1k --adapt_method ours_with_head --test_dataset /select/one/dataset/ --epochs 50 --lr /learning/rate/ --weight_decay /weight/decay/rate/ --checkpoint_dir ../meta-training/checkpoints/resnet50-1k-w-head.pth
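
The commands above pair each backbone with a script, an `--adapt_method` value, and a checkpoint file. A small lookup like the one below makes that pairing explicit. It is only a sketch: the function name `adaptation_settings` is hypothetical, and the mapping is inferred from the commands above rather than taken from the repository code. The learning rate and weight decay are left out because they depend on the dataset (see Tables 13 and 14 of the paper).

```python
# Backbones grouped by the --adapt_method used in the commands above.
WITHOUT_HEAD = {"vit-b-1k", "clip-vit-b"}
WITH_HEAD = {"vit-b-22k", "swin-b-22k", "moco-v3-b-1k", "resnet50-1k"}


def adaptation_settings(model):
    """Return the script, adapt method, and checkpoint path for a backbone."""
    if model in WITHOUT_HEAD:
        method, suffix = "prompt_wo_head", "wo-head"
    elif model in WITH_HEAD:
        method, suffix = "ours_with_head", "w-head"
    else:
        raise ValueError(f"unknown backbone: {model}")
    # Only the CLIP backbone uses a dedicated entry script in the commands above.
    script = "main_clip.py" if model == "clip-vit-b" else "main.py"
    return {
        "script": script,
        "adapt_method": method,
        "checkpoint_dir": f"../meta-training/checkpoints/{model}-{suffix}.pth",
    }
```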


This repo is partially based on VP and VPT. Thanks to their authors for the impressive work!


If you find this work useful for your research, please cite our paper:

  title={Diversity-Aware Meta Visual Prompting},
  author={Qidong Huang and Xiaoyi Dong and Dongdong Chen and Weiming Zhang and Feifei Wang and Gang Hua and Nenghai Yu},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},


The code is released under the MIT License (see the LICENSE file for details).