Word and Descriptor Soups 🍜 [CVPR 2024] [ArXiv]


Code in this repo builds on multimodal prompt learning, which in turn builds on Co-CoOp and CoOp.

โณ Installation


# Instructions borrowed from https://github.com/KaiyangZhou/Dassl.pytorch#installation

git clone https://github.com/KaiyangZhou/Dassl.pytorch.git
cd Dassl.pytorch/
pip install -r requirements.txt
python setup.py develop
cd ..

pip install open_clip_torch
pip install pytorch_metric_learning
Download the datasets and arrange them in the following directory structure:

data/
|-- caltech-101
|-- dtd
|-- eurosat
|-- fgvc_aircraft
|-- food-101
|-- imagenet
|-- imagenet-adversarial
|-- imagenet-rendition
|-- imagenet-sketch
|-- imagenetv2
|-- oxford_flowers
|-- oxford_pets
|-- stanford_cars
|-- sun397
|-- ucf101

Alternatively, follow the download instructions here (some dataset links are stale, and you may need to reorganize the directory structure): installing datasets

Modify the following two lines in argparse_parameters.py to reflect where you keep your data/ directory and where you want the pretrained CLIP weights to be cached (the weights can take up many gigabytes):

parser.add_argument('--cache_dir', default="", type=str)  # set to directory where you want large pretrained model weights to be cached
parser.add_argument('--data_dir', default="", type=str)   # set to parent directory of data/
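
To sanity-check the installation and the cache location, you can load the pretrained backbone once up front. A minimal sketch (assuming --cache_dir is the directory handed to open_clip's cache_dir argument; the ViT-B-16 weights alone are a few hundred megabytes):

import open_clip

# Downloads (or reuses) the OpenAI ViT-B-16 weights under the given cache dir.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='openai', cache_dir='/path/to/cache')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
print(model.visual.image_size)  # (224, 224)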

๐Ÿœ Descriptor soups


(1) Generate Description Features

First, calculate the descriptor features on ImageNet using preprocess/generate_description_features.py. This script reads preprocess/descriptions.list, a sorted list of 4,227 unique GPT-generated descriptors; each descriptor begins with a space and ends with a period. The features come from the frozen pretrained CLIP text encoder.

Run: python preprocess/generate_description_features.py --dataset ImageNet

This will save a tuple of (description strings, description features) in cache/description_features__ViT-B-16_openai.tensor
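
For intuition, generating these features amounts to encoding each descriptor, appended to a class prompt, with the frozen CLIP text encoder. A minimal sketch with open_clip; the prompt template, class names, and descriptor strings below are illustrative assumptions, not the script's exact logic:

import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='openai', cache_dir='/path/to/cache')
tokenizer = open_clip.get_tokenizer('ViT-B-16')

# Illustrative stand-ins; the script uses the dataset's class names and
# loads preprocess/descriptions.list itself.
classnames = ['goldfish', 'tabby cat']
descriptions = [' which has fins.', ' which has whiskers.']

features = []
with torch.no_grad():
    for d in descriptions:
        # Hypothetical template: descriptors already start with a space
        # and end with a period, so they concatenate onto the prompt.
        tokens = tokenizer([f'a photo of a {c}{d}' for c in classnames])
        f = model.encode_text(tokens)
        f = f / f.norm(dim=-1, keepdim=True)   # unit-normalize per prompt
        features.append(f.mean(0))             # average over classes

torch.save((descriptions, torch.stack(features)),
           'cache/description_features__ViT-B-16_openai.tensor')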

(2) Calculate greedy descriptor soups

This needs to be done for each random seed of the ImageNet training split!

Run:

python preprocess/get_greedy_descriptor_soup.py --dataset ImageNet --seed 1
python preprocess/get_greedy_descriptor_soup.py --dataset ImageNet --seed 2
python preprocess/get_greedy_descriptor_soup.py --dataset ImageNet --seed 3

This will save the greedily selected descriptors as a list in cache/good_descriptions_seed1__ViT-B-16_openai.list (seed 1 shown; seeds 2 and 3 are saved analogously).

Example logs: example_logs/example_get_greedy_descriptor_soup_output.txt
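
For intuition, the greedy selection in this script is roughly the following procedure (a simplified sketch; accuracy_of is a hypothetical stand-in for the script's evaluation of the averaged text classifier on the few-shot training split):

def greedy_descriptor_soup(descriptors, accuracy_of):
    # Rank descriptors by individual training accuracy, best first.
    ranked = sorted(descriptors, key=lambda d: accuracy_of([d]), reverse=True)
    soup = [ranked[0]]
    best = accuracy_of(soup)
    for d in ranked[1:]:
        trial = accuracy_of(soup + [d])
        if trial > best:
            # Keep d only if the averaged classifier improves with it.
            soup.append(d)
            best = trial
    return soup

# Toy demo with a dummy scoring function that prefers two-member soups:
print(greedy_descriptor_soup([' red.', ' striped.', ' furry.'],
                             lambda ds: -abs(len(ds) - 2)))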

Proceed to the Zero-shot comparisons section for evaluation.

๐Ÿœ Word soups


(1) Get Word Features

preprocess/words.list contains the 10,000 most common English words, minus swear words, each with a space prepended. We can reuse the same preprocess/generate_description_features.py script to generate text features from individual words.

Run: python preprocess/generate_description_features.py --dataset ImageNet --descriptions preprocess/words.list --savename word_features

This will save a tuple of (words, word features) in cache/word_features__ViT-B-16_openai.tensor
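
Either cached file can be inspected directly; a small example, assuming the (strings, features) tuple layout described above:

import torch

words, word_features = torch.load('cache/word_features__ViT-B-16_openai.tensor')
print(len(words), word_features.shape)  # one feature row per word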

(2) Calculate greedy word soups

This needs to be done for each random seed of the ImageNet training split!

Run:

python preprocess/get_greedy_word_soup.py --dataset ImageNet --seed 1 --n_descriptors 8
python preprocess/get_greedy_word_soup.py --dataset ImageNet --seed 2 --n_descriptors 8
python preprocess/get_greedy_word_soup.py --dataset ImageNet --seed 3 --n_descriptors 8

This will save the greedily selected word-soup descriptors as a list in cache/word_soup_descriptors_seed1__ViT-B-16_openai.list (one file per seed).

Example logs: example_logs/example_get_greedy_word_soup_output.txt
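
Unlike descriptor soup, which averages whole descriptors, the word-soup script builds each of the n_descriptors members by chaining words one at a time. Roughly (a simplified sketch; accuracy_of is again a hypothetical stand-in, and the released script differs in details such as candidate sampling and stopping criteria):

import random

def greedy_word_soup(words, accuracy_of, n_descriptors=8,
                     max_words=10, n_candidates=250):
    soup = []
    for _ in range(n_descriptors):
        desc, best = '', accuracy_of('')
        for _ in range(max_words):
            # Score a random subset of candidate words appended to the chain
            # (words in words.list already start with a space).
            cands = random.sample(words, n_candidates)
            word = max(cands, key=lambda w: accuracy_of(desc + w))
            trial = accuracy_of(desc + word)
            if trial <= best:
                break  # stop when no sampled word helps
            desc, best = desc + word, trial
        soup.append(desc)
    return soup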

Proceed to the Zero-shot comparisons section for evaluation.

🧪 Baselines


Results are output in CSV format at the end of each experiment; you can copy and paste them directly into a spreadsheet.

Zero-shot comparisons

For all zero-shot (ZS) methods presented in Table 3 of the paper (OpenAI handcrafted ensemble, GPT, descriptor soup, token offset, word soup), run:

sh scripts/run_pt_eval.sh 0 ViT-B-16 openai 512

Example logs: example_logs/example_run_pt_eval_ViT-B-16_openai_output.txt

For WaffleCLIP with 16 members, run:

sh scripts/waffle_descriptors_eval.sh 16

Example logs: example_logs/example_waffle_descriptors_eval_output.txt
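
For reference, WaffleCLIP (Roth et al., 2023) ensembles prompts with random descriptors instead of learned or GPT-generated ones: members append either a random character sequence or a random word pair. A rough sketch of building 16 such members (the template and sampling details here are assumptions, not the official implementation):

import random, string

def waffle_members(vocab, n_members=16, char_len=10):
    members = []
    for i in range(n_members):
        if i % 2 == 0:
            # random character-sequence member
            suffix = ''.join(random.choices(string.ascii_lowercase, k=char_len))
        else:
            # random word-pair member
            suffix = f'{random.choice(vocab)} {random.choice(vocab)}'
        # e.g. 'a photo of a dog, which has qzjkwmrvtx.'
        members.append(f', which has {suffix}.')
    return members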

Few-shot OOD comparisons

These scripts train on 3 random splits of 16-shot ImageNet-1K. "XD Mean" is the average test accuracy on 10 OOD datasets; "DG Mean" is the average test accuracy on 4 domain-shifted versions of ImageNet. You can verify these results by running the indicated bash script and pasting the CSV-formatted results at the end of the output into a spreadsheet.

| Method | Command to run | XD Mean | DG Mean |
| --- | --- | --- | --- |
| CLIP-adapter | scripts/run_adapter.sh 6e-3 ViT-B-16 512 | 65.02 | 58.12 |
| bitfit | scripts/bitfit.sh 1.25e-4 ViT-B-16 512 | 66.05 | 59.12 |
| Cross Entropy | scripts/run_ce.sh 2e-5 ViT-B-16 512 | 66.80 | 60.39 |
| Cross Entropy + word soup + diversity loss | scripts/run_ce_regularized.sh 0.25 10 | 67.43 | 61.32 |
| ClipOOD | scripts/run_clipood.sh 2e-5 ViT-B-16 512 | 66.50 | 60.47 |
| ClipOOD + word soup + diversity loss | scripts/run_clipood_regularized.sh 0.25 10 | 67.42 | 61.23 |
| CoOp | scripts/run_coop.sh 8e-5 ViT-B-16 512 | 66.52 | 59.25 |
| CoOp + word soup + diversity loss | scripts/run_coop_regularized.sh 0.25 10 | 67.30 | 60.25 |
| KgCoOp | scripts/run_kgcoop.sh 4e-5 ViT-B-16 512 | 66.16 | 58.64 |
| LoRA | scripts/run_lora.sh 1e-5 ViT-B-16 512 | 66.19 | 57.93 |
| MaPLe | scripts/run_maple.sh 0.025 ViT-B-16 512 | 66.44 | 59.32 |
| MaPLe + word soup + diversity loss | scripts/run_maple_regularized.sh | 66.65 | 60.20 |
| ProDA | scripts/run_proda.sh 3.2e-4 ViT-B-16 512 | 66.23 | 58.83 |
| ProGrad | scripts/run_prograd.sh 1.28e-3 ViT-B-16 512 | 66.48 | 58.96 |
| ResBlock-adapter | scripts/run_resblock_adapter.sh 2.5e-3 ViT-B-16 512 | 65.55 | 59.48 |
| SSF | scripts/run_ssf.sh 1e-4 ViT-B-16 512 | 65.86 | 58.44 |
| VPT | scripts/run_vpt_deep.sh 0.8 ViT-B-16 512 | 65.16 | 58.42 |
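
If you prefer not to use a spreadsheet, the pasted CSV block can be averaged in a few lines (a sketch; the file name and column layout below are assumptions, so adjust them to the header the script actually prints):

import pandas as pd

# results.csv: the CSV block copied from the end of a run's output,
# assumed here to have one row per seed and one accuracy column per dataset.
df = pd.read_csv('results.csv')
print(df.mean(numeric_only=True))  # per-dataset means across the 3 seeds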

🧪 More experiments


Base to novel setting

First, generate features for each training dataset:

For descriptor features:

for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101;
do
  python preprocess/generate_description_features.py --dataset $dataset --subsample_classes base
done

For word features:

for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101;
do
  python preprocess/generate_description_features.py --dataset $dataset --descriptions preprocess/words.list --savename word_features --subsample_classes base
done

To get greedy descriptor soup:

for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101;
do
  sh scripts/ablations/run_get_greedy_descriptor_soup.sh $dataset
done

To get greedy word soup:

for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101;
do
  sh scripts/ablations/run_get_greedy_word_soup.sh $dataset
done

Then run training using the provided bash scripts, for example:

sh scripts/run_ce_with_eval.btn.sh 5e-05 > run_ce_with_eval.btn.sh_5e-05.o

See any bash script matching scripts/*.btn.sh.

CoOp soft descriptor ensemble baseline

Run scripts/ablations/coop_soft_descriptor_ensemble.sh, which logs to train_softd.o and outputs a list of 8 soft descriptors.

To evaluate them, see scripts/ablations/run_soft.sh.

More baselines

There are many more baselines in the scripts/ablations folder. Run these at your pleasure.