# Word and Descriptor Soups 🍜 [CVPR 2024] [ArXiv]
The code in this repo builds on multimodal prompt learning, which in turn builds on Co-CoOp and CoOp.
## ⏳ Installation
- Install the dassl library and the other requirements (a quick sanity check is shown at the end of this section):

```bash
# Instructions borrowed from https://github.com/KaiyangZhou/Dassl.pytorch#installation
git clone https://github.com/KaiyangZhou/Dassl.pytorch.git
cd Dassl.pytorch/
pip install -r requirements.txt
python setup.py develop
cd ..

pip install open_clip_torch
pip install pytorch_metric_learning
```
- Create a directory somewhere called `data/`. Download all 15 zip files from this shared Google Drive and unzip them into `data/`. The resulting file tree should look like:
```
data/
|-- caltech-101
|-- dtd
|-- eurosat
|-- fgvc_aircraft
|-- food-101
|-- imagenet
|-- imagenet-adversarial
|-- imagenet-rendition
|-- imagenet-sketch
|-- imagenetv2
|-- oxford_flowers
|-- oxford_pets
|-- stanford_cars
|-- sun397
|-- ucf101
```
Alternatively, follow the download instructions here (some dataset links are stale, and you may also need to reorganize the directory structure): installing datasets
- Modify the following two lines in `argparse_parameters.py` to reflect where your `data/` directory lives and where you want the pretrained CLIP weights to be cached (the weights can take up many gigabytes):

```python
parser.add_argument('--cache_dir', default="", type=str)  # set to directory where you want large pretrained model weights to be cached
parser.add_argument('--data_dir', default="", type=str)   # set to parent directory of data/
```
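Optionally, you can sanity-check the installation with a few imports. This snippet is just a convenience, not part of the repo:

```python
# Optional sanity check that the dependencies installed correctly.
import torch
import dassl                      # installed above via `python setup.py develop`
import open_clip
import pytorch_metric_learning

print(torch.__version__, '| CUDA available:', torch.cuda.is_available())
# The (model, pretrained-tag) pair used throughout this README:
print(('ViT-B-16', 'openai') in open_clip.list_pretrained())  # expect: True
```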
## 🍜 Descriptor soups
### (1) Generate Description Features
First, calculate the descriptor features on ImageNet using `preprocess/generate_description_features.py`. This Python file reads from `preprocess/descriptions.list`, a sorted list of 4227 unique GPT descriptors; they begin with a space and end in a period. Currently, we use a pretrained model for these features.

Run:

```bash
python preprocess/generate_description_features.py --dataset ImageNet
```

This will save the tuple of (description strings, description features) in `cache/description_features__ViT-B-16_openai.tensor`.
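If you want to inspect the cached features, the file should be a torch-serialized tuple (an assumption based on the description above):

```python
import torch

# Assumes the cache file is a (description strings, description features)
# tuple written with torch.save, as described above.
descriptions, features = torch.load(
    'cache/description_features__ViT-B-16_openai.tensor', map_location='cpu')
print(len(descriptions))   # expected: 4227
print(features.shape)      # expected: (4227, 512), the ViT-B-16 text embedding width
```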
### (2) Calculate greedy descriptor soups
This needs to be done for each random seed of the ImageNet training split!
Run:

```bash
python preprocess/get_greedy_descriptor_soup.py --dataset ImageNet --seed 1
python preprocess/get_greedy_descriptor_soup.py --dataset ImageNet --seed 2
python preprocess/get_greedy_descriptor_soup.py --dataset ImageNet --seed 3
```
This will save the greedily selected descriptors as a list in `cache/good_descriptions_seed1__ViT-B-16_openai.list` (and similarly for the other seeds).

Example logs: `example_logs/example_get_greedy_descriptor_soup_output.txt`
Proceed to the Zero-shot comparisons section below for evaluation.
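For intuition, the greedy selection follows the model-soups recipe: rank descriptors by individual training accuracy, then keep adding the next-best descriptor whenever the averaged classifier does not get worse. A simplified sketch of the idea (assumed structure, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def greedy_descriptor_soup(candidates, text_feats, image_feats, labels, max_size=16):
    """Simplified greedy-soup sketch.
    candidates: n_desc descriptor strings.
    text_feats: (n_desc, n_classes, dim) class text embeddings per descriptor.
    image_feats: (n_images, dim) normalized image features; labels: (n_images,).
    """
    def accuracy(class_feats):
        class_feats = F.normalize(class_feats, dim=-1)
        preds = (image_feats @ class_feats.T).argmax(dim=1)
        return (preds == labels).float().mean().item()

    # Rank descriptors by individual training accuracy, best first.
    order = sorted(range(len(candidates)),
                   key=lambda i: accuracy(text_feats[i]), reverse=True)
    soup, best = [order[0]], accuracy(text_feats[order[0]])
    for i in order[1:]:
        if len(soup) == max_size:
            break
        # Tentatively add descriptor i; keep it only if the *averaged*
        # classifier is at least as accurate as before.
        acc = accuracy(text_feats[soup + [i]].mean(dim=0))
        if acc >= best:
            soup.append(i)
            best = acc
    return [candidates[i] for i in soup]
```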
## 🍜 Word soups
### (1) Get Word Features
`preprocess/words.list` contains the 10,000 most common English words, minus swear words, each with a space prepended. We can use the same `preprocess/generate_description_features.py` script to generate the text features from individual words.

Run:

```bash
python preprocess/generate_description_features.py --dataset ImageNet --descriptions preprocess/words.list --savename word_features
```

This will save the tuple of (words, word features) in `cache/word_features__ViT-B-16_openai.tensor`.
### (2) Calculate greedy word soups
This needs to be done for each random seed of the ImageNet training split!
Run:

```bash
python preprocess/get_greedy_word_soup.py --dataset ImageNet --seed 1 --n_descriptors 8
python preprocess/get_greedy_word_soup.py --dataset ImageNet --seed 2 --n_descriptors 8
python preprocess/get_greedy_word_soup.py --dataset ImageNet --seed 3 --n_descriptors 8
```
This will save the greedily selected descriptors as a list in `cache/word_soup_descriptors_seed1__ViT-B-16_openai.list` (and similarly for the other seeds).

Example logs: `example_logs/example_get_greedy_word_soup_output.txt`

Proceed to the Zero-shot comparisons section below for evaluation.
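Unlike descriptor soup, which selects whole GPT descriptors, word soup builds each descriptor word by word and repeats the process `--n_descriptors` times to obtain an ensemble of chains. A simplified sketch of growing one chain, where `score_fn` is a hypothetical helper that scores a candidate suffix on the training split:

```python
def greedy_word_chain(words, score_fn, chain_len=10):
    """Grow one word-soup descriptor greedily (simplified sketch).
    words: candidate words, each with a leading space (as in words.list).
    score_fn: hypothetical helper returning training accuracy when the
    given suffix is appended to every class prompt.
    The repo's exact selection and stopping rules may differ."""
    chain = ""
    for _ in range(chain_len):
        # Extend the chain with whichever word scores best.
        chain = max((chain + w for w in words), key=score_fn)
    return chain
```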
## 🧪 Baselines
Results are output in CSV format at the end of each experiment, so you can copy and paste them directly into a spreadsheet.
### Zero-shot comparisons
For all ZS methods presented in Table 3 of the paper (OpenAI handcrafted ensemble, GPT, descriptor soup, token offset, word soup), run:

```bash
sh scripts/run_pt_eval.sh 0 ViT-B-16 openai 512
```

Example logs: `example_logs/example_run_pt_eval_ViT-B-16_openai_output.txt`
For WaffleCLIP with 16 members, run:

```bash
sh scripts/waffle_descriptors_eval.sh 16
```

Example logs: `example_logs/example_waffle_descriptors_eval_output.txt`
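For reference, WaffleCLIP ensembles prompts whose descriptors are random tokens rather than selected or learned ones. A toy illustration of the idea (a hypothetical helper, not the script's actual implementation):

```python
import random
import string

def waffle_descriptors(n_members=16, n_tokens=2, token_len=4):
    """Toy illustration of WaffleCLIP-style random descriptors: each
    ensemble member appends a few random character tokens to the class
    prompt. Hypothetical helper, not the repo's implementation."""
    descs = []
    for _ in range(n_members):
        tokens = [''.join(random.choices(string.ascii_lowercase, k=token_len))
                  for _ in range(n_tokens)]
        descs.append(' ' + ' '.join(tokens) + '.')
    return descs

print(waffle_descriptors(n_members=2))
```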
### Few-shot OOD comparisons
These scripts train on 3 random splits of 16-shot ImageNet-1K. "XD Mean" stands for average test accuracy on 10 OOD datasets; "DG Mean" stands for average test accuracy on 4 domain-shifted versions of ImageNet (see the sketch after the table for how these means are computed). You can verify these results by running the indicated bash script and pasting the CSV-formatted results at the end of the output into a spreadsheet.
| Method | Command to run | XD Mean | DG Mean |
|---|---|---|---|
| CLIP-adapter | `scripts/run_adapter.sh 6e-3 ViT-B-16 512` | 65.02 | 58.12 |
| bitfit | `scripts/bitfit.sh 1.25e-4 ViT-B-16 512` | 66.05 | 59.12 |
| Cross Entropy | `scripts/run_ce.sh 2e-5 ViT-B-16 512` | 66.80 | 60.39 |
| Cross Entropy + word soup + diversity loss | `scripts/run_ce_regularized.sh 0.25 10` | 67.43 | 61.32 |
| ClipOOD | `scripts/run_clipood.sh 2e-5 ViT-B-16 512` | 66.50 | 60.47 |
| ClipOOD + word soup + diversity loss | `scripts/run_clipood_regularized.sh 0.25 10` | 67.42 | 61.23 |
| CoOp | `scripts/run_coop.sh 8e-5 ViT-B-16 512` | 66.52 | 59.25 |
| CoOp + word soup + diversity loss | `scripts/run_coop_regularized.sh 0.25 10` | 67.30 | 60.25 |
| KgCoOp | `scripts/run_kgcoop.sh 4e-5 ViT-B-16 512` | 66.16 | 58.64 |
| LoRA | `scripts/run_lora.sh 1e-5 ViT-B-16 512` | 66.19 | 57.93 |
| MaPLe | `scripts/run_maple.sh 0.025 ViT-B-16 512` | 66.44 | 59.32 |
| MaPLe + word soup + diversity loss | `scripts/run_maple_regularized.sh` | 66.65 | 60.20 |
| ProDA | `scripts/run_proda.sh 3.2e-4 ViT-B-16 512` | 66.23 | 58.83 |
| ProGrad | `scripts/run_prograd.sh 1.28e-3 ViT-B-16 512` | 66.48 | 58.96 |
| ResBlock-adapter | `scripts/run_resblock_adapter.sh 2.5e-3 ViT-B-16 512` | 65.55 | 59.48 |
| SSF | `scripts/run_ssf.sh 1e-4 ViT-B-16 512` | 65.86 | 58.44 |
| VPT | `scripts/run_vpt_deep.sh 0.8 ViT-B-16 512` | 65.16 | 58.42 |
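Both summary columns are plain averages of the per-dataset test accuracies in the CSV output: XD Mean over the 10 cross-dataset evaluation sets, and DG Mean over the 4 ImageNet variants. To compute them programmatically, a sketch (the file name and column names are assumptions; match them to the actual CSV header):

```python
import pandas as pd

# Hypothetical: the CSV block from the run output saved as results.csv,
# one row per seed, one column per dataset.
df = pd.read_csv('results.csv')
xd_cols = ['caltech-101', 'dtd', 'eurosat', 'fgvc_aircraft', 'food-101',
           'oxford_flowers', 'oxford_pets', 'stanford_cars', 'sun397', 'ucf101']
dg_cols = ['imagenetv2', 'imagenet-sketch', 'imagenet-adversarial',
           'imagenet-rendition']
print('XD Mean:', df[xd_cols].to_numpy().mean())  # average over datasets and seeds
print('DG Mean:', df[dg_cols].to_numpy().mean())
```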
## 🧪 More experiments
### Base-to-novel setting
First, generate features for each training dataset.

For descriptor features:

```bash
for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101;
do
  python preprocess/generate_description_features.py --dataset $dataset --subsample_classes base
done
```

For word features:

```bash
for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101;
do
  python preprocess/generate_description_features.py --dataset $dataset --descriptions words.list --savename word_features --subsample_classes base
done
```
To get greedy descriptor soups:

```bash
for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101;
do
  sh scripts/ablations/run_get_greedy_descriptor_soup.sh $dataset
done
```

To get greedy word soups:

```bash
for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101;
do
  sh scripts/ablations/run_get_greedy_word_soup.sh $dataset
done
```
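Note the `--subsample_classes base` flag in the feature-generation loops above: in the standard CoOp/Co-CoOp base-to-novel protocol, each dataset's class list is split in half, with the first half used as base classes for training and the second half held out as novel classes. A sketch of that convention (the repo relies on dassl's own implementation):

```python
import math

def subsample_classes(classnames, subsample='base'):
    """CoOp-style base/novel split: first half of the class list is 'base',
    the second half is 'novel'. Sketch of the convention only."""
    m = math.ceil(len(classnames) / 2)
    return classnames[:m] if subsample == 'base' else classnames[m:]

# Example: 10 classes -> 5 base classes, 5 novel classes.
print(subsample_classes([f'class{i}' for i in range(10)], 'base'))
```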
Then run training using the provided bash scripts, for example:

```bash
sh scripts/run_ce_with_eval.btn.sh 5e-05 > run_ce_with_eval.btn.sh_5e-05.o
```

See any bash script named `scripts/*.btn.sh`.
### CoOp soft descriptor ensemble baseline
Run `scripts/ablations/coop_soft_descriptor_ensemble.sh`, which logs to `train_softd.o` and outputs:

```
cache/soft_descriptors/random_8_10_token_8_ensemble/8_random_10_token_word_chains_seed1.list_e0.soft
cache/soft_descriptors/random_8_10_token_8_ensemble/8_random_10_token_word_chains_seed2.list_e0.soft
cache/soft_descriptors/random_8_10_token_8_ensemble/8_random_10_token_word_chains_seed3.list_e0.soft
```

Each file is a list of 8 soft descriptors.
To evaluate, reference `scripts/ablations/run_soft.sh`.
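If you want to inspect a `.soft` file, it is presumably a torch-serialized list of soft-descriptor embeddings (an assumption; the exact format is defined by the training script):

```python
import torch

# Assumption: .soft files are torch-serialized soft-descriptor embeddings.
soft = torch.load(
    'cache/soft_descriptors/random_8_10_token_8_ensemble/'
    '8_random_10_token_word_chains_seed1.list_e0.soft', map_location='cpu')
print(type(soft))   # expected: a list of 8 soft descriptors (10 tokens each)
```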
### More baselines
Many more baselines live in the `scripts/ablations` folder. Run these at your pleasure.