Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
This repository provides the official PyTorch implementation of our CVPR 2024 paper:
<ins>Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models</ins> [Paper]
Authors: <ins>Yabin Zhang</ins>, <ins>Wenjie Zhu</ins>, <ins>Hui Tang</ins>, <ins>Zhiyuan Ma</ins>, <ins>Kaiyang Zhou</ins>, <ins>Lei Zhang</ins>
Overview
This repository contains the implementation of DMN for image classification with a pre-trained CLIP model. We consider four task settings:
- Zero-shot classification in a test-time adaptation manner
- Few-shot classification
- Training-free few-shot classification
- Out-of-distribution generalization
Prerequisites
Hardware
This implementation targets a single-GPU configuration. All experiments can be reproduced on a GPU with more than 10 GB of memory (e.g., a 1080 Ti)!
Environment
The code is tested on PyTorch 1.13.1.
Datasets
We suggest downloading all datasets to a root directory (`${data_root}`) and renaming the directory of each dataset as specified in `${ID_to_DIRNAME}` in `./data/datautils.py`. This allows you to evaluate multiple datasets within the same run. If this is not feasible, you can evaluate different datasets separately and change `${data_root}` accordingly in the bash script.
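The sketch below illustrates how a dataset directory is resolved from `${data_root}`; the dictionary entries and paths here are hypothetical, and the authoritative mapping is `${ID_to_DIRNAME}` in `./data/datautils.py`:

```python
# Minimal sketch of how a dataset folder is resolved (illustrative only;
# see ID_to_DIRNAME in ./data/datautils.py for the exact names the code expects).
import os

ID_to_DIRNAME = {            # hypothetical entries for illustration
    "I": "imagenet",         # set_id "I" -> ${data_root}/imagenet
    "caltech101": "caltech-101",
}

data_root = "/path/to/datasets"   # your ${data_root}
set_id = "I"
dataset_dir = os.path.join(data_root, ID_to_DIRNAME[set_id])
print(dataset_dir)                # -> /path/to/datasets/imagenet
```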
For zero/few-shot classification, we consider 11 datasets: ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, and UCF101.
For out-of-distribution generalization, we consider 4 datasets: ImageNet-V2, ImageNet-A, ImageNet-R, and ImageNet-Sketch.
Run DMN
We provide a simple bash script under `./scripts/run.sh`. You can modify the paths and other arguments in the script, and easily reproduce all results by:

```bash
bash ./scripts/run.sh
```
For simplicity, we use `set_id` to denote different datasets. A complete list of `set_id` values can be found in `${ID_to_DIRNAME}` in `./data/datautils.py`.
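You can also print every available `set_id` directly from that mapping; this snippet assumes it is run from the repository root so that `./data/datautils.py` is importable:

```python
# List all supported set_id values (run from the repository root).
from data.datautils import ID_to_DIRNAME

print(sorted(ID_to_DIRNAME.keys()))
```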
Main Results
Zero-shot Classification
<p align="center"> <img src="figures/zero-shot.png"> </p>

Few-shot Classification
<p align="center"> <img src="figures/few-shot.png"> </p>
<p align="center"> Few-shot classification results on 11 datasets with a ViT-B/16 image encoder. </p>

Out-of-Distribution Generalization
<div align="center">Method | ImageNet(IN) | IN-A | IN-V2 | IN-R | IN-Sketch | Average | OOD Average |
---|---|---|---|---|---|---|---|
CLIP-RN50 | 58.16 | 21.83 | 51.41 | 56.15 | 33.37 | 44.18 | 40.69 |
Ensembled prompt | 59.81 | 23.24 | 52.91 | 60.72 | 35.48 | 46.43 | 43.09 |
CoOp | 63.33 | 23.06 | 55.40 | 56.60 | 34.67 | 46.61 | 42.43 |
CoCoOp | 62.81 | 23.32 | 55.72 | 57.74 | 34.48 | 46.81 | 42.82 |
TPT | 60.74 | 26.67 | 54.70 | 59.11 | 35.09 | 47.26 | 43.89 |
DMN-ZS | 63.87 | 28.57 | 56.12 | 61.44 | 39.84 | 49.97 | 46.49 |
Citation
If you find our code useful or our work relevant, please consider citing:
```bibtex
@inproceedings{zhang2024dual,
  title={Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models},
  author={Zhang, Yabin and Zhu, Wenjie and Tang, Hui and Ma, Zhiyuan and Zhou, Kaiyang and Zhang, Lei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```
Acknowledgements
We thank the authors of CoOp/CoCoOp and TPT for their open-source implementations and instructions on data preparation.