Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models

This repository provides the official PyTorch implementation of our CVPR 2024 paper:

Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models [Paper]

Authors: Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, Lei Zhang

Overview

This repository contains the implementation of DMN for image classification with a pre-trained CLIP. We consider four task settings:

<p align="center"> <img src="figures/acc_gflops.png"> </p>
<p align="center"> Results on the ImageNet dataset under different task settings. </p>
<p align="center"> <img src="figures/framework.png"> </p>
<p align="center"> The overall framework of our DMN. </p>
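As a rough illustration of the memory idea (a minimal sketch, not the authors' actual implementation: the function names, shapes, and mixing weights `alpha`/`beta` here are all assumptions), a memory readout can be written as attention over stored features, with a static memory (few-shot training features) and a dynamic memory (features of previously seen test samples) both contributing to the final logits:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def memory_readout(query, mem_keys, mem_labels, num_classes, temp=0.01):
    """Attend over stored (feature, label) pairs and return class logits."""
    sims = mem_keys @ query                 # cosine similarities, shape (n,)
    weights = np.exp(sims / temp)           # sharpened attention weights
    weights /= weights.sum()
    logits = np.zeros(num_classes)
    for w, y in zip(weights, mem_labels):
        logits[y] += w                      # aggregate attention mass per class
    return logits

def dmn_style_logits(query, text_logits, static_mem, dynamic_mem,
                     num_classes, alpha=1.0, beta=1.0):
    """Combine CLIP text-classifier logits with two memory readouts."""
    out = np.asarray(text_logits, dtype=float).copy()
    if static_mem is not None:              # few-shot training features
        out += alpha * memory_readout(query, *static_mem, num_classes)
    if dynamic_mem is not None:             # past test-sample features
        out += beta * memory_readout(query, *dynamic_mem, num_classes)
    return out
```

In the zero-shot setting the static memory would be absent and only the dynamic memory and the text classifier contribute; see the paper for the actual formulation.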

Prerequisites

Hardware

This implementation targets a single-GPU configuration. All experiments can be reproduced on a GPU with more than 10 GB of memory (e.g., a 1080Ti).

Environment

The code is tested on PyTorch 1.13.1.
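A quick way to check the local environment before running (a convenience helper for illustration, not part of the repository):

```python
import importlib.util

def check_env(min_version=(1, 13)):
    """Report whether PyTorch is importable and meets the tested version."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    major, minor = (int(x) for x in torch.__version__.split(".")[:2])
    status = "OK" if (major, minor) >= min_version else "older than tested version"
    return f"torch {torch.__version__} ({status})"

print(check_env())
```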

Datasets

We suggest downloading all datasets to a root directory (${data_root}) and renaming the directory of each dataset as specified in ${ID_to_DIRNAME} in ./data/datautils.py. This allows you to evaluate multiple datasets within the same run.
If this is not feasible, you can evaluate each dataset separately and change ${data_root} accordingly in the bash script.

For zero/few-shot classification, we consider 11 datasets: ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, and UCF101.

For out-of-distribution generalization, we consider 4 datasets: ImageNet-A, ImageNet-V2, ImageNet-R, and ImageNet-Sketch.

Run DMN

We provide a simple bash script under ./scripts/run.sh. You can modify the paths and other arguments in the script, and then reproduce all results with:

```bash
bash ./scripts/run.sh
```

For simplicity, we use `set_id` to denote different datasets. The complete list of `set_id` values can be found in ${ID_to_DIRNAME} in ./data/datautils.py.
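For illustration, the mapping presumably resembles the following (the keys and directory names below are hypothetical; the authoritative entries live in ./data/datautils.py):

```python
import os

# Hypothetical entries in the style of ID_to_DIRNAME in ./data/datautils.py;
# consult the repository for the real set_id keys and directory names.
ID_to_DIRNAME = {
    "I": "imagenet",
    "DTD": "dtd",
    "Flower102": "oxford_flowers",
}

def dataset_path(data_root, set_id):
    """Resolve a set_id to its dataset directory under the data root."""
    return os.path.join(data_root, ID_to_DIRNAME[set_id])
```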

Main Results

Zero-shot Classification

<p align="center"> <img src="figures/zero-shot.png"> </p> <p align="center"> Zero-shot classification results. </p>

Few-shot Classification

<p align="center"> <img src="figures/few-shot.png"> </p> <p align="center"> Few-shot classification results on 11 datasets with a ViT-B/16 image encoder. </p>

Out-of-Distribution Generalization

| Method | ImageNet (IN) | IN-A | IN-V2 | IN-R | IN-Sketch | Average | OOD Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP-RN50 | 58.16 | 21.83 | 51.41 | 56.15 | 33.37 | 44.18 | 40.69 |
| Ensembled prompt | 59.81 | 23.24 | 52.91 | 60.72 | 35.48 | 46.43 | 43.09 |
| CoOp | 63.33 | 23.06 | 55.40 | 56.60 | 34.67 | 46.61 | 42.43 |
| CoCoOp | 62.81 | 23.32 | 55.72 | 57.74 | 34.48 | 46.81 | 42.82 |
| TPT | 60.74 | 26.67 | 54.70 | 59.11 | 35.09 | 47.26 | 43.89 |
| DMN-ZS | 63.87 | 28.57 | 56.12 | 61.44 | 39.84 | 49.97 | 46.49 |
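As a sanity check on the table, the Average column appears to be the mean over all five datasets and the OOD Average the mean over the four out-of-distribution sets:

```python
# Accuracies copied from the table rows: [IN, IN-A, IN-V2, IN-R, IN-Sketch]
results = {
    "CLIP-RN50": [58.16, 21.83, 51.41, 56.15, 33.37],
    "DMN-ZS":    [63.87, 28.57, 56.12, 61.44, 39.84],
}

def averages(accs):
    """Return (mean over all five sets, mean over the four OOD sets)."""
    return round(sum(accs) / 5, 2), round(sum(accs[1:]) / 4, 2)
```

For example, `averages(results["DMN-ZS"])` reproduces the reported 49.97 and 46.49.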

Citation

If you find our code useful or our work relevant, please consider citing:

```bibtex
@inproceedings{zhang2024dual,
  title={Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models},
  author={Zhang, Yabin and Zhu, Wenjie and Tang, Hui and Ma, Zhiyuan and Zhou, Kaiyang and Zhang, Lei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```

Acknowledgements

We thank the authors of CoOp/CoCoOp and TPT for their open-source implementations and instructions on data preparation.