Code for the ICML 2024 paper "Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning"

Zhuo Huang<sup>1</sup>, Chang Liu<sup>2</sup>, Yinpeng Dong<sup>3</sup>, Hang Su<sup>3, 4</sup>, Shibao Zheng<sup>2</sup>, Tongliang Liu<sup>1</sup>

<sup>1</sup>The University of Sydney, <sup>2</sup>Shanghai Jiao Tong University, <sup>3</sup>Tsinghua University, <sup>4</sup>Peng Cheng Laboratory

Although vision models such as Contrastive Language-Image Pre-Training (CLIP) show impressive generalization performance, their zero-shot robustness is still limited under Out-of-Distribution (OOD) scenarios without fine-tuning. Instead of undesirably providing human supervision as commonly done, it is possible to take advantage of Multi-modal Large Language Models (MLLMs) that hold powerful visual understanding abilities. However, MLLMs are shown to struggle with vision problems due to the incompatibility of tasks, thus hindering their utilization. In this paper, we propose to effectively leverage MLLMs to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models. By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner. To solve the incompatibility issue, we propose a novel Denoising In-Context Learning (DICL) strategy to align vision tasks with MLLMs. Concretely, by estimating a transition matrix that captures the probability of one class being confused with another, an instruction containing a correct exemplar and an erroneous one from the most probable noisy class can be constructed. Such an instruction can help any MLLMs with ICL ability to detect and rectify incorrect predictions of vision models. Through extensive experiments on ImageNet, WILDS, DomainBed, and other OOD datasets, we carefully validate the quantitative and qualitative effectiveness of our method.

Overview

MLLMs

MMICL: MMICL-FLANT5XXL; MMICL-Tiny

Otter: OTTER-Image-LLaMA7B-LA-InContext

git clone https://github.com/Luodian/Otter.git

To avoid top-level references, please change lines 19-21 in Otter/src/otter_ai/models/otter/modeling_otter.py to:

from falcon.modelling_RW import RWForCausalLM
from mpt.modeling_mpt import MPTForCausalLM
from mpt_redpajama.mosaic_gpt import MosaicGPT

Vision Models

CLIP: RN50, RN101, RN50x4, RN50x16, RN50x64, ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px, ViT-g, ViT-G, etc. Please see CLIP from OpenAI and CLIP from OpenCLIP.
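
For instance, a CLIP backbone can be loaded with the OpenAI clip package as follows (a minimal sketch; the checkpoint name is only an example and should match the --vit_type you pass to the scripts below):

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
# Any of the listed architectures can be substituted, e.g. "RN50" or "ViT-B/16".
model, preprocess = clip.load("ViT-L/14", device=device)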

Capabilities

Machine Vision Therapy can enhance visual robustness in various circumstances, for example on the OOD datasets listed below.

Setup

We conduct our experiments with Anaconda3. After installing Anaconda3, create your own environment and install the required Python packages as follows:

pip install -r requirements.txt

Datasets

We use various ID and OOD datasets, including ImageNet, ImageNet-A, ImageNet-R, ImageNet-Sketch, ImageNet-V2, ImageNet-V, ImageNet-C, PACS, VLCS, OfficeHome, DomainNet, iWildCam, CelebA, Spawrious, CIFAR10/100, and MNIST.

To specify a dataset:

--dataset: specify a dataset from ['mnist', 'cifar10', 'iwildcam', 'celebA', 'imagenet', 'cifar100', 'domainbed']

Further, for ImageNet-based datasets and DomainBed-based datasets, please set:

--chosen_name: specify an ImageNet-based dataset from ['ImageNet', 'ImageNetV2', 'ImageNetA', 'ImageNetR', 'ImageNetSketch', 'ImageNetV', 'ImageNetC', 'domainbed'] or a DomainBed-based dataset from ['PACS', 'VLCS', 'OfficeHome', 'DomainNet']

Moreover, each DomainBed dataset contains several domains, so in this case you also need to set the target domain:

--target: choose one target domain.

ImageNet-based datasets

To download ImageNet-based datasets, please follow the instructions below:

export DATA_LOCATION=~/data
cd $DATA_LOCATION

ImageNet

ImageNet-A

wget https://people.eecs.berkeley.edu/~hendrycks/imagenet-a.tar
tar -xvf imagenet-a.tar
rm imagenet-a.tar

ImageNet-R

wget https://people.eecs.berkeley.edu/~hendrycks/imagenet-r.tar
tar -xvf imagenet-r.tar
rm imagenet-r.tar

ImageNet-Sketch

Download links:

ImageNet-V2

wget https://s3-us-west-2.amazonaws.com/imagenetv2public/imagenetv2-matched-frequency.tar.gz
tar -xvf imagenetv2-matched-frequency.tar.gz
rm imagenetv2-matched-frequency.tar.gz

ImageNet-V

Download links:

ImageNet-C

DomainBed

iWildCam from WILDS

pip install wilds
python wilds/download_datasets.py --root_dir data --dataset iwildcam
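
Alternatively, a minimal sketch using the WILDS Python API to download iWildCam directly (equivalent to the script above):

from wilds import get_dataset

# Downloads iWildCam into ./data if it is not already present.
dataset = get_dataset(dataset="iwildcam", root_dir="data", download=True)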

CelebA

The code below (adapted from the WILDS CelebA dataset class) makes the prediction target configurable via target_name; the chosen target attribute is then removed from the list of confounders:

    def __init__(self, version=None, root_dir='data', download=False, split_scheme='official', target_name='Male'):
        self._version = version
        self._data_dir = self.initialize_data_dir(root_dir, download)
        # Candidate CelebA attributes; the chosen target is excluded from the confounders below.
        confounder_names = ['Male', 'Wearing_Hat', 'Smiling', 'Eyeglasses', 'Blond_Hair', 'Mustache',
                            'Attractive', 'Wearing_Lipstick', 'Wearing_Necklace', 'Wearing_Necktie', 'Young', 'Bald']
        confounder_names.remove(target_name)

Spawrious dataset

Running Experiments

As described in our paper, there are three stages: Transition Matrix Estimation, Denoising In-Context Learning, and Fine-Tuning.

Transition Matrix Estimation

First, please run modeling_transition_matrix.py to generate a transition matrix tensor, which is stored in .tran_mat/

CUDA_VISIBLE_DEVICES=0,1,2,3 python modeling_transition_matrix.py --dataset [specify general dataset] --chosen_name [specify detailed dataset] --vit_type [vision model type] --targets [chosen target domain] --labeling_budget [number of labels for each class in the support set]

For example:

CUDA_VISIBLE_DEVICES=0,1,2,3 python modeling_transition_matrix.py --dataset imagenet --chosen_name ImageNetA --vit_type vit-l --labeling_budget 10
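
For intuition, the transition matrix can be viewed as a row-normalized confusion matrix estimated on the labeled support set: entry [i, j] approximates the probability that class i is predicted as class j. Below is a conceptual sketch, not the exact implementation in modeling_transition_matrix.py (the function name and inputs are illustrative):

import numpy as np

def estimate_transition_matrix(true_labels, predicted_labels, num_classes):
    """Row-normalized confusion matrix over the labeled support set.

    Entry [i, j] estimates the probability that class i is predicted as class j.
    """
    counts = np.zeros((num_classes, num_classes), dtype=np.float64)
    for y, y_hat in zip(true_labels, predicted_labels):
        counts[y, y_hat] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # guard against classes with no support examples
    return counts / row_sums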

Denoising In-Context Learning

Then, please conduct DICL by running dicl.py:

CUDA_VISIBLE_DEVICES=0,1,2,3 python dicl.py --dataset [specify general dataset] --chosen_name [specify detailed dataset] --vit_type [vision model type] --targets [chosen target domain] --num_retrieve [number of retrieved exemplars from the support set] --num_exemplar [number of exemplars in the support set] --stop_iteration [number of iterations for conducting DICL]

For example:

CUDA_VISIBLE_DEVICES=0,1,2,3 python dicl.py --dataset imagenet --chosen_name ImageNetA --vit_type vit-l --num_retrieve 3 --num_exemplar 3 --stop_iteration 5000
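
Conceptually, DICL uses the transition matrix to build each instruction: for a sample predicted as some class, the most probable noisy class is read off the corresponding row, and exemplars of both classes are retrieved from the support set. A hedged sketch of this selection step (the function names and the support_set structure are assumptions, not the repository's API):

import numpy as np

def most_confusable_class(transition_matrix, predicted_class):
    """Class most likely to be confused with the vision model's prediction."""
    row = np.array(transition_matrix[predicted_class], dtype=np.float64).copy()
    row[predicted_class] = 0.0  # exclude the predicted class itself
    return int(np.argmax(row))

def build_dicl_exemplars(transition_matrix, predicted_class, support_set):
    """Pick one correct exemplar and one from the most probable noisy class.

    support_set is assumed to map a class index to a list of exemplar images.
    """
    noisy_class = most_confusable_class(transition_matrix, predicted_class)
    return support_set[predicted_class][0], support_set[noisy_class][0]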

After DICL, the therapy results will be stored in a .json file in .logits/. The format is:

{
  "mvt_acc": accuracies of MVT,
  "clip_acc": accuracies of the vision model,
  "logits": stored logits for further fine-tuning
}

where the logits field is formatted as:

{image_path: [top_N_predictions, top_N_logits]}

The predicted class index with the largest logit value is used as the learning target for Fine-Tuning.
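
For reference, here is a minimal sketch of reading the stored logits and deriving the fine-tuning targets (the file name is a placeholder; the actual path under .logits/ depends on your run):

import json

# Placeholder file name; the actual path under .logits/ depends on the run.
with open(".logits/therapy_results.json", "r") as f:
    results = json.load(f)

learning_targets = {}
for image_path, (top_n_predictions, top_n_logits) in results["logits"].items():
    # The prediction with the largest logit becomes the fine-tuning target.
    best = max(range(len(top_n_logits)), key=lambda i: top_n_logits[i])
    learning_targets[image_path] = top_n_predictions[best]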

Fine-Tuning

Please run fine_tuning.py as follows:

CUDA_VISIBLE_DEVICES=0 python fine_tuning.py --dataset [specify general dataset] --chosen_name [specify detailed dataset] --vit_type [vision model type] --targets [chosen target domain]

For example:

CUDA_VISIBLE_DEVICES=0 python fine_tuning.py --dataset imagenet --chosen_name ImageNetA --vit_type vit-l

In fine_tuning.py, the data paths of all test data (meta_file) are stored in .txt files, and the therapy results (teacher_json) contain only the subset of test data that needs to be enhanced.
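
As a rough illustration of the objective (an assumption about the setup, not the exact training loop in fine_tuning.py), the denoised predictions act as pseudo-labels, and the vision model can be updated with a cross-entropy loss on CLIP-style similarity logits:

import torch.nn.functional as F

def pseudo_label_loss(image_features, text_features, pseudo_labels, temperature=0.01):
    """Cross-entropy between CLIP-style similarity logits and the denoised pseudo-labels.

    image_features (N x D) and text_features (C x D) are assumed to be L2-normalized;
    pseudo_labels (N,) holds the class indices produced by DICL.
    """
    logits = image_features @ text_features.t() / temperature
    return F.cross_entropy(logits, pseudo_labels)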

Reference

📑 If you find our paper and code helpful for your research, please consider citing:

@article{huang2023machine,
  title={Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning},
  author={Huang, Zhuo and Liu, Chang and Dong, Yinpeng and Su, Hang and Zheng, Shibao and Liu, Tongliang},
  journal={arXiv preprint arXiv:2312.02546},
  year={2023}
}

If you have any problems, please feel free to raise an issue or directly contact zhuohuang.ai@gmail.com.