Reinforcement Learning with CLIP Feedback :sparkles:


The official implementation of Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models.

Table of Contents

<!--ts--> <!--te-->

News

Introduction

<div align="justify"> One fascinating aspect of pre-trained vision-language models~(VLMs) learning under language supervision is their impressive zero-shot generalization capability. However, this ability is hindered by distribution shifts between the training and testing data. Previous test time adaptation~(TTA) methods for VLMs in zero-shot classification rely on minimizing the entropy of model outputs, tending to be stuck in incorrect model predictions. In this work, we propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident. Specifically, a CLIP model is adopted as the reward model during TTA and provides feedback for the VLM. Given a single test sample, the VLM is forced to maximize the CLIP reward between the input and sampled results from the VLM output distribution. The proposed <strong>reinforcement learning with CLIP feedback~(RLCF)</strong> framework is highly flexible and universal. Beyond the classification task, with task-specific sampling strategies and a proper reward baseline choice, RLCF can be easily extended to not only discrimination tasks like retrieval but also generalization tasks like image captioning, improving the zero-shot generalization capacity of VLMs. According to the characteristics of these VL tasks, we build different fully TTA pipelines with RLCF to improve the zero-shot generalization ability of various VLMs. Extensive experiments along with promising empirical results demonstrate the effectiveness of RLCF. <div align=center> <img src="assets/clip-reward.png" style="zoom:100%"/></pr> </div> </div>

Features

Installation

The code for the three tasks in this repo is independent, so you can set them up task by task.

Prepare data

First of all, you need to download the datasets and the pre-trained models.

Generally, directories are organized as follows:

${ROOT}
├── dataset
│   │
│   ├──tta_data
│   │   ├──ImageNet
│   │   ├──imagenet-a
│   │   ├──imagenet-r
│   │   ├──ImageNet-Sketch
│   │   └──imagenetv2-matched-frequency-format-val
│   │       
│   ├──coco2014
│   ├──nocaps
│   └──flickr30k
│
├── code
│   └── RLCF
│       ├──caption
│       ├──clipscore
│       ├──retrieval
│       └──TPT  
│ 
├── output (save the output of the program)
│
│
├── pretrained
│       ├──opt-125m
│       ├──coop
│       │    └──coop_16shots_nctx4_cscFalse_ctpend_vitb16_seed1
│       │
│       └── clip (download the CLIP pre-trained weights and put them here)
│            └── ViT-B-16.pt
│
...
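If you want a head start on this layout, the commands below sketch one way to create the folders and fetch the CLIP ViT-B/16 weights. They are a convenience, not a required step: run them from ${ROOT}, adjust paths as needed, and note that the last line assumes the clip package (installed via the requirements in the next section) is already available.

mkdir -p dataset/tta_data/{ImageNet,imagenet-a,imagenet-r,ImageNet-Sketch,imagenetv2-matched-frequency-format-val}
mkdir -p dataset/{coco2014,nocaps,flickr30k} output code/RLCF
mkdir -p pretrained/{opt-125m,coop,clip}
# downloads ViT-B-16.pt into pretrained/clip
python -c "import clip; clip.load('ViT-B/16', device='cpu', download_root='pretrained/clip')"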

Dependency

Requires Python >= 3.8 and PyTorch >= 1.12. The following commands were tested on a Linux machine with NVIDIA driver 525.105.17 and CUDA 11.7.

conda create --name rlcf python=3.8.5
conda activate rlcf
pip install -r requirements.txt

I use

torch==1.13.1+cu117
torchvision==0.14.1+cu117
--extra-index-url https://download.pytorch.org/whl/cu117

in the requirements file.

If you use another version of CUDA, simply remove them (the last 3 lines of requirements.txt) and then run:

conda create --name rlcf python=3.8.5
conda activate rlcf
conda install pytorch==1.13.1 torchvision==0.14.1 -c pytorch
pip install -r requirements.txt
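After installing either way, an optional quick check confirms that the CUDA build of PyTorch was picked up:

python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"

It should print the versions pinned above and True on a machine with a working CUDA setup.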

Classification

<div align=center> <img src="assets/cls.png" style="zoom:100%"/> </div>

Then you can cd to TPT/scripts and run:

bash rlcf-prompt.sh 0

To evaluate on ImageNet, ImageNet-V2, and ImageNet-Sketch (which have 1,000 classes), you will need a GPU with more than 16GB of memory; a 16GB card is not enough.

bash rlcf-tune.sh 0

A 16GB GPU should be enough for this script.
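In these commands (and in the retrieval and captioning sections below), the trailing number is passed straight through to the script; it presumably selects the GPU to run on, but that is an assumption on my part rather than something documented here, so check the script header before relying on it. For example:

bash rlcf-prompt.sh 1   # assumed: run on GPU 1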

Retrieval

<div align=center> <img src="assets/ret.png" style="zoom:100%"/> </div>

Then you can cd to retrieval/scripts and run:

bash tta_coco_ret.sh 0
bash tta_flickr_ret.sh 0

Captioning

<div align=center> <img src="assets/cap.png" style="zoom:100%"/> </div>

Then you can cd to caption/scripts and run:

bash tta_capdec_c2f.sh 0
bash tta_capdec_c2n.sh 0
bash tta_clipcap_c2f.sh 0
bash tta_clipcap_c2n.sh 0
bash train_capdec_coco.sh 0
bash train_clipcap_coco.sh 0

You need to download the CLIP-features-for-coco or CLIP-features-for-flickr before training.

Citations

@inproceedings{zhao2024testtime,
  title={Test-Time Adaptation with {CLIP} Reward for Zero-Shot Generalization in Vision-Language Models},
  author={Shuai Zhao and Xiaohan Wang and Linchao Zhu and Yi Yang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=kIP0duasBb}
}

Acknowledgements

This repo is built upon these previous works.

<!--ts--> <!--te-->

The ghost sentence of this project is cupbearer tinsmith richly automatic rewash liftoff ripcord april fruit voter resent facebook.