<div align=center> <h1> CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No </h1> </div>

<div align=center>
<a href="https://arxiv.org/abs/2308.12213"> <img src="https://img.shields.io/badge/%F0%9F%93%96-ICCV_2023-red.svg?style=flat-square"> </a>
<a href="https://xmengli.github.io/"> <img src="https://img.shields.io/badge/%F0%9F%9A%80-xmed_Lab-ed6c00.svg?style=flat-square"> </a>
</div>

:rocket: Updates
- The code of CLIPN with hand-crafted prompts has been released (./hand-crafted).
- The code of CLIPN with learnable prompts has been released (./src).
- Thanks to the valuable suggestions from the reviewers of CVPR 2023 and ICCV 2023, our paper was significantly improved and has been accepted to ICCV 2023.
- If you are interested in CLIP-based open-vocabulary tasks, please feel free to check out our other work, "CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks" (github).
:star: Highlights of CLIPN
- CLIPN attains SoTA performance in zero-shot OOD detection while inheriting the in-distribution (ID) classification ability of CLIP.
- CLIPN offers an approach to unsupervised prompt learning on an image-text-paired web dataset.
:hammer: Installation
- The main Python libraries of our experimental environment are listed in requirements.txt. You can install CLIPN as follows:
```shell
git clone https://github.com/xmed-lab/CLIPN.git
cd CLIPN
conda create -n CLIPN
conda activate CLIPN
pip install -r ./requirements.txt
```
:computer: Prepare Dataset
- Pre-training dataset, CC3M. To download the CC3M dataset as a webdataset, please follow img2dataset (a download sketch is given after this list).
When you have downloaded CC3M, please set your data root in ./src/run.sh.
- OOD detection datasets.
- ID dataset, ImageNet-1K: The ImageNet-1k dataset (ILSVRC-2012) can be downloaded here.
- OOD datasets, iNaturalist, SUN, Places, and Texture. Please follow the instructions in the MOS and MCM repositories to download the subsampled datasets, in which classes that semantically overlap with ImageNet-1K have been removed.
When you have downloaded the above datasets, please set your data root in ./src/tuning_util.py.
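For reference, below is a minimal sketch of downloading CC3M into webdataset shards with img2dataset's Python API. The TSV file name, column names, and output folder are assumptions; see the img2dataset documentation (its CC3M example) for the exact options, and make the output folder match the data root you set in ./src/run.sh.

```python
# A minimal sketch (assumed file/column names) of fetching CC3M as webdataset shards.
from img2dataset import download

download(
    url_list="cc3m_train.tsv",      # assumed TSV with "caption" and "url" columns
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",     # shards readable by the pre-training dataloader
    output_folder="./data/cc3m",    # point the data root in ./src/run.sh here
    image_size=256,
    processes_count=16,
    thread_count=64,
)
```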
:key: Pre-Train and Evaluate CLIPN
- Pre-train CLIPN on CC3M. This step equips CLIP with "no" logic via the web dataset (an illustrative sketch of the training signal is given after the commands below).
- The model of CLIPN is defined in ./src/open_clip/model.py. Here, you can find a group of learnable 'no' token embeddings defined in Line 527.
- The function for loading the pre-trained CLIP parameters is defined in ./src/open_clip/factory.py.
- The loss functions are defined in ./src/open_clip/loss.py.
- You can pre-train CLIPN on ViT-B-32 and ViT-B-16 by:
```shell
cd ./src
sh run.sh
```
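To make the "no" logic concrete, here is an illustrative PyTorch sketch of the training signal described in the paper: the standard CLIP contrastive loss for the "yes" text encoder, a simplified binary-opposite term that keeps the "no" probability low for matched image-caption pairs, and a semantic-opposite term that pushes the "yes" and "no" embeddings of the same caption apart. This is not the exact implementation in ./src/open_clip/loss.py; all names here are hypothetical.

```python
# Illustrative sketch only; the reference losses live in ./src/open_clip/loss.py.
import torch
import torch.nn.functional as F

def clipn_style_losses(img_feat, txt_yes, txt_no, tau=0.07):
    """img_feat, txt_yes, txt_no: (N, D) L2-normalized features of N paired
    images, captions, and "no"-prompted captions."""
    n = img_feat.size(0)
    targets = torch.arange(n, device=img_feat.device)

    # 1) Standard CLIP contrastive loss for the "yes" text encoder.
    logits = img_feat @ txt_yes.t() / tau                      # (N, N)
    loss_clip = 0.5 * (F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.t(), targets))

    # 2) Simplified binary-opposite term: for a matched pair, the "yes" text should
    #    beat the "no" text, i.e. the per-pair "no" probability should stay low.
    sim_yes = (img_feat * txt_yes).sum(-1) / tau               # (N,)
    sim_no = (img_feat * txt_no).sum(-1) / tau                 # (N,)
    p_no = torch.softmax(torch.stack([sim_yes, sim_no], -1), -1)[:, 1]
    loss_no = -torch.log(1.0 - p_no + 1e-8).mean()

    # 3) Semantic-opposite term: push "yes"/"no" embeddings of the same caption apart.
    loss_opposite = (1.0 + (txt_yes * txt_no).sum(-1)).mean()

    return loss_clip + loss_no + loss_opposite
```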
- Zero-shot evaluation of CLIPN on ImageNet-1K.
- Metrics and the evaluation pipeline are defined in ./src/zero_shot_infer.py. Here you can find three baseline methods and our two inference algorithms, CTW and ATD (see Lines 91-96); a conceptual sketch of both is given after the command below.
- Dataset details are defined in ./src/tuning_util.py.
- The inference models, including the conversion of the text encoders into classifiers, are defined in ./src/classification.py.
- You can download the models provided in the table below or use models pre-trained by yourself. Then set the path of your models in the main function of ./src/zero_shot_infer.py. Finally, evaluate CLIPN by:
```shell
python3 zero_shot_infer.py
```
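For readers who want the gist of CTW and ATD before reading ./src/zero_shot_infer.py, here is a conceptual sketch of the two decision rules as described in the paper: CTW flags an image as OOD when the "no" probability of its top ID class wins, and ATD collects the leftover probability mass into an extra "unknown" class. Variable names and the temperature below are illustrative, not the repository's exact code.

```python
# Conceptual sketch of the CTW and ATD decision rules; see ./src/zero_shot_infer.py
# (Lines 91-96) for the reference implementation. Names here are hypothetical.
import torch

def ctw_and_atd(sim_yes, sim_no, tau=0.01):
    """sim_yes / sim_no: (B, C) cosine similarities between B image features and
    the "yes" / "no" classifier weights of C ID classes."""
    p_cls = torch.softmax(sim_yes / tau, dim=-1)                                 # (B, C)
    p_no = torch.softmax(torch.stack([sim_yes, sim_no], -1) / tau, -1)[..., 1]   # (B, C)

    # CTW (competing-to-win): for the top ID class, "no" beats "yes" -> OOD.
    top_cls = p_cls.argmax(dim=-1, keepdim=True)                                 # (B, 1)
    is_ood_ctw = p_no.gather(-1, top_cls).squeeze(-1) > 0.5

    # ATD (agreeing-to-differ): leftover probability forms an extra "unknown" class.
    p_yes_cls = p_cls * (1.0 - p_no)                                             # (B, C)
    p_unknown = 1.0 - p_yes_cls.sum(dim=-1)                                      # (B,)
    is_ood_atd = p_unknown > p_yes_cls.max(dim=-1).values

    return is_ood_ctw, is_ood_atd
```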
:blue_book: Reproduced Results
To ensure reproducibility, we conducted three repeated experiments under each configuration. The following are the most recent results reproduced before open-sourcing.
- ImageNet-1K
<font color='red'> The performance in this table is better than in our paper </font>, because we add an averaged learnable "no" prompt (see Lines 600-616 in ./src/open_clip/model.py).
:pencil: Other Tips
There are several important factors that could affect the performance:
- Class prompt texts. At inference time, prompt texts are used to generate the classifier weights (see ./src/prompt/prompt.txt). You can try designing higher-performance inference prompts for CLIPN; a sketch of how prompts become classifier weights is given after this list.
- The number of learnable "no" tokens. It is currently set to 16; you can vary it to find an optimal value.
- If you have any ideas to enhance CLIPN or to transfer the idea to other topics, feel free to reach out; I am happy to discuss them with you.
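As a reference for the first tip, here is a minimal sketch of how class prompt texts are typically turned into classifier weights using the vanilla open_clip API (CLIPN's own conversion, including the "no" classifier, lives in ./src/classification.py). The templates and class names below are illustrative placeholders, not the contents of ./src/prompt/prompt.txt.

```python
# Minimal sketch with vanilla open_clip; CLIPN's text-encoder-to-classifier
# conversion (including the "no" branch) is in ./src/classification.py.
import torch
import open_clip

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a photo of the small {}."]
classnames = ["goldfish", "tabby cat"]   # hypothetical ID class names

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

with torch.no_grad():
    weights = []
    for name in classnames:
        tokens = tokenizer([t.format(name) for t in templates])
        feats = model.encode_text(tokens)                       # one feature per template
        feats = feats / feats.norm(dim=-1, keepdim=True)
        weights.append(feats.mean(dim=0))                       # average over templates
    classifier = torch.stack(weights)                           # (C, D) classifier weights
    classifier = classifier / classifier.norm(dim=-1, keepdim=True)
```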
:books: Citation
If you find our paper helpful, please consider citing it in your publications.
```bibtex
@inproceedings{wang2023clipn,
  title={CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No},
  author={Wang, Hualiang and Li, Yi and Yao, Huifeng and Li, Xiaomeng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={1802--1812},
  year={2023}
}
```
:beers: Acknowledgements
We sincerely appreciate these three highly valuable repositories: open_clip, MOS, and MCM.