CLIPTrase
[ECCV24] Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
1. Introduction
CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite this success, applying CLIP to OVSS remains challenging because its image-level alignment training limits performance on tasks that require detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach yields notable improvements in segmentation accuracy and maintains semantic coherence across objects. Experiments show that CLIPtrase is 22.3% ahead of CLIP on average across 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods.
Full paper and supplementary materials: arXiv
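To make the idea concrete, here is a minimal sketch of the training-free pipeline described above: patch features are recalibrated by their self-correlation and then matched against class text embeddings. The shapes, temperature value, and function name are illustrative assumptions, not the paper's exact implementation.

```python
# Conceptual sketch only -- not the official CLIPtrase implementation.
import torch
import torch.nn.functional as F

def self_correlation_segment(patch_feats, text_feats, tau=0.07):
    """patch_feats: (N, D) CLIP patch embeddings; text_feats: (C, D) class text embeddings."""
    p = F.normalize(patch_feats, dim=-1)           # (N, D), unit-norm patch features
    corr = p @ p.t()                               # (N, N) patch self-correlation
    weights = torch.softmax(corr / tau, dim=-1)    # emphasize semantically related patches
    refined = weights @ patch_feats                # aggregate features over correlated patches
    refined = F.normalize(refined, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = refined @ t.t()                       # (N, C) patch-to-class similarity
    return logits.argmax(dim=-1)                   # per-patch class indices

# Toy example: 196 patches (14x14 for ViT-B/16), 512-dim features, 3 classes.
labels = self_correlation_segment(torch.randn(196, 512), torch.randn(3, 512))
```

In practice the patch features come from CLIP's visual encoder and the text features from its text encoder; the actual recalibration used by clip_self_correlation.py is more involved than this toy version.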
1.1. Global Patch
1.2. Model Architecture
2. Code
2.1. Environments
- Base environment: pytorch==1.12.1, torchvision==0.13.1 (CUDA 11.3)
python -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
- Detectron2 version: additionally install detectron2==0.6
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
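Optionally, you can verify that the pinned versions installed correctly with a quick check like the following (the expected version strings in the comments follow from the commands above):

```python
# Sanity check for the environment described above.
import torch
import torchvision
import detectron2

print(torch.__version__)         # expected: 1.12.1+cu113
print(torchvision.__version__)   # expected: 0.13.1+cu113
print(detectron2.__version__)    # expected: 0.6
print(torch.cuda.is_available()) # should be True for GPU runs
```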
2.2. Data preparation
- We follow the detectron2 format for the datasets; for the specific preprocessing steps, please refer to MaskFormer and SimSeg. Update configs/dataset_cfg.py to your own dataset paths. The expected directory layout is:
datasets/
--coco/
----...
----val2017/
----stuffthingmaps_detectron2/
------val2017/
--VOC2012/
----...
----images_detectron2/
------val/
----annotations_detectron2/
------val/
--pcontext/
----...
----val/
------image/
------label/
----pcontext_full/
----...
----val/
------image/
------label/
--ADEChallengeData2016/
----...
----images/
------validation/
----annotations_detectron2/
------validation/
--ADE20K_2021_17_01/
----...
----images/
------validation/
----annotations_detectron2/
------validation/
- You can also use your own dataset; make sure it has image and gt files, and that the value of each pixel in the gt image is its corresponding label. A sketch of registering such a dataset in the detectron2 format is shown below.
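Here is one possible way to register such a custom dataset using detectron2's built-in load_sem_seg loader; the dataset name, paths, and class list below are placeholders, not files shipped with this repository.

```python
# Example registration of a custom semantic segmentation dataset in detectron2 format.
from detectron2.data import DatasetCatalog, MetadataCatalog
from detectron2.data.datasets import load_sem_seg

# Hypothetical paths -- adjust to your own dataset.
IMAGE_DIR = "datasets/my_dataset/images/val"
GT_DIR = "datasets/my_dataset/annotations/val"   # per-pixel label PNGs

DatasetCatalog.register(
    "my_dataset_sem_seg_val",
    lambda: load_sem_seg(GT_DIR, IMAGE_DIR, gt_ext="png", image_ext="jpg"),
)
MetadataCatalog.get("my_dataset_sem_seg_val").set(
    stuff_classes=["background", "cat", "dog"],  # example class names
    ignore_label=255,
    evaluator_type="sem_seg",
)
```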
2.3. Global patch demo
- We provide a demo of the global patch in the notebook global_patch_demo.ipynb, where you can visualize the global patch phenomenon mentioned in our paper.
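If you cannot run the notebook, the following rough sketch shows one way to surface the same phenomenon: a "global" patch is one whose feature correlates highly with almost every other patch. The grid size, feature dimension, and random inputs are placeholders; real CLIP patch features should be substituted.

```python
# Rough visualization sketch, independent of the notebook.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def global_patch_map(patch_feats, grid=14):
    """patch_feats: (grid*grid, D) CLIP patch embeddings for one image."""
    p = F.normalize(patch_feats, dim=-1)
    corr = p @ p.t()                  # (N, N) cosine self-correlation
    mean_corr = corr.mean(dim=-1)     # high values indicate "global" behaviour
    return mean_corr.reshape(grid, grid)

heat = global_patch_map(torch.randn(196, 512))   # replace with real CLIP patch features
plt.imshow(heat.numpy(), cmap="viridis")
plt.colorbar()
plt.title("mean patch self-correlation")
plt.show()
```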
2.4. Training-free OVSS
- Running with a single GPU
python clip_self_correlation.py
- Running with multiple GPUs (detectron2 version)
Update: We provide a detectron2 framework version; the CLIP state keys are modified and can be found here. You can download them and put them in the outputs folder. Note: the results of the d2 version differ slightly from those in the paper due to differences in preprocessing and resolution.
python -W ignore train_net.py --eval-only --config-file configs/clip_self_correlation.yaml --num-gpus 4 OUTPUT_DIR your_output_path MODEL.WEIGHTS your_model_path
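For context, the class-name text embeddings that the patch features are matched against are typically built with the OpenAI CLIP package as sketched below; the prompt template and class list are assumptions for illustration, not necessarily the exact ones used by the scripts above.

```python
# Illustrative zero-shot text-embedding construction with the OpenAI CLIP package.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

class_names = ["person", "dog", "sofa"]                      # example classes
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feats = model.encode_text(prompts)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)  # (C, 512), ready to match against patch features
```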
- Results
Evaluated with CLIP-B/16 on a single RTX 3090 across the 9 benchmark settings on COCO, ADE, PASCAL Context, and VOC. Our results do not use any post-processing such as DenseCRF.
Citation
- If you find this project useful, please consider citing:
@InProceedings{shao2024explore,
  title={Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation},
  author={Tong Shao and Zhuotao Tian and Hang Zhao and Jingyong Su},
  booktitle={European Conference on Computer Vision},
  organization={Springer},
  year={2024}
}