SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

News: this paper has been accepted by ECCV 2024

Official PyTorch implementation of SCLIP

[Figure: model components and our Correlative Self-Attention maps]

[Figure: open-vocabulary semantic segmentation samples]

Dependencies

This repo is built on top of CLIP and MMSegmentation. To run SCLIP, please install the following packages in your PyTorch environment. We recommend PyTorch==1.10.x for better compatibility with the MMSeg version below.

pip install openmim
mim install mmcv==2.0.1 mmengine==0.8.4 mmsegmentation==1.1.1
pip install ftfy regex yapf==0.40.1
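
After installation, a quick sanity check (a minimal sketch, assuming the pinned versions above installed cleanly) can confirm the environment:

import torch
import mmcv
import mmengine
import mmseg

# Expect torch 1.10.x, mmcv 2.0.1, mmengine 0.8.4, mmsegmentation 1.1.1
print(torch.__version__, mmcv.__version__, mmengine.__version__, mmseg.__version__)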

Datasets

We include the following dataset configurations in this repo: PASCAL VOC, PASCAL Context, Cityscapes, ADE20k, COCO-Stuff10k, and COCO-Stuff164k, plus three variants: VOC20 and Context59 (i.e., PASCAL VOC and PASCAL Context without the background category) and COCO-Object.

Please follow the MMSeg data preparation document to download and pre-process the datasets. The COCO-Object dataset can be converted from COCO-Stuff164k by executing the following command:

python datasets/cvt_coco_object.py PATH_TO_COCO_STUFF164K -o PATH_TO_COCO164K

Remember to modify the dataset paths in the config files in configs/cfg_DATASET.py.
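
In MMSegmentation-style configs, the dataset location is typically set through a data_root field; a hypothetical excerpt (the field name follows MMSeg conventions, so check the actual config files in this repo):

# configs/cfg_DATASET.py (illustrative excerpt)
data_root = '/path/to/your/dataset'  # root directory of the pre-processed dataset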

Run SCLIP

Single-GPU evaluation:

python eval.py --config ./configs/cfg_DATASET.py --workdir YOUR_WORK_DIR

Multi-GPU evaluation:

bash ./dist_test.sh ./configs/cfg_DATASET.py
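
For example, to evaluate on PASCAL VOC without the background category on a single GPU (the config name cfg_voc20.py is assumed here; check the configs/ directory for the exact file names):

python eval.py --config ./configs/cfg_voc20.py --workdir ./work_dirs/voc20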

Results

The performance of open-vocabulary inference can be affected by the text targets, i.e., the prompts and class names. This repo offers an easy way to explore them: you can modify the prompts in prompts/imagenet_template.py and the class names in configs/cls_DATASET.txt.
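
The prompt templates follow CLIP's prompt-ensemble style, where each template turns a class name into a full sentence. A minimal illustrative sketch (the actual list lives in prompts/imagenet_template.py):

# illustrative prompt templates in CLIP's ensemble style
templates = [
    lambda c: f'a photo of a {c}.',
    lambda c: f'a photo of the {c}.',
    lambda c: f'a painting of a {c}.',
]

# Following CLIP's zero-shot practice, each class name is formatted into every
# template, and the resulting text embeddings are typically averaged into one
# classifier weight per category.
sentences = [t('cat') for t in templates]
print(sentences)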

The repo automatically loads class names from the configs/cls_DATASET.txt file. Each category can have multiple class names; all names for one category share a single line in the file, separated by commas, as in the example below.
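
For example, a hypothetical class-name file could look like this, where the first line gives three synonymous names for a single category (the entries are illustrative, not the repo's actual class list):

aeroplane,airplane,plane
bicycle,bike
person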

With the default setup in this repo, you should get the following results:

Dataset                  mIoU
ADE20k                   16.45
Cityscapes               32.34
COCO-Object              33.52
COCO-Stuff10k            25.91
COCO-Stuff164k           22.77
PASCAL Context59         34.46
PASCAL Context60         31.74
PASCAL VOC (w/o. bg.)    81.54
PASCAL VOC (w. bg.)      59.63

Citation

@article{wang2023sclip,
  title={SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference},
  author={Wang, Feng and Mei, Jieru and Yuille, Alan},
  journal={arXiv preprint arXiv:2312.01597},
  year={2023}
}