Awesome

Extract Free Dense Labels from CLIP [Project Page]

        ███╗   ███╗ █████╗ ███████╗██╗  ██╗ ██████╗██╗     ██╗██████╗
        ████╗ ████║██╔══██╗██╔════╝██║ ██╔╝██╔════╝██║     ██║██╔══██╗
        ██╔████╔██║███████║███████╗█████╔╝ ██║     ██║     ██║██████╔╝
        ██║╚██╔╝██║██╔══██║╚════██║██╔═██╗ ██║     ██║     ██║██╔═══╝
        ██║ ╚═╝ ██║██║  ██║███████║██║  ██╗╚██████╗███████╗██║██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝ ╚═════╝╚══════╝╚═╝╚═╝

This is the code for our paper: Extract Free Dense Labels from CLIP.

This repo is a fork of mmsegmentation. So the installation and data preparation is pretty similar.

Installation

Step 0. Install PyTorch and Torchvision following official instructions, e.g.,

pip install torch torchvision
# FYI, we're using torch==1.9.1 and torchvision==0.10.1

Step 1. Install MMCV using MIM.

pip install -U openmim
mim install mmcv-full

Step 2. Install CLIP.

pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git

Step 3. Install MaskCLIP.

git clone https://github.com/chongzhou96/MaskCLIP.git
cd MaskCLIP
pip install -v -e .
# "-v" means verbose, or more output
# "-e" means installing a project in editable mode,
# thus any local modifications made to the code will take effect without reinstallation.

Dataset Preparation

Please refer to dataset_prepare.md. In our paper, we experiment with Pascal VOC, Pascal Context, and COCO Stuff 164k.

MaskCLIP

MaskCLIP doesn't require any training. We only need to (1) download and convert the CLIP model and (2) prepare the text embeddings of the objects of interest.

Step 0. Download and convert the CLIP models, e.g.,

mkdir -p pretrain
python tools/maskclip_utils/convert_clip_weights.py --model ViT16 --backbone
# Other options for model: RN50, RN101, RN50x4, RN50x16, RN50x64, ViT32, ViT16, ViT14

Step 1. Prepare the text embeddings of the objects of interest, e.g.,

python tools/maskclip_utils/prompt_engineering.py --model ViT16 --class-set context
# Other options for model: RN50, RN101, RN50x4, RN50x16, ViT32, ViT16
# Other options for class-set: voc, context, stuff
# Actually, we've played around with many more interesting target classes. (See prompt_engineering.py)

Step 2. Get quantitative results (mIoU):

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --eval mIoU
# e.g., python tools/test.py configs/maskclip/maskclip_vit16_520x520_pascal_context_59.py pretrain/ViT16_clip_backbone.pth --eval mIoU

Step 3. (optional) Get qualitative results:

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --show-dir ${OUTPUT_DIR}
# e.g., python tools/test.py configs/maskclip/maskclip_vit16_520x520_pascal_context_59.py pretrain/ViT16_clip_backbone.pth --show-dir output/

MaskCLIP+

MaskCLIP+ trains another segmentation model with pseudo labels extracted from MaskCLIP.

Step 0. Download and convert the CLIP models, e.g.,

mkdir -p pretrain
python tools/maskclip_utils/convert_clip_weights.py --model ViT16
# Other options for model: RN50, RN101, RN50x4, RN50x16, RN50x64, ViT32, ViT16, ViT14

Step 1. Prepare the text embeddings of the target dataset, e.g.,

python tools/maskclip_utils/prompt_engineering.py --model ViT16 --class-set context
# Other options for model: RN50, RN101, RN50x4, RN50x16, ViT32, ViT16
# Other options for class-set: voc, context, stuff

Train. Depending on your setup (single/mutiple GPU(s), multiple machines), the training script can be different. Here, we give an example of multiple GPUs on a single machine. For more infomation, please refer to train.md.

sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM}
# e.g., sh tools/dist_train.sh configs/maskclip_plus/zero_shot/maskclip_plus_r50_deeplabv3plus_r101-d8_480x480_40k_pascal_context.py 4

Inference. See step 2 and step 3 under the MaskCLIP section. (We will release the trained models soon.)

Citation

If you use MaskCLIP or this code base in your work, please cite

@InProceedings{zhou2022maskclip,
    author = {Zhou, Chong and Loy, Chen Change and Dai, Bo},
    title = {Extract Free Dense Labels from CLIP},
    booktitle = {European Conference on Computer Vision (ECCV)},
    year = {2022}
}

Contact

For questions about our paper or code, please contact Chong Zhou.