DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Created by Yongming Rao*, Wenliang Zhao*, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu.

This repository contains the PyTorch implementation of DenseCLIP (CVPR 2022).

DenseCLIP is a framework for dense prediction that implicitly and explicitly leverages the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP into a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using contextual information from the image to prompt the language model, we enable the model to better exploit the pre-trained knowledge. Our method is model-agnostic: it can be applied to arbitrary dense prediction systems and to various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models.
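The pixel-text matching idea can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the function names `l2_normalize` and `pixel_text_score_map` are hypothetical, and the score is assumed to be the cosine similarity between each pixel embedding and each class-prompt embedding. In a PyTorch model the same contraction would more likely be a single `torch.einsum` over normalized tensors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def pixel_text_score_map(pixel_feats, text_feats):
    """Cosine similarity between every pixel embedding and every class embedding.

    pixel_feats: H x W x C nested lists (one C-dim embedding per pixel)
    text_feats:  K x C nested lists (one C-dim embedding per class prompt)
    returns:     K x H x W nested lists of scores in [-1, 1]
    """
    text_unit = [l2_normalize(t) for t in text_feats]
    pixel_unit = [[l2_normalize(p) for p in row] for row in pixel_feats]
    return [
        [[sum(a * b for a, b in zip(p, t)) for p in row] for row in pixel_unit]
        for t in text_unit
    ]


# Toy example: a 1x2 "image" with 2-dim embeddings and two class prompts.
pixels = [[[1.0, 0.0], [0.0, 2.0]]]
texts = [[3.0, 0.0], [0.0, 1.0]]
scores = pixel_text_score_map(pixels, texts)
print(scores)  # [[[1.0, 0.0]], [[0.0, 1.0]]]
```

Each of the K output maps has the spatial size of the feature map, so it can be read as a coarse per-class segmentation hint.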

Our code is based on mmsegmentation and mmdetection.

[Project Page] [arXiv]

Usage

Requirements

To use our code, first install mmcv-full and mmseg/mmdet following the official guidelines (mmseg, mmdet), then prepare the datasets accordingly.

Pre-trained CLIP Models

Download the pre-trained CLIP models (RN50.pt, RN101.pt, ViT-B-16.pt) and save them to the pretrained folder. The download links can be found in the official CLIP repo.
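Before launching a training run, it can help to confirm that the checkpoints are in place. The following is a minimal sketch, assuming the folder is named `pretrained` and sits in the current working directory. The helper `missing_checkpoints` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names from the list above; matching is case-insensitive as a precaution.
EXPECTED_CHECKPOINTS = ("RN50.pt", "RN101.pt", "ViT-B-16.pt")


def missing_checkpoints(folder="pretrained", expected=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint names that are not found in `folder`."""
    root = Path(folder)
    present = {p.name.lower() for p in root.iterdir()} if root.is_dir() else set()
    return [name for name in expected if name.lower() not in present]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing CLIP checkpoints:", ", ".join(missing))
    else:
        print("All CLIP checkpoints found.")
```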

Segmentation

Model Zoo

We provide DenseCLIP models for the Semantic FPN framework.

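| Model | FLOPs (G) | Params (M) | mIoU (SS) | mIoU (MS) | config | url |
|---|---|---|---|---|---|---|
| RN50-CLIP | 248.8 | 31.0 | 39.6 | 41.6 | config | - |
| RN50-DenseCLIP | 269.2 | 50.3 | 43.5 | 44.7 | config | Tsinghua Cloud |
| RN101-CLIP | 326.6 | 50.0 | 42.7 | 44.3 | config | - |
| RN101-DenseCLIP | 346.3 | 67.8 | 45.1 | 46.5 | config | Tsinghua Cloud |
| ViT-B-CLIP | 1037.4 | 100.8 | 49.4 | 50.3 | config | - |
| ViT-B-DenseCLIP | 1043.1 | 105.3 | 50.6 | 51.3 | config | Tsinghua Cloud |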
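To make the comparison in the table easier to read, the short sketch below computes the mIoU gain of each DenseCLIP model over its CLIP baseline. The numbers are copied from the table; the dictionary layout and the helper `miou_gains` are only illustrative.

```python
# (single-scale mIoU, multi-scale mIoU) copied from the Semantic FPN table above.
RESULTS = {
    "RN50": {"CLIP": (39.6, 41.6), "DenseCLIP": (43.5, 44.7)},
    "RN101": {"CLIP": (42.7, 44.3), "DenseCLIP": (45.1, 46.5)},
    "ViT-B": {"CLIP": (49.4, 50.3), "DenseCLIP": (50.6, 51.3)},
}


def miou_gains(results):
    """Return {backbone: (SS gain, MS gain)} of DenseCLIP over the CLIP baseline."""
    gains = {}
    for backbone, rows in results.items():
        base, dense = rows["CLIP"], rows["DenseCLIP"]
        # Round to one decimal to match the precision of the table.
        gains[backbone] = tuple(round(d - b, 1) for b, d in zip(base, dense))
    return gains


for backbone, (ss, ms) in miou_gains(RESULTS).items():
    print(f"{backbone}: +{ss} mIoU (SS), +{ms} mIoU (MS)")
# RN50: +3.9 mIoU (SS), +3.1 mIoU (MS)
# RN101: +2.4 mIoU (SS), +2.2 mIoU (MS)
# ViT-B: +1.2 mIoU (SS), +1.0 mIoU (MS)
```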

Training & Evaluation on ADE20K

To train the DenseCLIP model based on CLIP ResNet-50, run:

bash dist_train.sh configs/denseclip_fpn_res50_512x512_80k.py 8

To evaluate the performance with multi-scale testing, run:

bash dist_test.sh configs/denseclip_fpn_res50_512x512_80k.py /path/to/checkpoint 8 --eval mIoU --aug-test
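When evaluating several checkpoints, the command above can be assembled programmatically. This is a sketch only: `build_test_command` is a hypothetical helper, and it assumes that omitting `--aug-test` yields single-scale evaluation, which is the usual mmsegmentation behavior but is not stated in this README.

```python
import shlex


def build_test_command(config, checkpoint, gpus, multi_scale=True):
    """Assemble the dist_test.sh call shown above as an argument list."""
    cmd = ["bash", "dist_test.sh", config, checkpoint, str(gpus), "--eval", "mIoU"]
    if multi_scale:
        # --aug-test is the flag this README pairs with multi-scale testing.
        cmd.append("--aug-test")
    return cmd


cmd = build_test_command(
    "configs/denseclip_fpn_res50_512x512_80k.py", "/path/to/checkpoint", 8
)
print(shlex.join(cmd))
# bash dist_test.sh configs/denseclip_fpn_res50_512x512_80k.py /path/to/checkpoint 8 --eval mIoU --aug-test
```

The returned list can be passed directly to `subprocess.run`, which avoids shell-quoting issues if a checkpoint path contains spaces.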

To better measure the complexity of the models, we provide a tool based on fvcore to accurately compute the FLOPs of torch.einsum and other operations:

python get_flops.py /path/to/config --fvcore

You can also remove the --fvcore flag to obtain the FLOPs measured by mmcv for comparisons.
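To see why `torch.einsum` matters for the FLOPs count, the cost of a contraction can be estimated by hand. The sketch below assumes a naive evaluation in which every distinct index is one loop, and it counts one multiply-accumulate as one operation. The equation `bchw,bkc->bkhw` and the sizes are illustrative assumptions, not values taken from this repository, and `einsum_macs` is a hypothetical helper.

```python
from math import prod


def einsum_macs(equation, sizes):
    """Multiply-accumulate count of a naive einsum evaluation.

    Every distinct index letter on the input side contributes one loop,
    so the cost is the product of all index sizes.
    """
    inputs, _, _output = equation.partition("->")
    letters = {ch for ch in inputs if ch.isalpha()}
    return prod(sizes[ch] for ch in letters)


# Hypothetical sizes: batch 1, 1024 channels, a 16x16 feature map, 150 classes.
sizes = {"b": 1, "c": 1024, "h": 16, "w": 16, "k": 150}
macs = einsum_macs("bchw,bkc->bkhw", sizes)
print(macs)  # 39321600
```

A counter that does not handle einsum would miss these operations entirely, which is one plausible reason the two measurement modes can report different totals.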

Detection

Model Zoo

We provide models for both the RetinaNet and Mask R-CNN frameworks.

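RetinaNet

| Model | FLOPs (G) | Params (M) | box AP | config | url |
|---|---|---|---|---|---|
| RN50-CLIP | 265 | 38 | 36.9 | config | - |
| RN50-DenseCLIP | 285 | 60 | 37.8 | config | Tsinghua Cloud |
| RN101-CLIP | 341 | 57 | 40.5 | config | - |
| RN101-DenseCLIP | 360 | 78 | 41.1 | config | Tsinghua Cloud |

Mask R-CNN

| Model | FLOPs (G) | Params (M) | box AP | mask AP | config | url |
|---|---|---|---|---|---|---|
| RN50-CLIP | 301 | 44 | 39.3 | 36.8 | config | - |
| RN50-DenseCLIP | 327 | 67 | 40.2 | 37.6 | config | Tsinghua Cloud |
| RN101-CLIP | 377 | 63 | 42.2 | 38.9 | config | - |
| RN101-DenseCLIP | 399 | 84 | 42.6 | 39.6 | config | Tsinghua Cloud |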

Training & Evaluation on COCO

To train our DenseCLIP-RN50 using the RetinaNet framework, run:

bash dist_train.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py 8

To evaluate the box AP of RN50-DenseCLIP (RetinaNet), run:

bash dist_test.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py /path/to/checkpoint 8 --eval bbox

To evaluate both the box AP and the mask AP of RN50-DenseCLIP (Mask R-CNN), run:

bash dist_test.sh configs/mask_rcnn_denseclip_r50_fpn_1x_coco.py /path/to/checkpoint 8 --eval bbox segm

License

MIT License

Citation

If you find our work useful in your research, please consider citing:

@inproceedings{rao2021denseclip,
  title={DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting},
  author={Rao, Yongming and Zhao, Wenliang and Chen, Guangyi and Tang, Yansong and Zhu, Zheng and Huang, Guan and Zhou, Jie and Lu, Jiwen},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}