CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Introduction

This is an official release of the paper CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction.

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction,
Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy
BibTeX

TODO

Installation

This project is adapted from OpenCLIP v2.16.0. Run the following command to install the package:

pip install -e . -v
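
To sanity-check the installation, you can import the adapted package and list the registered model configs. This is a minimal sketch; the exact config names registered by this fork may differ from upstream OpenCLIP.

```python
# Quick check that the adapted OpenCLIP package is importable
# (a sketch; the registered model names may differ in this fork).
import open_clip

print(open_clip.list_models()[:10])      # a few of the registered architectures
print(open_clip.list_pretrained()[:5])   # (architecture, pretrained tag) pairs
```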

Data Preparation

The main experiments are conducted on images from the COCO and LVIS datasets. Please prepare the datasets and organize them as follows:

CLIPSelf/
├── data
    ├── coco
        ├── annotations
            ├── instances_train2017.json  # the box annotations are not used
            ├── panoptic_val2017.json
            ├── panoptic_val2017     # panoptic masks
        ├── train2017
        ├── val2017
        ├── coco_pseudo_4764.json    # to run RegionCLIP
        ├── coco_proposals.json      # to run CLIPSelf with region proposals
    ├── lvis_v1
        ├── annotations
            ├── lvis_v1_train.json  # the box annotations are not used
        ├── train2017    # the same images as COCO
        ├── val2017      # the same images as COCO

To run CLIPSelf with region proposals or RegionCLIP (which uses region-text pairs), obtain coco_proposals.json or coco_pseudo_4764.json, respectively, from Drive, and put the json files under data/coco.
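
As a quick sanity check of the layout above, a small script like the following can be used. This is a sketch based on the directory tree shown; adjust the root if your data lives elsewhere.

```python
# Verify the expected COCO/LVIS layout (a sketch following the tree above).
from pathlib import Path

root = Path("data")
expected = [
    "coco/annotations/instances_train2017.json",
    "coco/annotations/panoptic_val2017.json",
    "coco/annotations/panoptic_val2017",
    "coco/train2017",
    "coco/val2017",
    "coco/coco_pseudo_4764.json",   # only needed for RegionCLIP
    "coco/coco_proposals.json",     # only needed for CLIPSelf with proposals
    "lvis_v1/annotations/lvis_v1_train.json",
    "lvis_v1/train2017",
    "lvis_v1/val2017",
]
for rel in expected:
    status = "ok" if (root / rel).exists() else "MISSING"
    print(f"{status:7s} {rel}")
```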

Run

Original Models

To run CLIPSelf, first obtain the original models from EVA-02-CLIP and put them under checkpoints/ as follows:

CLIPSelf/
├── checkpoints
    ├── EVA02_CLIP_B_psz16_s8B.pt
    ├── EVA02_CLIP_L_336_psz14_s6B.pt
    
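As a quick check that a downloaded checkpoint loads, the snippet below builds the model through the open_clip API and runs a standard image-text matching forward pass. This is a sketch: the model name "EVA02-CLIP-B-16" and the example image path are assumptions; check open_clip.list_models() for the exact name registered by this fork.

```python
# Load an original EVA-02-CLIP checkpoint and score a few text prompts
# (a sketch; "EVA02-CLIP-B-16" is an assumed model name for this fork).
import torch
import open_clip
from PIL import Image

model_name = "EVA02-CLIP-B-16"                      # assumption: verify with open_clip.list_models()
ckpt = "checkpoints/EVA02_CLIP_B_psz16_s8B.pt"
model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=ckpt)
tokenizer = open_clip.get_tokenizer(model_name)
model.eval()

image = preprocess(Image.open("data/coco/val2017/000000000139.jpg")).unsqueeze(0)  # any local image
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    print((image_features @ text_features.T).softmax(dim=-1))
```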

Training and Testing

We provide the scripts to train CLIPSelf and RegionCLIP under scripts/. They are summarized as follows:

| # | Model | Method | Proposals | Training Data | Script | Checkpoint |
|---|----------|------------|-----------|---------------|--------|------------|
| 1 | ViT-B/16 | CLIPSelf   | -         | COCO          | script | model      |
| 2 | ViT-B/16 | CLIPSelf   | +         | COCO          | script | model      |
| 3 | ViT-B/16 | RegionCLIP | +         | COCO          | script | model      |
| 4 | ViT-L/14 | CLIPSelf   | -         | COCO          | script | model      |
| 5 | ViT-L/14 | CLIPSelf   | +         | COCO          | script | model      |
| 6 | ViT-L/14 | RegionCLIP | +         | COCO          | script | model      |
| 7 | ViT-B/16 | CLIPSelf   | -         | LVIS          | script | model      |
| 8 | ViT-L/14 | CLIPSelf   | -         | LVIS          | script | model      |

For example, to refine ViT-B/16 with CLIPSelf using only image patches on COCO (entry 1 above), simply run:

bash scripts/train_clipself_coco_image_patches_eva_vitb16.sh    # 1

We also provide the checkpoints of the experiments listed above in Drive. They can be organized as follows:

CLIPSelf/
├── checkpoints
    ├── eva_vitb16_coco_clipself_patches.pt     # 1
    ├── eva_vitb16_coco_clipself_proposals.pt   # 2
    ├── eva_vitb16_coco_regionclip.pt           # 3
    ├── eva_vitl14_coco_clipself_patches.pt     # 4
    ├── eva_vitl14_coco_clipself_proposals.pt   # 5
    ├── eva_vitl14_coco_regionclip.pt           # 6
    ├── eva_vitb16_lvis_clipself_patches.pt     # 7
    ├── eva_vitl14_lvis_clipself_patches.pt     # 8

To evaluate a ViT-B/16 model, run:

bash scripts/test_eva_vitb16_macc_boxes_masks.sh name_of_the_test path/to/checkpoint.pt

To evaluate a ViT-L/14 model, run:

bash scripts/test_eva_vitl14_macc_boxes_masks.sh name_of_the_test path/to/checkpoint.pt

F-ViT

Go to the folder CLIPSelf/F-ViT and follow the instructions in the README there.

License

This project is licensed under NTU S-Lab License 1.0.

Citation

@article{wu2023clipself,
    title={CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction},
    author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Xiangtai Li and Wentao Liu and Chen Change Loy},
    journal={arXiv preprint arXiv:2310.01403},
    year={2023}
}

Acknowledgement

We thank OpenCLIP, EVA-CLIP and MMDetection for their valuable codebases.