Open-Vocabulary Segmentation with Semantic-Assisted Calibration [CVPR 2024]

Yong Liu*, Sule Bai*, Guanbin Li, Yitong Wang, Yansong Tang (*equal contribution)

This repository contains the official implementation of "Open-Vocabulary Segmentation with Semantic-Assisted Calibration".

Paper

📖 Pipeline & Results

If you find any bugs caused by our carelessness in organizing the code, feel free to contact us and point them out!

Installation

Please see the installation guide.

Data Preparation

Please follow the instructions of ov-seg to prepare the training and test data. The data should be organized as follows:

$DETECTRON2_DATASETS/
  coco/                 # COCOStuff-171
  ADEChallengeData2016/ # ADE20K-150
  ADE20K_2021_17_01/    # ADE20K-847
  VOCdevkit/
    VOC2012/            # PASCALVOC-20
    VOC2010/            # PASCALContext-59, PASCALContext-459
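
Before launching training or evaluation, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of this repository: the helper name `missing_dataset_dirs` is hypothetical, and it only checks that the top-level directories listed above exist, not that their contents are complete.

```python
import os
from pathlib import Path

# Directory names copied from the layout above. Nothing is assumed about
# the files inside each directory.
EXPECTED_DIRS = [
    "coco",
    "ADEChallengeData2016",
    "ADE20K_2021_17_01",
    "VOCdevkit/VOC2012",
    "VOCdevkit/VOC2010",
]


def missing_dataset_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Detectron2 falls back to ./datasets when DETECTRON2_DATASETS is unset.
    root = os.environ.get("DETECTRON2_DATASETS", "datasets")
    missing = missing_dataset_dirs(root)
    if missing:
        print(f"Missing under {root}: {', '.join(missing)}")
    else:
        print(f"All expected dataset directories found under {root}")
```

If you only evaluate on a subset of the benchmarks, the directories for the other datasets can be ignored.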

Usage

Pretrained Weights
We provide the pretrained SCAN-VitL weights and the finetuned Contextual-shifted CLIP weights. Please download them from here.

Evaluation

python train_net.py --eval-only --config-file <CONFIG_FILE> --num-gpus <NUM_GPU> OUTPUT_DIR <OUTPUT_PATH> MODEL.WEIGHTS <TRAINED_MODEL_PATH>

Here is an example:

python train_net.py --num-gpus 8 --eval-only --config-file configs/scan_vitL.yaml MODEL.WEIGHTS ./SCAN.pth DATASETS.TEST \(\"ade20k_sem_seg_val\",\) MODEL.CLIP_ADAPTER.REPLACE_RATIO 0.05 MODEL.CLIP_ADAPTER.CLIP_ENSEMBLE_WEIGHT 0.75 MODEL.CLIP_ADAPTER.MASK_THR 0.55
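
The backslashes in the example above only protect the parentheses and quotes from the shell; the value that reaches `train_net.py` is the literal string `("ade20k_sem_seg_val",)`. If you launch evaluation from a Python script, passing the arguments as a list avoids the escaping altogether. The sketch below assumes the same flags and config keys as the example; the helper name `build_eval_command` and its default values are illustrative, not part of this repository.

```python
import shlex


def build_eval_command(dataset, weights="./SCAN.pth", num_gpus=8,
                       config_file="configs/scan_vitL.yaml", overrides=None):
    """Build the evaluation command as an argument list (no shell escaping)."""
    cmd = [
        "python", "train_net.py",
        "--num-gpus", str(num_gpus),
        "--eval-only",
        "--config-file", config_file,
        "MODEL.WEIGHTS", weights,
        # A one-element tuple literal, exactly as the shell would deliver it.
        "DATASETS.TEST", f'("{dataset}",)',
    ]
    # Extra KEY VALUE config overrides, appended in the order given.
    for key, value in (overrides or {}).items():
        cmd += [key, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_eval_command(
        "ade20k_sem_seg_val",
        overrides={
            "MODEL.CLIP_ADAPTER.REPLACE_RATIO": 0.05,
            "MODEL.CLIP_ADAPTER.CLIP_ENSEMBLE_WEIGHT": 0.75,
            "MODEL.CLIP_ADAPTER.MASK_THR": 0.55,
        },
    )
    # Print a shell-quoted version for inspection. To run it, pass the list
    # to subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```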

Training

Train the segmentation model:

python train_net.py --config-file <CONFIG_FILE> --num-gpus <NUM_GPU>

Here is an example:

python train_net.py --num-gpus 8 --config-file configs/scan_vitL.yaml

Fuse the segmentation model with the finetuned CLIP:

We provide the finetuned CLIP weights, so you can fuse them directly with the trained segmentation model to get the final model. The fuse command is:

cd tools
python replace_clip.py

Before running it, set "clip_ckpt" and "ovseg_model" in replace_clip.py to your CLIP checkpoint path and your segmentation model path.
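
To illustrate what fusing two checkpoints generally involves, here is a minimal sketch using plain dictionaries in place of real state dicts. It is not the contents of replace_clip.py: the function name `fuse_clip_weights` and the target prefix `clip_adapter.clip_model.` are assumptions made for illustration, and the actual key names in your checkpoints may differ. The `module.` prefix is the one that PyTorch's DistributedDataParallel wrapper adds to parameter names.

```python
def fuse_clip_weights(seg_state, clip_state,
                      target_prefix="clip_adapter.clip_model.",  # hypothetical
                      source_prefix="module."):
    """Overwrite the CLIP entries of seg_state with entries from clip_state.

    Returns (fused, skipped): a new dict with the replaced values, and the
    list of mapped keys that had no counterpart in seg_state.
    """
    fused = dict(seg_state)  # shallow copy, the input is left untouched
    skipped = []
    for key, value in clip_state.items():
        # Drop the wrapper prefix if present, then map into the
        # segmentation model's namespace.
        name = key[len(source_prefix):] if key.startswith(source_prefix) else key
        new_key = target_prefix + name
        if new_key in fused:
            fused[new_key] = value
        else:
            skipped.append(new_key)
    return fused, skipped


if __name__ == "__main__":
    # Toy stand-ins for the two checkpoints; real ones hold tensors.
    seg = {"clip_adapter.clip_model.visual.proj": "old", "sem_seg_head.weight": "seg"}
    clip = {"module.visual.proj": "new", "module.logit_scale": "scale"}
    fused, skipped = fuse_clip_weights(seg, clip)
    print(fused)
    print("no matching key for:", skipped)
```

Reporting the skipped keys is a deliberate choice: a long list usually means the prefixes do not match the checkpoint, which is easier to catch here than at evaluation time.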

(Optional) If you want to finetune the CLIP model yourself, please follow ov-seg to prepare the corresponding data. The finetuning command is:

cd open_clip_training/src
bash scripts/finetune_VitL_with_mask.sh

Cite

If you find our work helpful, we'd appreciate it if you could cite our paper in your work.

@article{liu2023open,
  title={Open-Vocabulary Segmentation with Semantic-Assisted Calibration},
  author={Liu, Yong and Bai, Sule and Li, Guanbin and Wang, Yitong and Tang, Yansong},
  journal={arXiv preprint arXiv:2312.04089},
  year={2023}
}