Awesome

Taming Self-Training for Open-Vocabulary Object Detection

Official implementation of online self-training and a split-and-fusion (SAF) head for Open-Vocabulary Object Detection (OVD), SAS-Det for short. This project was named as Improving Pseudo Labels for Open-Vocabulary Object Detection.

arXiv

Installation

Our project is developed on Detectron2. Please follow the official installation instructions, OR the following instructions.

# create new environment
conda create -n sas_det python=3.8
conda activate sas_det

# install pytorch
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch

# install Detectron2 from a local clone
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2

Install CLIP

# install CLIP
pip install scipy
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git

Datasets

Please follow RegionCLIP's dataset instructions to prepare COCO and LVIS datasets.
Download and put metadata for datasets in the folder datasets (i.e., $DETECTRON2_DATASETS used in the last step), which will be used in our evaluation and training.

Download pretrained weights

Download various RegionCLIP's pretrained weights. Check here for more details. Create a new folder pretrained_ckpt to put those weights. In this repository, regionclip, concept_emb and rpn will be used.
Download our pretrained weights and put them in corresponding folders in pretrained_ckpt. Our pretrained weights includes:
- r50_3x_pre_RegCLIP_cocoRPN_2: RPN weights pretrained only with COCO Base categories. This is used for experiments on COCO to avoid potential data leakage.
- concept_emb: Complementary to RegionCLIP's concept_emb.

Evaluation with released weights

Results on COCO-OVD

<table><tbody>   <th valign="bottom">Configs</th> <th valign="bottom">Novel AP</th> <th valign="bottom">Base AP</th> <th valign="bottom">Overall AP</th>   <tr><td align="left"><a href="./sas_det/configs/regionclip/COCO-InstanceSegmentation/customized/CLIP_fast_rcnn_R_50_C4_ovd_PLs.yaml">w/o SAF head</a></td> <td align="center">31.4</td> <td align="center">55.7</td> <td align="center">49.4</td> </tr>  <tr><td align="left"><a href="./sas_det/configs/ovd_coco_R50_C4_ensemble_PLs.yaml">with SAF head</a></td> <td align="center">37.4</td> <td align="center">58.5</td> <td align="center">53.0</td> </tr> </tbody></table> <details> <summary> Evaluation without the SAF Head (baseline in the paper), </summary>

python3 ./test_net.py \
    --num-gpus 8 \
    --eval-only \
    --config-file ./sas_det/configs/regionclip/COCO-InstanceSegmentation/customized/CLIP_fast_rcnn_R_50_C4_ovd_PLs.yaml \
    MODEL.WEIGHTS ./pretrained_ckpt/sas_det/sas_det_coco_no_saf_head_baseline.pth \
    MODEL.CLIP.OFFLINE_RPN_CONFIG ./sas_det/configs/regionclip/COCO-InstanceSegmentation/mask_rcnn_R_50_C4_1x_ovd_FSD.yaml \
    MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_coco_48.pth \
    MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_65_cls_emb.pth \
    MODEL.CLIP.OPENSET_TEST_TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_65_cls_emb.pth \
    MODEL.ROI_HEADS.SOFT_NMS_ENABLED True \
    OUTPUT_DIR output/eval

</details> <details> <summary> Evaluation with the SAF Head, </summary>

python3 ./test_net.py \
    --num-gpus 8 \
    --eval-only \
    --config-file ./sas_det/configs/ovd_coco_R50_C4_ensemble_PLs.yaml \
    MODEL.WEIGHTS ./pretrained_ckpt/sas_det/sas_det_coco.pth \
    MODEL.CLIP.OFFLINE_RPN_CONFIG ./sas_det/configs/regionclip/COCO-InstanceSegmentation/mask_rcnn_R_50_C4_1x_ovd_FSD.yaml \
    MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_coco_48.pth \
    MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_48_base_cls_emb.pth \
    MODEL.CLIP.CONCEPT_POOL_EMB ./pretrained_ckpt/concept_emb/my_coco_48_base_17_cls_emb.pth \
    MODEL.CLIP.OPENSET_TEST_TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_65_cls_emb.pth \
    MODEL.ROI_HEADS.SOFT_NMS_ENABLED True \
    MODEL.ENSEMBLE.TEST_CATEGORY_INFO "./datasets/coco_ovd_continue_cat_ids.json" \
    MODEL.ENSEMBLE.ALPHA 0.3 MODEL.ENSEMBLE.BETA 0.7 \
    OUTPUT_DIR output/eval

</details>

Results on LVIS-OVD

<table><tbody>   <th valign="bottom">Configs</th> <th valign="bottom">APr</th> <th valign="bottom">APc</th> <th valign="bottom">APf</th> <th valign="bottom">AP</th>   <tr><td align="left"><a href="./sas_det/configs/ovd_lvis_R50_C4_ensemble_PLs.yaml">RN50-C4 as backbone</a></td> <td align="center">20.1</td> <td align="center">27.1</td> <td align="center">32.9</td> <td align="center">28.1</td> </tr>  <tr><td align="left"><a href="./sas_det/configs/ovd_lvis_R50_C4_ensemble_PLs.yaml">RN50x4-C4 as backbone</a></td> <td align="center">29.0</td> <td align="center">32.3</td> <td align="center">36.8</td> <td align="center">33.5</td> </tr> </tbody></table> <details> <summary> Evaluation with RN50-C4 as the backbone, </summary>

python3 ./test_net.py \
    --num-gpus 8 \
    --eval-only \
    --config-file ./sas_det/configs/ovd_lvis_R50_C4_ensemble_PLs.yaml \
    MODEL.WEIGHTS ./pretrained_ckpt/sas_det/sas_det_lvis_r50.pth \
    MODEL.CLIP.OFFLINE_RPN_CONFIG ./sas_det/configs/regionclip/LVISv1-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
    MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_lvis_866_lsj.pth \
    MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/lvis_866_base_cls_emb.pth \
    MODEL.CLIP.CONCEPT_POOL_EMB ./pretrained_ckpt/concept_emb/my_lvis_866_base_337_cls_emb.pth \
    MODEL.CLIP.OPENSET_TEST_TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/lvis_1203_cls_emb.pth \
    MODEL.CLIP.OFFLINE_RPN_LSJ_PRETRAINED True \
    MODEL.ENSEMBLE.TEST_CATEGORY_INFO "./datasets/lvis_ovd_continue_cat_ids.json" \
    MODEL.ENSEMBLE.ALPHA 0.33 MODEL.ENSEMBLE.BETA 0.67 \
    OUTPUT_DIR output/eval

</details> <details> <summary> Evaluation with RN50x4-C4 as the backbone, </summary>

python3 ./test_net.py \
    --num-gpus 8 \
    --eval-only \
    --config-file ./sas_det/configs/ovd_lvis_R50_C4_ensemble_PLs.yaml \
    MODEL.WEIGHTS ./pretrained_ckpt/sas_det/sas_det_lvis_r50x4.pth \
    MODEL.CLIP.OFFLINE_RPN_CONFIG ./sas_det/configs/regionclip/LVISv1-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
    MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_lvis_866_lsj.pth \
    MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/lvis_866_base_cls_emb_rn50x4.pth \
    MODEL.CLIP.CONCEPT_POOL_EMB ./pretrained_ckpt/concept_emb/my_lvis_866_base_337_cls_emb_rn50x4.pth \
    MODEL.CLIP.OPENSET_TEST_TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/lvis_1203_cls_emb_rn50x4.pth \
    MODEL.CLIP.OFFLINE_RPN_LSJ_PRETRAINED True \
    MODEL.CLIP.TEXT_EMB_DIM 640 \
    MODEL.RESNETS.DEPTH 200 \
    MODEL.ROI_BOX_HEAD.POOLER_RESOLUTION 18 \
    MODEL.ROI_MASK_HEAD.POOLER_RESOLUTION 18 \
    MODEL.ENSEMBLE.TEST_CATEGORY_INFO "./datasets/lvis_ovd_continue_cat_ids.json" \
    MODEL.ENSEMBLE.ALPHA 0.33 MODEL.ENSEMBLE.BETA 0.67 \
    OUTPUT_DIR output/eval

</details>

Acknowledgement

This repository was built on top of Detectron2, RegionCLIP, and VLDet. We thank the effort from our community.