Object Centric Open Vocabulary Detection (NeurIPS 2022)

Official repository of paper titled "Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection".

Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan

Website | Paper | Colab Demo | Video | Slides


:rocket: News

<hr />

main figure

<p align="justify"> Abstract: Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) include pretrained CLIP models and image-level supervision. We note that both these modes of supervision are not optimally aligned for the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 40.3 AP50 on novel classes, an absolute 11.9 gain over the previous best performance. For LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP for rare categories and 3.4 overall. </p>

Main Contributions

  1. Region-based Knowledge Distillation (RKD) adapts image-centric language representations to be object-centric.
  2. Pseudo Image-level Supervision (PIS) uses weak image-level supervision from pretrained multi-modal ViTs (MAVL) to improve the generalization of the detector to novel classes.
  3. A weight transfer function efficiently combines the above two components.
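The intuition behind the weight transfer function can be shown with a toy sketch. This is not the paper's implementation; the shapes, the identity transfer map, and all variable names below are illustrative assumptions. The point is that one head's weights are derived from the other's through a transfer map, rather than learned independently, so the two alignment strategies can share their strengths:

```python
import numpy as np

# Toy sketch of a weight transfer function (illustrative only, NOT the
# paper's implementation). One head's weights are *derived* from the
# other's via a transfer map, instead of being trained independently.

rng = np.random.default_rng(0)

d_emb, n_classes = 512, 65                    # embedding dim and class count (assumed)
W_rkd = rng.normal(size=(n_classes, d_emb))   # weights shaped by region-based distillation

def weight_transfer(W, T):
    """Produce the second head's weights by applying transfer map T."""
    return W @ T                              # (n_classes, d_emb) @ (d_emb, d_emb)

T = np.eye(d_emb)                             # identity map: simplest possible transfer
W_pis = weight_transfer(W_rkd, T)             # weights handed to the other branch

assert W_pis.shape == W_rkd.shape
```

In the actual method the transfer map is learned, so gradients flowing through one component also shape the other.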
<hr />

Installation

The code is tested with PyTorch 1.10.0 and CUDA 11.3. After cloning the repository, follow the installation steps in INSTALL.md. All of our models are trained on 8 A100 GPUs.

<hr />

Demo: Create your own custom detector

Open In Colab Check out the demo in our interactive Colab notebook and create your own custom detector with your own class names.
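As a rough illustration of what "your own class names" means in a CLIP-based open-vocabulary detector, the sketch below builds classifier weights directly from text prompts. This is not the repository's actual API: `embed_text` is a deterministic stand-in for CLIP's text encoder, and all names here are hypothetical.

```python
import hashlib
import numpy as np

# Sketch of the custom-vocabulary idea (NOT the repo's actual API):
# the classifier weights are simply text embeddings of the class names,
# so changing the vocabulary means re-embedding a new list of names.

def embed_text(prompt, dim=512):
    """Stand-in for CLIP's text encoder: deterministic unit-norm vector."""
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)              # unit-norm, like CLIP embeddings

custom_classes = ["zebra", "skateboard", "traffic cone"]
# Prompt-template the names, as CLIP-style detectors commonly do.
classifier_weights = np.stack(
    [embed_text(f"a photo of a {name}") for name in custom_classes]
)

# A region feature is scored by cosine similarity against each class row.
region_feature = embed_text("a photo of a zebra")
scores = classifier_weights @ region_feature
print(custom_classes[int(np.argmax(scores))])  # → zebra (identical embedding by construction)
```

With a real text encoder, the same mechanism lets the detector classify regions into any vocabulary supplied at inference time, with no retraining of the box head.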

Results

We present the performance of our object-centric open-vocabulary detector, which achieves state-of-the-art results on the open-vocabulary COCO and LVIS benchmarks. For COCO, base and novel categories are shown in purple and green, respectively.

Open-vocabulary COCO

Effect of individual components in our method. Our weight transfer method provides complementary gains from RKD and PIS, achieving superior results compared to naively adding both components.

| Name | AP<sub>novel</sub> | AP<sub>base</sub> | AP | Train-time | Download |
|---|---|---|---|---|---|
| Base-OVD-RCNN-C4 | 1.7 | 53.2 | 39.6 | 8h | model |
| COCO_OVD_Base_RKD | 21.2 | 54.7 | 45.9 | 8h | model |
| COCO_OVD_Base_PIS | 30.4 | 52.6 | 46.8 | 8.5h | model |
| COCO_OVD_RKD_PIS | 31.5 | 52.8 | 47.2 | 8.5h | model |
| COCO_OVD_RKD_PIS_WeightTransfer | 36.6 | 54.0 | 49.4 | 8.5h | model |
| COCO_OVD_RKD_PIS_WeightTransfer_8x | 36.9 | 56.6 | 51.5 | 2.5 days | model |

New LVIS Baseline

Our Mask R-CNN based LVIS baseline (mask_rcnn_R50FPN_CLIP_sigmoid) achieves 12.2 rare-class and 20.9 overall AP and trains in only 4.5 hours on 8 A100 GPUs. We believe this provides a strong baseline for future research in the LVIS OVD setting.

| Name | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | AP | Epochs |
|---|---|---|---|---|---|
| PromptDet Baseline | 7.4 | 17.2 | 26.1 | 19.0 | 12 |
| ViLD-text | 10.1 | 23.9 | 32.5 | 24.9 | 384 |
| Ours Baseline | 12.2 | 19.4 | 26.4 | 20.9 | 12 |
<br/>

Open-vocabulary LVIS

| Name | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | AP | Train-time | Download |
|---|---|---|---|---|---|---|
| mask_rcnn_R50FPN_CLIP_sigmoid | 12.2 | 19.4 | 26.4 | 20.9 | 4.5h | model |
| LVIS_OVD_Base_RKD | 15.2 | 20.2 | 27.3 | 22.1 | 4.5h | model |
| LVIS_OVD_Base_PIS | 17.0 | 21.2 | 26.1 | 22.4 | 5h | model |
| LVIS_OVD_RKD_PIS | 17.3 | 20.9 | 25.5 | 22.1 | 5h | model |
| LVIS_OVD_RKD_PIS_WeightTransfer | 17.1 | 21.4 | 26.7 | 22.8 | 5h | model |
| LVIS_OVD_RKD_PIS_WeightTransfer_8x | 21.1 | 25.0 | 29.1 | 25.9 | 1.5 days | model |

t-SNE plots

tSNE_plots

<hr />

Training and Evaluation

To train or evaluate, first prepare the required datasets.

To train a model, run the below command with the corresponding config file.

```shell
python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml
```

Note: Some training runs are initialized from the supervised-base or RKD models. Download the corresponding pretrained models and place them under $object-centric-ovd/saved_models/.

To evaluate a pretrained model, run

```shell
python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
```
<hr />

Citation

If you use our work, please consider citing:

```bibtex
@inproceedings{Hanoona2022Bridging,
    title={Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection},
    author={Rasheed, Hanoona and Maaz, Muhammad and Khattak, Muhammad Uzair and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={36th Conference on Neural Information Processing Systems (NeurIPS)},
    year={2022}
}

@inproceedings{Maaz2022Multimodal,
    title={Class-agnostic Object Detection with Multi-modal Transformer},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz and Anwer, Rao Muhammad and Yang, Ming-Hsuan},
    booktitle={17th European Conference on Computer Vision (ECCV)},
    year={2022},
    organization={Springer}
}
```

Contact

If you have any questions, please create an issue on this repository or contact hanoona.bangalath@mbzuai.ac.ae or muhammad.maaz@mbzuai.ac.ae.

References

Our RKD and PIS methods utilize the MViT model Multiscale Attention ViT with Late fusion (MAVL), proposed in Class-agnostic Object Detection with Multi-modal Transformer (ECCV 2022). Our code is based on the Detic repository. We thank the authors for releasing their code.