Enhancing Novel Object Detection via Cooperative Foundational Models

Rohit K Bharadwaj, Muzammal Naseer, Salman Khan, Fahad Khan

Official code for our paper "Enhancing Novel Object Detection via Cooperative Foundational Models"

:rocket: News

<hr>

Abstract: In this work, we address the challenging and emergent problem of novel object detection (NOD), focusing on the accurate detection of both known and novel object categories during inference. Traditional object detection algorithms are inherently closed-set, limiting their capability to handle NOD. We present a novel approach to transform existing closed-set detectors into open-set detectors. This transformation is achieved by leveraging the complementary strengths of pre-trained foundational models, specifically CLIP and SAM, through our cooperative mechanism. Furthermore, by integrating this mechanism with state-of-the-art open-set detectors such as GDINO, we establish new benchmarks in object detection performance. Our method achieves 17.42 mAP in novel object detection and 42.08 mAP for known objects on the challenging LVIS dataset. Adapting our approach to the COCO OVD split, we surpass the current state-of-the-art by a margin of 7.2 AP<sub>50</sub> for novel classes.

:trophy: Achievements and Features

:hammer_and_wrench: Setup and Installation

All code in this repository was developed with python=3.8.15 and torch=1.10.1. To replicate the results reported in the paper and in this repository, we recommend setting up your conda environment with the steps below.

  1. Clone this repository to your local machine:
    git clone git@github.com:rohit901/cooperative-foundational-models.git

    or

    git clone https://github.com/rohit901/cooperative-foundational-models.git
  2. Change the current directory to the main project folder (cooperative-foundational-models):
    cd cooperative-foundational-models
  3. Install the project dependencies and libraries by creating the conda environment defined in the .yml file:
    conda env create -f environment.yml
  4. Activate the newly created conda environment:
    conda activate coop_foundation_models
  5. Install the Detectron2 v0.6 library via pip:
    python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
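
After the installation steps above, it can help to confirm that the environment matches the versions this README states. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_environment` are hypothetical, and the prefix comparison assumes that installed wheels may carry a local suffix such as `+cu113`.

```python
import sys
from importlib import metadata

# Versions stated in this README (torch 1.10.1, Detectron2 v0.6).
EXPECTED = {"torch": "1.10.1", "detectron2": "0.6"}


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED):
    """Map each package name to a tuple (expected, found, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        # Assumption: a local suffix such as "1.10.1+cu113" still counts as a match.
        matches = found is not None and found.startswith(wanted)
        report[package] = (wanted, found, matches)
    return report


if __name__ == "__main__":
    print("python", sys.version.split()[0], "(README uses 3.8.15)")
    for package, (wanted, found, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: expected {wanted}, found {found} -> {status}")
```

Run it inside the activated coop_foundation_models environment; a MISMATCH line only flags a difference from the versions listed here and does not by itself mean the code will fail.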

Datasets

To download and setup the required datasets used in this work, please follow these steps:

  1. Download the COCO2017 dataset from the official website: https://cocodataset.org/#download. Specifically, download the 2017 Train images, 2017 Val images, 2017 Test images, and the 2017 Train/Val annotations.
  2. Download the LVIS v1.0 annotations from: https://www.lvisdataset.org/dataset. There is no need to download images from this website, as LVIS uses the same COCO2017 images. Specifically, download the annotation files for the training set (1GB) and the validation set (192 MB).
  3. Download the extra/custom annotation files for the COCO open-vocabulary splits from COCO-OVD-Annotations; specifically, download both ovd_instances_train2017_base.json and ovd_instances_val2017_basetarget.json.
  4. Download the extra/custom annotation file for the lvis_val_subset dataset from LVIS-Val-Subset; specifically, download lvis_v1_val_subset.json.
  5. Detectron2 requires the datasets to be arranged in a specific folder structure. It locates them through the environment variable DETECTRON2_DATASETS, which must be set to the path of the directory containing all the datasets. The file structure of DETECTRON2_DATASETS should be as follows:

The above file structure can also be viewed via the OneDrive link. The value of DETECTRON2_DATASETS, or of detectron2_dir in our code, should therefore be the absolute path to the datasets directory that follows this structure.
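
Before running any script, it may be useful to confirm that the custom annotation files named in the steps above are present somewhere under the datasets directory. The sketch below is illustrative only: `find_annotation_files` is a hypothetical helper, and it searches recursively precisely because it makes no assumption about where each file sits inside the required structure.

```python
import os
from pathlib import Path

# Annotation files named in the download steps of this README.
REQUIRED_FILES = [
    "ovd_instances_train2017_base.json",
    "ovd_instances_val2017_basetarget.json",
    "lvis_v1_val_subset.json",
]


def find_annotation_files(datasets_root, names=REQUIRED_FILES):
    """Return {file name: first matching path, or None} for files under the root."""
    root = Path(datasets_root)
    found = {}
    for name in names:
        # rglob searches all subdirectories, so no particular layout is assumed.
        matches = sorted(root.rglob(name))
        found[name] = matches[0] if matches else None
    return found


def main():
    root = os.environ.get("DETECTRON2_DATASETS")
    if root is None:
        print("DETECTRON2_DATASETS is not set")
        return
    if not os.path.isabs(root):
        print("warning: DETECTRON2_DATASETS should be an absolute path")
    for name, path in find_annotation_files(root).items():
        print(f"{name}: {path if path else 'MISSING'}")


if __name__ == "__main__":
    main()
```

This only checks that the files exist; it does not verify that they are in the locations Detectron2 expects.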

Model Weights

All the pre-trained model weights can be downloaded from this link: model weights. The folder contains the following model weights:

:mag_right: Novel Object Detection on LVIS Val Dataset

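| Method | Mask-RCNN | GDINO | VLM | Novel AP | Known AP | All AP |
|---|---|---|---|---|---|---|
| K-Means | - | - | - | 0.20 | 17.77 | 1.55 |
| Weng et al | - | - | - | 0.27 | 17.85 | 1.62 |
| ORCA | - | - | - | 0.49 | 20.57 | 2.03 |
| UNO | - | - | - | 0.61 | 21.09 | 2.18 |
| RNCDL | V1 | - | - | 5.42 | 25.00 | 6.92 |
| GDINO | - | ✓ | - | 13.47 | 37.13 | 15.30 |
| Ours | V2 | ✓ | SigLIP | 17.42 | 42.08 | 19.33 |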

Table 1: Comparison of object detection performance using mAP on the lvis_val dataset.

To replicate our results from the above table (i.e. Table 1 from the main paper):

  1. Modify the scripts/novel_object_detection/params.json file:
    • Edit the key detectron2_dir and set it following the instructions in Datasets
    • Edit the key sam_checkpoint and set it to the path of the downloaded file SAM_weights.pth
    • Edit the key gdino_checkpoint and set it to the path of the downloaded file GDINO_weights.pth
    • Edit the key rcnn_weight_dir and set it to the path of the downloaded folder maskrcnn_v2 [NOTE: DO NOT put a trailing slash]
  2. Run the following script from the main project directory:
    python scripts/novel_object_detection/main.py
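
The params.json edits in step 1 can also be scripted. The sketch below assumes that params.json is a flat JSON object that already contains the four keys listed above; `update_params` is a hypothetical helper and the paths in the usage comment are placeholders. It strips a trailing slash from rcnn_weight_dir, as the note above requires, and leaves all other keys untouched.

```python
import json
from pathlib import Path


def update_params(params_path, updates):
    """Merge `updates` into the JSON object stored at `params_path` and save it."""
    path = Path(params_path)
    params = json.loads(path.read_text())
    for key, value in updates.items():
        # Refuse unknown keys so that a typo does not silently add a new entry.
        if key not in params:
            raise KeyError(f"{key!r} is not a key in {path}")
        # The README asks for no trailing slash on rcnn_weight_dir.
        if key == "rcnn_weight_dir":
            value = str(value).rstrip("/")
        params[key] = value
    path.write_text(json.dumps(params, indent=4) + "\n")
    return params


# Example usage (placeholder paths, replace with your own absolute paths):
# update_params(
#     "scripts/novel_object_detection/params.json",
#     {
#         "detectron2_dir": "/abs/path/to/datasets",
#         "sam_checkpoint": "/abs/path/to/SAM_weights.pth",
#         "gdino_checkpoint": "/abs/path/to/GDINO_weights.pth",
#         "rcnn_weight_dir": "/abs/path/to/maskrcnn_v2",
#     },
# )
```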
    

The above script periodically saves its prediction outputs to the outputs directory, which is created automatically in the project-level folder (i.e. cooperative-foundational-models/outputs). When the script finishes, the results are printed to the console. In addition, the final combined predictions for all 19809 images in the LVIS val dataset are saved as instances_predictions.pth, which can be used with scripts/novel_object_detection/evaluate_results_from_predictions.py to compute the final results.

NOTE: With the code in this repository, our method achieves a slightly better overall result (All AP) than reported in the paper, with higher Known AP and marginally lower Novel AP:

| Method | Known AP | Novel AP | All AP |
|---|---|---|---|
| Ours (Paper) | 42.08 | 17.42 | 19.33 |
| Ours (GitHub) | 45.43 | 17.25 | 19.43 |

Inference on Custom Images

To detect the LVIS class vocabulary (1203 classes) in your custom images:

  1. Follow the previous instructions to set up the data, params.json, and the environment.
  2. Run python scripts/novel_object_detection/inference_single_image.py --image_path custom_image.jpg, replacing custom_image.jpg with the path to your own image.

By default, the above script generates a bounding box visualization of the 5 highest-scoring boxes. You can change the top-k visualization parameter by modifying the script. Alternatively, you can visualize the outputs based on a confidence score threshold.
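
The two visualization strategies mentioned above, top-k and confidence threshold, differ only in how boxes are filtered before drawing. The following is a generic sketch of both filters and is not taken from inference_single_image.py: the function names and the default threshold of 0.5 are assumptions chosen for illustration, and boxes can be any objects paired with a list of scores.

```python
def select_top_k(boxes, scores, k=5):
    """Keep the k highest-scoring boxes, ordered from highest to lowest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [boxes[i] for i in order], [scores[i] for i in order]


def select_by_threshold(boxes, scores, threshold=0.5):
    """Keep every box whose score is at least `threshold`, in the original order."""
    keep = [i for i, score in enumerate(scores) if score >= threshold]
    return [boxes[i] for i in keep], [scores[i] for i in keep]


if __name__ == "__main__":
    # Toy data: boxes as (x1, y1, x2, y2) tuples with one score each.
    boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 4, 4), (8, 8, 30, 30)]
    scores = [0.2, 0.9, 0.5, 0.7]
    print(select_top_k(boxes, scores, k=2))
    print(select_by_threshold(boxes, scores, threshold=0.5))
```

Top-k always draws a fixed number of boxes regardless of their quality, while a threshold draws a variable number that depends on how confident the detector is for a given image.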

:medal_military: Open Vocabulary Detection on COCO OVD Dataset

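| Method | Backbone | Use Extra Training Set | Novel AP<sub>50</sub> |
|---|---|---|---|
| OVR-CNN | RN50 | | 22.8 |
| ViLD | ViT-B/32 | | 27.6 |
| Detic | RN50 | | 27.8 |
| OV-DETR | ViT-B/32 | | 29.4 |
| BARON | RN50 | | 34 |
| Rasheed et al | RN50 | | 36.6 |
| CORA | RN50x4 | | 41.7 |
| BARON | RN50 | | 42.7 |
| CORA+ | RN50x4 | | 43.1 |
| Ours* | RN101 + SwinT | | 50.3 |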

Table 2: Results on COCO OVD benchmark. *Our approach with GDINO, SigLIP, and Mask-RCNN trained on COCO OVD split.

To replicate our results from the above table (i.e. Table 2 from the main paper):

  1. Obtain Mask-RCNN model weights trained on the COCO OVD dataset split, in one of two ways:
    • Train the Mask-RCNN model from scratch:
      • Edit the values of DETECTRON2_DATASETS and CHECKPOINT_PATH in scripts/open_vocab_detection/train_mask_rcnn/train.batch
      • Start training by running: bash scripts/open_vocab_detection/train_mask_rcnn/train.batch
    • Alternatively, download the pre-trained weights of Mask-RCNN trained on COCO OVD from Model Weights, and edit the detectron2_dir, sam_checkpoint, gdino_checkpoint, and rcnn_weight_dir values in scripts/open_vocab_detection/evaluate_method/params.json accordingly. For rcnn_weight_dir, set the path to the downloaded folder MaskRCNN_COCO_OVD without a trailing slash.
  2. Run the following script from the main project directory:
    python scripts/open_vocab_detection/evaluate_method/main.py
    

After the above script finishes, the results are displayed on the console. Make sure you have completed the installation and setup steps described in Datasets and Model Weights.

:framed_picture: Qualitative Visualization

| RNCDL | GDINO | RCNN_CLIP | Ours |
|---|---|---|---|
| <img src="visualizations/img_1_RNCDL.jpg" width="200"/> | <img src="visualizations/img_1_GDINO.jpg" width="200"/> | <img src="visualizations/img_1_MaskRCNN_CLIP.jpg" width="200"/> | <img src="visualizations/img_1_Ours.jpg" width="200"/> |
| <img src="visualizations/img_2_RNCDL.jpg" width="200"/> | <img src="visualizations/img_2_GDINO.jpg" width="200"/> | <img src="visualizations/img_2_MaskRCNN_CLIP.jpg" width="200"/> | <img src="visualizations/img_2_Ours.jpg" width="200"/> |
| <img src="visualizations/img_3_RNCDL.jpg" width="200"/> | <img src="visualizations/img_3_GDINO.jpg" width="200"/> | <img src="visualizations/img_3_MaskRCNN_CLIP.jpg" width="200"/> | <img src="visualizations/img_3_Ours.jpg" width="200"/> |

For additional and higher-resolution visualizations, please visit the project website.

:email: Contact

Should you have any questions, please create an issue in this repository or contact us at rohit.bharadwaj@mbzuai.ac.ae.

:pray: Acknowledgement

We thank the authors of GDINO, SAM, CLIP, and RNCDL for releasing their code.

:black_nib: Citation

If you found our work helpful, please consider starring the repository ⭐⭐⭐ and citing our work as follows:

@misc{bharadwaj2023enhancing,
      title={Enhancing Novel Object Detection via Cooperative Foundational Models}, 
      author={Rohit Bharadwaj and Muzammal Naseer and Salman Khan and Fahad Shahbaz Khan},
      year={2023},
      eprint={2311.12068},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}