Language-conditioned Detection Transformer

<p align="center"> <img src='figs/teaser_light.png' align="center" width="70%"> </p>

Language-conditioned Detection Transformer
Jang Hyun Cho and Philipp Krähenbühl
CVPR 2024 ([pdf][supp])

What is DECOLA?

We design a new open-vocabulary detection framework that adjusts the inner mechanism of the object detector to the concepts it reasons over. This language-conditioned detector (DECOLA) trains as easily as classical detectors, but generalizes much better to novel concepts. DECOLA trains in three steps: (1) learning to condition on a set of concepts, (2) pseudo-labeling image-level data to scale up the training data, and (3) learning a general-purpose detector for downstream open-vocabulary detection. We show strong zero-shot performance on the open-vocabulary and standard LVIS benchmarks. [Full abstract]

TL;DR: We design a special detector for pseudo-labeling and scale up open-vocabulary detection through self-training.
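To make the three-step recipe concrete, here is a minimal sketch of the training pipeline. The function names below are placeholders, not the repository's actual entry points; see the training scripts for the real commands.

```python
# Sketch of DECOLA's three training phases (placeholder names only).

def train_phase1(detector, lvis_data):
    """Phase 1: train the detector conditioned on the class names present in
    each image, so its queries specialize to the prompted concepts."""
    ...

def pseudo_label(phase1_detector, image_level_data):
    """Use the conditioned detector to turn image-level labels (e.g. ImageNet-21K
    tags) into box pseudo-annotations, scaling up the training data."""
    ...

def train_phase2(detector, lvis_data, pseudo_labeled_data):
    """Phase 2: train a general-purpose detector on human annotations plus
    pseudo-labels for downstream open-vocabulary detection."""
    ...

# End-to-end self-training:
# phase1_detector = train_phase1(detector, lvis_data)
# pseudo_data     = pseudo_label(phase1_detector, image_level_data)
# final_detector  = train_phase2(detector, lvis_data, pseudo_data)
```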

Please feel free to reach out for any questions or discussions!

📧 Jang Hyun Cho [email]

🔥 News 🔥

Features

Installation

See installation instructions.

Demo

We provide a demo based on the detectron2 demo interface.

DECOLA Phase 1: Language-conditioned detection.

First, please download the appropriate model checkpoint. Then you can run the demo as follows:

python demo.py --config-file configs/DECOLA_PHASE1_L_CLIP_SwinB_4x.yaml --input figs/input/pizza.jpg --output figs/output/pizza.jpg --vocabulary custom --custom_vocabulary cola,piza,fork,knif,table --confidence-threshold 0.3 --language-condition --opts MODEL.WEIGHTS weights/DECOLA_PHASE1_L_CLIP_SwinB_4x.pth 

The above model is DECOLA Phase 1 with a Swin-B backbone (config), trained only on the LVIS dataset. If set up properly, the output image should look like this:

<p align="center"> <img src='figs/output/pizza.jpg'width=500, align="center"> </p>

Note that cola is not in the LVIS vocabulary, and piza and knif contain intentional typos. Similarly,

python demo.py --config-file configs/DECOLA_PHASE1_L_CLIP_SwinB_4x.yaml --input figs/input/cola.jpg --output figs/output/cola.jpg --vocabulary custom --custom_vocabulary cola,cat,mentos,table --confidence-threshold 0.3 --language-condition --opts MODEL.WEIGHTS weights/DECOLA_PHASE1_L_CLIP_SwinB_4x.pth 
<p align="center"> <img src='figs/output/cola.jpg' width=600, align="center"> </p>

Above, DECOLA successfully predicts mentos and cola, which are again outside the LVIS vocabulary.

DECOLA Phase 2: General-purpose detection.

General-purpose detection with Phase 2 of DECOLA is also available, for both a custom vocabulary

python demo.py --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --input figs/input/desk.jpg --output figs/output/desk1.jpg --vocabulary custom --custom_vocabulary water_bottle,wallet,webcam,mug,headphone,drawer,keyboard,laptop,straw,mouse,paper,plastic_bag --confidence-threshold 0.2 --opts MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth 
<p align="center"> <img src='figs/output/desk1.jpg' width=600, align="center"> </p>

and a pre-defined vocabulary (e.g., LVIS).

python demo.py --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --input figs/input/desk.jpg --output figs/output/desk2.jpg --vocabulary lvis --confidence-threshold 0.2 --opts MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth 
<p align="center"> <img src='figs/output/desk2.jpg' width=600, align="center"> </p>

Integrating Segment Anything Model

We combine DECOLA's powerful language-conditioned, open-vocabulary detection with the Segment Anything Model (SAM). DECOLA's box outputs prompt SAM to generate high-quality, class-aware instance segmentation. Simply install SAM and add the --use-sam flag:

python demo.py --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --input figs/input/desk.jpg --output figs/output_sam/desk2.jpg --vocabulary lvis --confidence-threshold 0.2 --use-sam --opts MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth 
<p align="center"> <img src='figs/output_sam/desk2.jpg' width=600, align="center"> </p>

Image credit: David Fouhey.
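For reference, the box-to-mask prompting that --use-sam performs can be reproduced with the official segment_anything API roughly as follows. This is a sketch only: the SAM checkpoint path and variant are assumptions, and the dummy box stands in for DECOLA's detections.

```python
import cv2
import torch
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (path/variant are assumptions; use whichever ViT-H/L/B
# checkpoint you downloaded from the segment-anything repository).
sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.imread("figs/input/desk.jpg")
predictor.set_image(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

# Boxes in (x1, y1, x2, y2) pixel coordinates. In the demo these come from
# DECOLA's output; here a dummy box keeps the sketch self-contained.
boxes = torch.tensor([[100.0, 100.0, 400.0, 300.0]], device=predictor.device)
boxes = predictor.transform.apply_boxes_torch(boxes, image.shape[:2])

masks, _, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=boxes,
    multimask_output=False,
)  # (N, 1, H, W) boolean masks, one per input box
```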

Training DECOLA

Please prepare the datasets first, then follow the training scripts to reproduce our results.

Testing DECOLA

Check out all the checkpoints of our models as well as the baselines.

Here are the highlight results:

Open-vocabulary LVIS with Deformable DETR

| name | backbone | box AP_novel | box mAP |
|------|----------|--------------|---------|
| baseline | ResNet-50 | 9.4 | 32.2 |
| + self-train | ResNet-50 | 23.2 | 36.2 |
| DECOLA (ours) | ResNet-50 | 27.6 | 38.3 |
| baseline | Swin-B | 16.2 | 41.1 |
| + self-train | Swin-B | 30.8 | 42.3 |
| DECOLA (ours) | Swin-B | 35.7 | 46.3 |
| baseline | Swin-L | 21.9 | 49.6 |
| + self-train | Swin-L | 36.5 | 51.8 |
| DECOLA (ours) | Swin-L | 46.9 | 55.2 |

Direct zero-shot transfer to LVIS minival

| name | backbone | data | AP_r | AP_c | AP_f | mAP |
|------|----------|------|------|------|------|-----|
| DECOLA | Swin-T | O365, IN21K | 32.8 | 32.0 | 31.8 | 32.0 |
| DECOLA | Swin-L | O365, OID, IN21K | 41.5 | 38.0 | 34.9 | 36.8 |

Direct zero-shot transfer to LVIS v1.0

| name | backbone | data | AP_r | AP_c | AP_f | mAP |
|------|----------|------|------|------|------|-----|
| DECOLA | Swin-T | O365, IN21K | 27.2 | 24.9 | 28.0 | 26.6 |
| DECOLA | Swin-L | O365, OID, IN21K | 32.9 | 29.1 | 30.3 | 30.2 |

Open-vocabulary LVIS with CenterNet2

| name | backbone | box AP_novel | box mAP | mask AP_novel | mask mAP |
|------|----------|--------------|---------|---------------|----------|
| DECOLA | ResNet-50 | 29.5 | 37.7 | 27.0 | 33.7 |
| DECOLA | Swin-B | 38.4 | 46.7 | 35.3 | 42.0 |

Standard LVIS with Deformable DETR

| name | backbone | box AP_rare | box mAP |
|------|----------|-------------|---------|
| baseline | ResNet-50 | 26.3 | 35.6 |
| + self-train | ResNet-50 | 30.0 | 36.6 |
| DECOLA (ours) | ResNet-50 | 35.9 | 39.4 |
| baseline | Swin-B | 38.3 | 44.5 |
| + self-train | Swin-B | 42.0 | 45.2 |
| DECOLA (ours) | Swin-B | 47.4 | 48.3 |
| baseline | Swin-L | 49.3 | 54.4 |
| + self-train | Swin-L | 48.7 | 53.4 |
| DECOLA (ours) | Swin-L | 54.9 | 56.4 |

Standard LVIS with CenterNet2

| name | backbone | box AP_rare | box mAP | mask AP_rare | mask mAP |
|------|----------|-------------|---------|--------------|----------|
| DECOLA (ours) | ResNet-50 | 35.6 | 38.6 | 32.1 | 34.4 |
| DECOLA (ours) | Swin-B | 47.6 | 48.5 | 43.7 | 43.6 |

Analyzing DECOLA

Here we provide code for the analyses of our model as well as the baselines.

License

The majority of DECOLA is licensed under the Apache 2.0 license. However, this work builds largely on Detic, Deformable DETR, and Detectron2. We also provide optional integration with the Segment Anything Model. Please refer to their original licenses for more details.

Citation

If you find this project useful for your research, please cite our paper using the following BibTeX entry.

@InProceedings{Cho_2024_CVPR,
    author    = {Cho, Jang Hyun and Kr\"ahenb\"uhl, Philipp},
    title     = {Language-conditioned Detection Transformer},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {16593-16603}
}