Awesome

PromptDet: Towards Open-vocabulary Detection using Uncurated Images (ECCV 2022)

Introduction

The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where the class-agnostic object proposals are classified with a text encoder from pre-trained visual-language model; (ii) To pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to align the textual embedding space with regional visual object features; (iii) To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resource via a novel self-training framework, which allows to train the proposed detector on a large corpus of noisy uncurated web images. Lastly, (iv) to evaluate our proposed detector, termed as PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO dataset. PromptDet shows superior performance over existing approaches with fewer additional training images and zero manual annotations whatsoever.

Training framework

method overview

updates

July 20, 2022: add the code for LAION-novel and self-training
March 28, 2022: initial release

Prerequisites

MMDetection version 2.16.0.
Please see get_started.md for installation and the basic usage of MMDetection.

Regional Prompt Learning (RPL)

We learn the prompt vectors in an off-line manner using RPL. For your convenience, we also provide the learned prompt vectors and the category embeddings.

LAION-novel dataset

The LAION-novel dataset based on the learned category embeddings can be generated by using the PromptDet tools as follows:

# stege-I: install the dependencies, download the laion400m 64GB image.index and metadata.hdf5 (https://the-eye.eu/public/AI/cah/), and then retrival the LAION images (urls)
pip install faiss-cpu==1.7.2 img2dataset==1.12.0 fire==0.4.0 h5py==3.6.0
python tools/promptdet/retrieval_laion_image.py --indice-folder [laion400m-64GB-index] --metadata [metadata.hdf5] --text-features promptdet_resources/lvis_category_embeddings.pt --output-folder data/laion_lvis/images --num-images 500

# stege-II: download the LAION images
python tools/promptdet/download_laion_image.py --output-folder data/laion_lvis/images --num-thread 10

# stege-III: convert the LAION images to mmdetection format
python tools/promptdet/laion_dataset_converter.py --data-path data/laion_lvis/images --out-file data/laion_lvis/laion_train.json --topK 300

For your convenience, we also provide the image urls of our LAION-novel dataset.

Inference

# assume that you are under the root directory of this project,
# and you have activated your virtual environment if needed,
# and with LVIS v1.0 dataset in 'data/lvis_v1'.

./tools/dist_test.sh configs/promptdet/promptdet_r50_fpn_sample1e-3_mstrain_1x_lvis_v1_self_train.py work_dirs/promptdet_r50_fpn_sample1e-3_mstrain_1x_lvis_v1_self_train.pth 4 --eval bbox segm

Train

# download 'lvis_v1_train_seen.json' to 'data/lvis_v1/annotations'.

# train detector without self-training
./tools/dist_train.sh configs/promptdet/promptdet_r50_fpn_sample1e-3_mstrain_1x_lvis_v1.py 4

# train detector with self-training
./tools/dist_train.sh configs/promptdet/promptdet_r50_fpn_sample1e-3_mstrain_1x_lvis_v1_self_train.py 4

[0] Annotation file of base categories: lvis_v1_train_seen.json.
[1] Note that we provide a EpochPromptDetRunner to fetch the data from mutilple datasets alternately.

Models

For your convenience, we provide the following trained models (PromptDet) with mask AP.

Model	RPL	Self-training	Epochs	Scale Jitter	Input Size	AP<sub>novel	AP<c>c	AP<sub>f	AP	Download
Baseline (manual prompt)			12	640~800	800x800	7.4	17.2	26.1	19.0	google
PromptDet_R_50_FPN_1x	✓		12	640~800	800x800	11.5	19.4	26.7	20.9	google
PromptDet_R_50_FPN_1x	✓	✓	12	640~800	800x800	19.5	18.2	25.6	21.3	google
PromptDet_R_50_FPN_6x	✓	✓	72	100~1280	800x800	21.7	23.2	29.6	25.5	google

[0] All results are obtained with a single model and without any test time data augmentation such as multi-scale, flipping and etc..
[1] Refer to more details in config files in config/promptdet/.

Acknowledgement

Thanks MMDetection team for the wonderful open source project!

Citation

If you find PromptDet useful in your research, please consider citing:

@inproceedings{feng2022promptdet,
    title={PromptDet: Towards Open-vocabulary Detection using Uncurated Images},
    author={Feng, Chengjian and Zhong, Yujie and Jie, Zequn and Chu, Xiangxiang and Ren, Haibing and Wei, Xiaolin and Xie, Weidi and Ma, Lin},
    journal={Proceedings of the European Conference on Computer Vision},
    year={2022}
}