Home

Awesome

Detecting Twenty-thousand Classes using Image-level Supervision

Detic: A Detector with image classes that can use image-level labels to easily train detectors.

<p align="center"> <img src='docs/teaser.jpeg' align="center" height="300px"> </p>

Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
ECCV 2022 (arXiv 2201.02605)

Features

Installation

See installation instructions.

Demo

Update April 2022: we released more real-time models here.

Replicate web demo and docker image: Replicate

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the web demo: Hugging Face Spaces

Run our demo using Colab (no GPU needed): Open In Colab

We use the default detectron2 demo interface. For example, to run our 21K model on a messy desk image (image credit David Fouhey) with the lvis vocabulary, run

mkdir models
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth -O models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
wget https://eecs.engin.umich.edu/~fouhey/fun/desk/desk.jpg
python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out.jpg --vocabulary lvis --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

If setup correctly, the output should look like:

<p align="center"> <img src='docs/example_output_lvis.jpeg' align="center" height="450px"> </p>

The same model can run with other vocabularies (COCO, OpenImages, or Objects365), or a custom vocabulary. For example:

python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out2.jpg --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffe --confidence-threshold 0.3 --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

The output should look like:

<p align="center"> <img src='docs/example_output_custom.jpeg' align="center" height="450px"> </p>

Note that headphone, paper and coffe (typo intended) are not LVIS classes. Despite the misspelled class name, our detector can produce a reasonable detection for coffe.

Benchmark evaluation and training

Please first prepare datasets, then check our MODEL ZOO to reproduce results in our paper. We highlight key results below:

License

The majority of Detic is licensed under the Apache 2.0 license, however portions of the project are available under separate license terms: SWIN-Transformer, CLIP, and TensorFlow Object Detection API are licensed under the MIT license; UniDet is licensed under the Apache 2.0 license; and the LVIS API is licensed under a custom license. If you later add other third party code, please keep this license info updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0

Ethical Considerations

Detic's wide range of detection capabilities may introduce similar challenges to many other visual recognition and open-set recognition methods. As the user can define arbitrary detection classes, class design and semantics may impact the model output.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@inproceedings{zhou2022detecting,
  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={ECCV},
  year={2022}
}