# OmDet-Turbo
<p align="center">
  <a href="https://arxiv.org/abs/2403.06892"><strong>[Paper 📄]</strong></a>
  <a href="https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T"><strong>[Model 🗂️]</strong></a>
</p>
<p align="center">
  Fast and accurate open-vocabulary end-to-end object detection
</p>

## 🗓️ Updates
- 09/26/2024: OmDet-Turbo has been integrated into Transformers v4.45.0. The integration code is available here, and the Hugging Face model is available here. A usage sketch is given after this list.
- 07/05/2024: Our new open-source project OmAgent, a multimodal agent framework for solving complex tasks, is now available! OmDet has also been seamlessly integrated into it as an OVD tool. Feel free to explore our multimodal agent framework.
- 06/24/2024: Added guidance for converting OmDet-Turbo to ONNX.
- 03/25/2024: Inference code and a pretrained OmDet-Turbo-Tiny model released.
- 03/12/2024: GitHub open-source project created.
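
Because the Transformers integration exposes OmDet-Turbo through the standard processor/model classes, zero-shot detection can be run with a few lines of Python. The sketch below follows the Transformers documentation from around the time of the integration; the checkpoint name `omlab/omdet-turbo-swin-tiny-hf` and the exact post-processing argument names are assumptions that may differ across library versions, so check the current docs.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, OmDetTurboForObjectDetection

# Load the HF-converted OmDet-Turbo tiny checkpoint (name assumed, see the model card).
processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")

# Any test image works; COCO val image used here for illustration.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Open-vocabulary classes are passed as free-form text prompts.
classes = ["cat", "remote"]
inputs = processor(image, text=classes, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into thresholded, NMS-filtered detections.
results = processor.post_process_grounded_object_detection(
    outputs,
    classes=classes,
    target_sizes=[image.size[::-1]],
    score_threshold=0.3,
    nms_threshold=0.3,
)[0]

for score, class_name, box in zip(results["scores"], results["classes"], results["boxes"]):
    box = [round(v, 1) for v in box.tolist()]
    print(f"Detected {class_name} with confidence {round(score.item(), 2)} at {box}")
```
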
## 🔗 Related Works
If you are interested in our research, we welcome you to explore our other projects.
🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI 2024) 🏠 GitHub Repository
🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
## 📖 Introduction
This repository is the official PyTorch implementation for OmDet-Turbo, a fast transformer-based open-vocabulary object detection model.
### ⭐️ Highlights
- OmDet-Turbo is a transformer-based real-time open-vocabulary detector that combines strong OVD capabilities with fast inference speed. This model addresses the challenges of efficient detection in open-vocabulary scenarios while maintaining high detection performance.
- We introduce the Efficient Fusion Head, a swift multimodal fusion module designed to alleviate the computational burden on the encoder and reduce the time consumption of the ROI-based head.
- The OmDet-Turbo-Base model achieves state-of-the-art zero-shot performance on the ODinW and OVDEval benchmarks, with AP scores of 30.1 and 26.86, respectively.
- The inference speed of OmDet-Turbo-Base on the COCO val2017 dataset reaches 100.2 FPS on an A100 GPU.
For more details, check out our paper Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head.

<img src="docs/turbo_model.jpeg" alt="model_structure" width="100%">
## ⚡️ Inference Speed
Comparison of the inference speed of each component in the tiny-size model.

<img src="docs/speed_compare.jpeg" alt="speed" width="100%">
## 🛠️ How To Install

Follow the Installation Instructions to set up the environment for OmDet-Turbo.
## 🚀 How To Run

### Local Inference

- Download our pretrained model and the CLIP checkpoints.
- Create a folder named resources and put the downloaded models into it.
- Run run_demo.py; the images with predicted results will be saved in the ./outputs folder.
### Run as an API Server

- Download our pretrained model and the CLIP checkpoints.
- Create a folder named resources and put the downloaded models into it.
- Run run_wsgi.py; the API server will start at http://host_ip:8000/inf_predict. Check http://host_ip:8000/docs to try it out. An example client request is sketched after the note below.

We have already added a language cache for inference with run_demo.py. For more details, please open and check the run_demo.py script.
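
As a rough illustration of calling the server, the sketch below posts an image to the /inf_predict endpoint with Python's requests library. The payload fields are hypothetical placeholders, not the server's actual schema; the real request format is listed on the /docs page served by run_wsgi.py.

```python
import base64
import requests

# Encode a local image for transport; the server may instead accept URLs or raw bytes.
with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Hypothetical field names for illustration only; consult http://host_ip:8000/docs
# for the actual request schema exposed by the server.
payload = {
    "model_id": "OmDet-Turbo_tiny_SWIN_T",
    "data": [image_b64],
    "labels": ["person", "cat", "dog"],
    "task": "Detect person, cat, dog.",
    "threshold": 0.3,
}

resp = requests.post("http://host_ip:8000/inf_predict", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```
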
## ⚙️ How To Export ONNX Model

- Replace OmDetV2Turbo in OmDet-Turbo_tiny_SWIN_T.yaml with OmDetV2TurboInfer.
- Run export.py, and omdet.onnx will be exported.

In the above example, post-processing is not included in the ONNX model and all input sizes are fixed. You can add more post-processing and change the input sizes according to your needs. A quick way to sanity-check the exported graph is sketched below.
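
A minimal sketch for loading the exported model with onnxruntime and inspecting its fixed input signature. The zero-tensor feed is only a smoke test, and replacing any symbolic dimension with 1 is an assumption that may not match your export.

```python
import numpy as np
import onnxruntime as ort

# Load the model produced by export.py (path assumed to be ./omdet.onnx).
session = ort.InferenceSession("omdet.onnx", providers=["CPUExecutionProvider"])

# Print the input names, shapes, and dtypes baked into the ONNX graph.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Smoke test: feed zero tensors matching the reported shapes.
feeds = {}
for inp in session.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # guess 1 for symbolic dims
    dtype = np.int64 if "int" in inp.type else np.float32
    feeds[inp.name] = np.zeros(shape, dtype=dtype)

outputs = session.run(None, feeds)
print([o.shape for o in outputs])
```
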
## 📦 Model Zoo

The performance on COCO and LVIS is evaluated under the zero-shot setting.

Model | Backbone | Pre-Train Data | COCO (AP) | LVIS (AP) | FPS (PyTorch/TRT) | Weight |
---|---|---|---|---|---|---|
OmDet-Turbo-Tiny | Swin-T | O365, GoldG | 42.5 | 30.3 | 21.5/140.0 | weight |
## 📝 Main Results

<img src="docs/main_results.png" alt="main_result" width="100%">

## Citation
Please consider citing our papers if you use our projects:
@article{zhao2024real,
title={Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head},
author={Zhao, Tiancheng and Liu, Peng and He, Xuan and Zhang, Lu and Lee, Kyusong},
journal={arXiv preprint arXiv:2403.06892},
year={2024}
}
@article{zhao2024omdet,
title={OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network},
author={Zhao, Tiancheng and Liu, Peng and Lee, Kyusong},
journal={IET Computer Vision},
year={2024},
publisher={Wiley Online Library}
}