<h1 align="center">DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding</h1> <div align=center>

The World's Top-Performing Vision Model for Open-World Object Detection

The project provides examples for using DINO-X, which is hosted on DeepDataSpace.

IDEA Research

</div> <div align=center>

arXiv preprint Homepage

</div>

Highlights

Building on Grounding DINO 1.5, DINO-X introduces several improvements that move it closer to a general object-centric vision model. The highlights of DINO-X are as follows:

The Strongest Open-Set Detection Performance: DINO-X Pro sets new SOTA results on zero-shot transfer detection benchmarks: 56.0 AP on COCO, 59.8 AP on LVIS-minival, and 52.4 AP on LVIS-val. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of the LVIS-minival and LVIS-val benchmarks, improving on the previous SOTA by 5.8 box AP and 5.0 box AP, respectively. These results underscore its significantly enhanced capacity for recognizing long-tailed objects.

🔥 Diverse Input Prompt and Multi-level Output Semantic Representations: DINO-X can accept text prompts, visual prompts, and customized prompts as input, and it outputs representations at various semantic levels, including bounding boxes, segmentation masks, pose keypoints, and object captions, with multiple perception heads.

🍉 Rich and Practical Capabilities: DINO-X simultaneously supports many practical tasks, including Open-Set Object Detection and Segmentation, Phrase Grounding, Visual-Prompt Counting, Pose Estimation, and Region Captioning. We further develop a universal object prompt to achieve Prompt-Free Anything Detection and Recognition.

Model Framework

DINO-X can accept text prompts, visual prompts, and customized prompts as input, and it can generate representations at various semantic levels, including bounding boxes, segmentation masks, pose keypoints, and object captions.

<div align="center"> <img src="./assets/dino_x_overall_framework.jpg" width="90%"> </div>

Performance

Side-by-Side Performance Comparison with Previous Best Methods

<div align="center"> <img src="./assets/dino_x_performance.jpg" width="90%"> </div>

Zero-Shot Performance on Object Detection Benchmarks

<table align="center"> <thead> <tr> <th>Model</th> <th>COCO <br><sup><sup>(AP box)</sup></sup></th> <th>LVIS-minival <br><sup><sup>(AP all)</sup></sup></th> <th>LVIS-minival <br><sup><sup>(AP rare)</sup></sup></th> <th>LVIS-val <br><sup><sup>(AP all)</sup></sup></th> <th>LVIS-val <br><sup><sup>(AP rare)</sup></sup></th> </tr> </thead> <tbody align="center"> <tr> <td>Other Best<br>Open-Set Model</td> <td>53.4<br><sup><sup>(OmDet-Turbo)</sup></sup></td> <td>47.6<br><sup><sup>(T-Rex2 visual)</sup></sup></td> <td>45.4<br><sup><sup>(T-Rex2 visual)</sup></sup></td> <td>45.3<br><sup><sup>(T-Rex2 visual)</sup></sup></td> <td>43.8<br><sup><sup>(T-Rex2 visual)</sup></sup></td> </tr> <tr> <td>DetCLIPv3</td> <td> - </td> <td>48.8</td> <td>49.9</td> <td>41.4</td> <td>41.4</td> </tr> <tr> <td>Grounding DINO</td> <td>52.5</td> <td>27.4</td> <td>18.1</td> <td> - </td> <td> - </td> </tr> <tr> <td>T-Rex2 (text)</td> <td>52.2</td> <td>54.9</td> <td>49.2</td> <td> 45.8 </td> <td> 42.7 </td> </tr> <tr> <td>Grounding DINO 1.5 Pro</td> <td>54.3</td> <td>55.7</td> <td>56.1</td> <td>47.6</td> <td>44.6</td> </tr> <tr> <td>Grounding DINO 1.6 Pro</td> <td>55.4</td> <td>57.7</td> <td>57.5</td> <td>51.1</td> <td>51.5</td> </tr> <tr> <td><b>DINO-X Pro</b></td> <td><b>56.0</b></td> <td><b>59.7</b></td> <td><b>63.3</b></td> <td><b>52.4</b></td> <td><b>56.5</b></td> </tr> </tbody> </table>
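The rare-class margins quoted in the highlights can be reproduced from the table above. The following is a small arithmetic check, not part of the DINO-X codebase: the dictionary and function names are illustrative, and the values are copied from the Grounding DINO 1.6 Pro and DINO-X Pro rows.

```python
# Rare-class AP values copied from the zero-shot detection table above.
DINO_X_PRO = {"LVIS-minival (AP rare)": 63.3, "LVIS-val (AP rare)": 56.5}
GROUNDING_DINO_1_6_PRO = {"LVIS-minival (AP rare)": 57.5, "LVIS-val (AP rare)": 51.5}


def ap_gains(new, baseline):
    """Return the per-benchmark AP difference, rounded to one decimal place."""
    return {name: round(new[name] - baseline[name], 1) for name in new}


gains = ap_gains(DINO_X_PRO, GROUNDING_DINO_1_6_PRO)
for benchmark, gain in gains.items():
    print(f"{benchmark}: +{gain} AP")
```

Rounding to one decimal place matches the precision of the table and avoids floating-point noise in the subtraction.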

Zero-Shot Performance on Segmentation Benchmarks

<table align="center"> <thead> <tr> <th>Model</th> <th>COCO <br><sup><sup>(AP mask)</sup></sup></th> <th>LVIS-minival <br><sup><sup>(AP mask)</sup></sup></th> <th>LVIS-minival <br><sup><sup>(AP mask rare)</sup></sup></th> <th>LVIS-val <br><sup><sup>(AP mask)</sup></sup></th> <th>LVIS-val <br><sup><sup>(AP mask rare)</sup></sup></th> </tr> </thead> <tbody align="center"> <tr> <td colspan="6" style="text-align:center;"> <em>Assembled General Perception Model</em> </td> </tr> <tr> <td>Grounded SAM <small>(1.5 Pro + Huge)</small></td> <td>44.3</td> <td>47.7</td> <td>50.2</td> <td>41.8</td> <td>46.0</td> </tr> <tr> <td>Grounded SAM 2 <small>(1.5 Pro + Large)</small></td> <td> <b> 44.7 </b> </td> <td>46.2</td> <td>50.1</td> <td>40.5</td> <td>44.6</td> </tr> <tr> <td> <b>DINO-X Pro + SAM-Huge</b> </td> <td>44.2</td> <td><b>51.2</b></td> <td><b>52.2</b></td> <td> - </td> <td> - </td> </tr> <tr> <td colspan="6" style="text-align:center;"> <em>Unified Vision Model</em> </td> </tr> <tr> <td><b>DINO-X Pro</b> <small>(Mask Head)</small></td> <td>37.9</td> <td>43.8</td> <td>46.7</td> <td>38.5</td> <td>44.4</td> </tr> </tbody> </table>

API Usage

Installation

pip install -r requirements.txt

Note: If you encounter errors with the API, install the latest version of dds-cloudapi-sdk:

pip install dds-cloudapi-sdk --upgrade

Register on Official Website to Get API Token

Run local API demos

Open-World Object Detection and Segmentation

Open-world detection lets users detect anything with text prompts. To try this feature, set your API token in demo.py and run the local demo:

python demo.py

After running the local demo, the annotated image will be saved at: ./outputs/open_world_detection
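If you prefer not to hard-code the token in demo.py, one option is to read it from an environment variable. The sketch below is an assumption about how you might organize this, not something the repository provides: the variable name `DDS_API_TOKEN` and the helper `load_api_token` are hypothetical.

```python
import os


def load_api_token(env_var: str = "DDS_API_TOKEN") -> str:
    """Read the API token from the environment.

    `DDS_API_TOKEN` is a hypothetical variable name chosen for this sketch;
    the demo scripts themselves expect the token to be set in the source file.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"No API token found. Set the {env_var} environment variable "
            "to the token obtained from the official website."
        )
    return token
```

You could then assign the returned value wherever demo.py expects the token, which keeps the credential out of version control.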

<details> <summary> Demo Image Visualization </summary>

With the text prompt "wheel . eye . helmet . mouse . mouth . vehicle . steering wheel . ear . nose", we get the following prediction results:

<div align="center">
Demo Image | Box Prediction | Mask Prediction
</div> </details>
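The demo prompt above lists category names separated by " . ". As a minimal sketch, assuming that this delimiter is the expected format (it is inferred from the example prompt, not from a documented specification), a prompt can be assembled from a Python list. The helper name `build_text_prompt` is hypothetical.

```python
def build_text_prompt(categories):
    """Join category names with ' . ', the separator used in the demo prompt.

    Empty or whitespace-only entries are dropped so they do not produce
    dangling separators.
    """
    cleaned = [name.strip() for name in categories if name.strip()]
    return " . ".join(cleaned)


CATEGORIES = [
    "wheel", "eye", "helmet", "mouse", "mouth",
    "vehicle", "steering wheel", "ear", "nose",
]

prompt = build_text_prompt(CATEGORIES)
print(prompt)
```

Multi-word categories such as "steering wheel" are kept intact because only the " . " separator marks category boundaries.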

Prompt-Free Anything Detection and Segmentation

We've implemented a novel prompt-free object detection feature: users do not need to provide any prompt, and DINO-X automatically recognizes, detects, and segments the objects in the provided images. After setting your API token, you can try this feature with the following script:

python prompt_free_demo.py

After running the local demo, the annotated image will be saved at: ./outputs/prompt_free_detection_segmentation

<details> <summary> Demo Image Visualization </summary>

With the special text prompt "<prompt_free>", we get the following prediction results:

<div align="center">
Demo Image | Box Prediction | Mask Prediction
</div> </details>
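The two demos differ only in the prompt they send and the directory they write to. The sketch below shows one way a wrapper script might choose between them. It is illustrative only: `plan_run` is a hypothetical helper, and the assumption that an empty category list should fall back to the `<prompt_free>` prompt is ours, not a documented behavior of the demo scripts.

```python
from pathlib import Path

PROMPT_FREE = "<prompt_free>"  # special prompt quoted in the demo above

# Output directories as stated for the two local demos.
OPEN_WORLD_DIR = Path("./outputs/open_world_detection")
PROMPT_FREE_DIR = Path("./outputs/prompt_free_detection_segmentation")


def plan_run(categories=None):
    """Return (prompt, output_dir) for the given category names.

    With no usable categories, fall back to prompt-free mode.
    """
    cleaned = [name.strip() for name in (categories or []) if name.strip()]
    if cleaned:
        return " . ".join(cleaned), OPEN_WORLD_DIR
    return PROMPT_FREE, PROMPT_FREE_DIR
```

For example, `plan_run(["helmet", "vehicle"])` yields a text prompt with the open-world output directory, while `plan_run()` yields the prompt-free settings.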

Related Work

LICENSE

<details close> <summary> <b> DINO-X API License </b> </summary>

DINO-X is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Copyright (c) IDEA. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

</details>

BibTeX

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@misc{ren2024dinoxunifiedvisionmodel,
      title={DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding}, 
      author={Tianhe Ren and Yihao Chen and Qing Jiang and Zhaoyang Zeng and Yuda Xiong and Wenlong Liu and Zhengyu Ma and Junyi Shen and Yuan Gao and Xiaoke Jiang and Xingyu Chen and Zhuheng Song and Yuhong Zhang and Hongjie Huang and Han Gao and Shilong Liu and Hao Zhang and Feng Li and Kent Yu and Lei Zhang},
      year={2024},
      eprint={2411.14347},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.14347}, 
}
@misc{ren2024grounding,
      title={Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection}, 
      author={Tianhe Ren and Qing Jiang and Shilong Liu and Zhaoyang Zeng and Wenlong Liu and Han Gao and Hongjie Huang and Zhengyu Ma and Xiaoke Jiang and Yihao Chen and Yuda Xiong and Hao Zhang and Feng Li and Peijun Tang and Kent Yu and Lei Zhang},
      year={2024},
      eprint={2405.10300},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{jiang2024trex2genericobjectdetection,
      title={T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy}, 
      author={Qing Jiang and Feng Li and Zhaoyang Zeng and Tianhe Ren and Shilong Liu and Lei Zhang},
      year={2024},
      eprint={2403.14610},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2403.14610}, 
}
@misc{liu2024groundingdinomarryingdino,
      title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection}, 
      author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Hao Zhang and Jie Yang and Qing Jiang and Chunyuan Li and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang},
      year={2024},
      eprint={2303.05499},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2303.05499}, 
}