Lenna: Language Enhanced Reasoning Detection Assistant

<a href='https://github.com/Meituan-AutoML/Lenna'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2312.02433'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>

<p> <div align="justify"> With the fast-paced development of multimodal large language models (MLLMs), we can now converse with AI systems in natural language to understand images. However, the reasoning power and world knowledge embedded in large language models have been much less investigated and exploited for image perception tasks. In this work, we propose <b>Lenna</b>, a <b>L</b>anguage <b>e</b>nhanced reaso<b>n</b>ing detectio<b>n</b> <b>a</b>ssistant, which utilizes the robust multimodal feature representation of MLLMs while preserving location information for detection. This is achieved by incorporating an additional <b>&lt;DET&gt;</b> token in the MLLM vocabulary that is free of explicit semantic context but serves as a prompt for the detector to identify the corresponding position. To evaluate the reasoning capability of Lenna, we construct a ReasonDet dataset to measure its performance on reasoning-based detection. For more details, please refer to the <a href="https://arxiv.org/pdf/2312.02433.pdf">paper</a>. </div> </p> <p> <p align="center"><img src="./assets/lenna.png" alt="teaser" width="600px" /></p> <p align="center">Lenna Architecture</p> </p>
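
The `<DET>` idea described above can be pictured with a toy sketch. This is not Lenna's implementation: the whitespace tokenizer, the `ToyVocab` class, the random embedding table, and the `det_prompt_embedding` helper are all hypothetical stand-ins chosen for illustration. The sketch only shows the general pattern of registering an extra token in a vocabulary, locating it in a token sequence, and taking the vector at that position as the prompt a detector could consume. In a real MLLM that vector would be a contextual hidden state, not a static table lookup.

```python
import random

DET_TOKEN = "<DET>"  # the special token named in the abstract


class ToyVocab:
    """Hypothetical stand-in for an MLLM tokenizer vocabulary."""

    def __init__(self, words):
        self.token_to_id = {w: i for i, w in enumerate(words)}

    def add_token(self, token):
        # Append the token with a fresh id if it is not already present.
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.token_to_id)
        return self.token_to_id[token]

    def encode(self, text):
        # Whitespace tokenization: an assumption made for brevity.
        return [self.token_to_id[w] for w in text.split()]


def make_embeddings(vocab_size, dim, seed=0):
    # Random vectors stand in for the model's per-token representations.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]


def det_prompt_embedding(token_ids, det_id, embeddings):
    """Return the vector for the first <DET> token, or None if absent."""
    for tid in token_ids:
        if tid == det_id:
            return embeddings[tid]
    return None


vocab = ToyVocab(["the", "cup", "is", "here"])
det_id = vocab.add_token(DET_TOKEN)  # vocabulary grows by one entry
embeddings = make_embeddings(len(vocab.token_to_id), dim=4)

ids = vocab.encode("the cup is here <DET>")
prompt = det_prompt_embedding(ids, det_id, embeddings)
print(det_id, ids, len(prompt))
```

The vector returned by `det_prompt_embedding` plays the role of the detector prompt; how Lenna actually projects and consumes it is described in the paper, not here.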

Getting Started

1. Installation

2. Prepare Lenna checkpoint

3. Inference

Updates

Cite

@article{wei2023lenna,
  title={Lenna: Language enhanced reasoning detection assistant},
  author={Wei, Fei and Zhang, Xinyu and Zhang, Ailing and Zhang, Bo and Chu, Xiangxiang},
  journal={arXiv preprint arXiv:2312.02433},
  year={2023}
}

Acknowledgement

This repo benefits from LISA, GroundingDINO, LLaVA and Vicuna.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.