<div align="center"> <!-- <h1>JiuTian (九天) </h1> --> <h2 class="papername"> <img src="./assets/LION_logo.png" style="vertical-align: middle; height: 1em; padding: 0 0.2em;"> LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge </h2> <div> <div> <a href="https://scholar.google.com/citations?user=Mpg0w3cAAAAJ" target="_blank">Gongwei Chen</a>, <a href="https://www.slywiki.cn/" target="_blank">Leyang Shen</a>, <a href="https://rshaojimmy.github.io/" target="_blank">Rui Shao*</a>, <a href="https://xiang-deng-dl.github.io/" target="_blank">Xiang Deng</a>, <a href="https://liqiangnie.github.io/" target="_blank">Liqiang Nie*</a> </div>School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen<br> *Corresponding author
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024
[Paper] [Project Page] [Video(YouTube)] [Video(bilibili)]
:fire: Details will be released. Stay tuned :beers: :+1:
</div> <br> <img src='assets/LION-Introduction.jpg' width='90%'> </div>

If you find this work useful for your research, please kindly cite our paper and star our repo.
## Updates
- [07/2024] Code and checkpoints are released.
- [02/2024] LION has been accepted by CVPR 2024.
- [11/2023] arXiv paper released.
- [11/2023] Project page released.
## Introduction
This is the GitHub repository of LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge. In this work, we enhance MLLMs by integrating fine-grained spatial-aware visual knowledge and high-level semantic visual evidence, boosting their capabilities and alleviating hallucinations.
The framework of the proposed LION model:
<div align="center"> <img src='./assets/LION-Method.jpg' width='100%'> </div>Installation
### Download

```bash
git clone https://github.com/JiuTian-VL/JiuTian-LION.git
cd JiuTian-LION
```
### Environment

```bash
conda create -n LION python=3.12
conda activate LION
conda install pip
pip install -r requirements.txt
```
## Checkpoints

| Version | Checkpoint |
| --- | --- |
| LION-FlanT5-XL | daybreaksly/LION-FlanT5-XL |
| LION-FlanT5-XXL | daybreaksly/LION-FlanT5-XXL |
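The checkpoint names above look like Hugging Face Hub repository IDs. If that is the case, a minimal sketch for fetching one locally is shown below (the repo ID is taken from the table; the target directory is an arbitrary choice, and `huggingface_hub` must be installed).

```python
# Hedged sketch: fetch a LION checkpoint, assuming the IDs in the table are
# Hugging Face Hub repositories (requires `pip install huggingface_hub`).
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="daybreaksly/LION-FlanT5-XL",    # repo ID from the table above
    local_dir="checkpoints/LION-FlanT5-XL",  # arbitrary local target directory
)
print("Checkpoint files downloaded to:", ckpt_dir)
```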
## Usage

### Prepare models
- Download the pre-trained ViT model eva_vit_g.
- Download the pre-trained RAM model ram_swin_large_14m.
- Download the pre-trained FlanT5 model FlanT5-XL.
- Download the pre-trained BERT model bert-base-uncased.
- Fill in the paths to these models at the corresponding locations in the config file `configs/models/lion_flant5xl.yaml` (a quick sanity check for the filled-in paths is sketched below).
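After editing the config, the following rough sketch loads the YAML and warns about any path-looking value that does not exist on disk. It makes no assumptions about the actual field names inside the config; the walk is generic, and the heuristic may also flag non-path strings such as Hub repo IDs.

```python
# Minimal sanity check for configs/models/lion_flant5xl.yaml: walk every string
# value and warn if it looks like a local path but is missing on disk.
# (Heuristic only; requires `pip install pyyaml`.)
import os
import yaml

with open("configs/models/lion_flant5xl.yaml") as f:
    cfg = yaml.safe_load(f)

def warn_missing_paths(node, prefix=""):
    if isinstance(node, dict):
        for key, value in node.items():
            warn_missing_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            warn_missing_paths(value, f"{prefix}{i}.")
    elif isinstance(node, str) and "/" in node:
        if not os.path.exists(node):
            print(f"[warn] {prefix.rstrip('.')}: '{node}' not found on disk")

warn_missing_paths(cfg)
```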
### Inference
We provide inference examples for Image-Level and Region-Level tasks in `playground.ipynb`.
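If you prefer to run the notebook non-interactively (for example, on a remote GPU machine), a small sketch using `nbclient` is given below. It only assumes that `playground.ipynb` sits in the repository root and that the environment above is active.

```python
# Execute playground.ipynb end-to-end without opening Jupyter
# (requires `pip install nbformat nbclient`; paths assume the repo root).
import nbformat
from nbclient import NotebookClient

nb = nbformat.read("playground.ipynb", as_version=4)
NotebookClient(nb, timeout=1200).execute()       # may take a while (model loading)
nbformat.write(nb, "playground_executed.ipynb")  # save the executed copy with outputs
```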
## Evaluation results
For <b>image-level</b> tasks, we focus on image captioning and Visual Question Answering (VQA). For <b>region-level</b> tasks, we evaluate LION on three referring expression comprehension (REC) datasets: RefCOCO, RefCOCO+, and RefCOCOg. The results, detailed in Tables 1-2, highlight LION's superior performance compared to baseline models.
We further evaluate LION on an object hallucination benchmark (POPE) and the widely used MLLM benchmark MMBench. The results in Tables 1-2 show that LION performs strongly across a variety of skills and demonstrates strong resistance to hallucination, particularly in the popular and adversarial settings of POPE.
## Qualitative Comparison

### More Examples
## Citation
If you find this work useful for your research, please kindly cite our paper:
```bibtex
@inproceedings{chen2024lion,
  title={LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge},
  author={Chen, Gongwei and Shen, Leyang and Shao, Rui and Deng, Xiang and Nie, Liqiang},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```