<div align="center"> <img src="assets/figures/sphinx-v-logo.png" style="width: 8%; height: auto;" alt="Second Image"/> <img src="assets/figures/sphinx-v_text.png" style="width: 35%; height: auto;" alt="SPHINX-V Logo"/> </div> <div align="center">

🎨 Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao

Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang and Hongsheng Li

</div> <div align="center">

[๐ŸŒ Project Page] [๐Ÿ“– Paper] [๐Ÿค— MDVP-Data] [๐Ÿค— MDVP-Bench] [๐Ÿค–๏ธ Model] [๐ŸŽฎ Demo]

</div>

💥 News


👀 Introduction

How humans interact with artificial intelligence (AI) is a key measure of how effective multimodal large language models (MLLMs) really are. However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, which constrains their flexibility in use and the depth of their responses. We therefore introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.

<p align="center"> <img src="assets/figures/fig1.jpg" width="90%"> <br> </p>

The model, SPHINX-V, is a new multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. SPHINX-V supports multiple visual prompts of various types simultaneously, significantly enhancing user flexibility and achieving a fine-grained, open-world understanding of visual prompts.

<p align="center"> <img src="assets/figures/fig2.jpg" width="90%"> <br> </p>

🚀 Examples

<details> <summary>🔍 Natural Image Domain</summary> <p align="center"> <img src="assets/figures/ver1.jpg" width="100%"> <br> </p> </details> <details> <summary>🔍 OCR Image Domain</summary> <p align="center"> <img src="assets/figures/ver2.jpg" width="100%"> <br> </p> </details> <details> <summary>🔍 Mobile/Website Screenshot Domain</summary> <p align="center"> <img src="assets/figures/ver3.jpg" width="100%"> <br> </p> </details> <details> <summary>🔍 Multi-panel Image Domain</summary> <p align="center"> <img src="assets/figures/ver4.jpg" width="100%"> <br> </p> </details>

🛠️ Install

  1. Clone this repository and navigate to the Draw-and-Understand folder
git clone https://github.com/AFeng-x/Draw-and-Understand.git
cd Draw-and-Understand
  2. Install packages
# Create a new conda environment named 'sphinx-v' with Python 3.10
conda create -n sphinx-v python=3.10 -y
# Activate the 'sphinx-v' environment
conda activate sphinx-v
# Install required packages from 'requirements.txt'
pip install -r requirements.txt
  3. Optional: Install Flash-Attention
# Draw-and-Understand is powered by flash-attention for efficient attention computation.
pip install flash-attn --no-build-isolation
  4. Install Draw-and-Understand as a Python package
# go to the root directory of Draw-and-Understand
cd Draw-and-Understand
# install Draw-and-Understand
pip install -e .
# After this, you will be able to invoke "import SPHINX_V" from any working directory (see the quick check after these steps).
  5. To enable the segmentation ability shown in our official demo, SAM is also needed:
pip install git+https://github.com/facebookresearch/segment-anything.git
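
To confirm the installation, a quick sanity check along these lines should work (a sketch, assuming the steps above completed without errors; the second line only applies if you installed the optional Flash-Attention in step 3):

# Verify that SPHINX_V is importable from any working directory
python -c "import SPHINX_V; print('SPHINX-V package found')"
# Optional: verify Flash-Attention if it was installed in step 3
python -c "import flash_attn; print('flash-attn available')"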

🤖️ Checkpoints

SPHINX-V-13b Stage-1 Pre-training Weights: 🤗 Hugging Face / Baidu

SPHINX-V-13b Stage-2 Fine-tuning Weights: 🤗 Hugging Face / Baidu

Other required weights and configurations: 🤗 Hugging Face

Please download them to your own machine. The file structure should appear as follows:

accessory/checkpoints/sphinx-v/stage2
├── consolidated.00-of-02.model.pth
├── consolidated.01-of-02.model.pth
├── tokenizer.model
├── config.json
└── meta.json
accessory/checkpoints/llama-2-13b
├── params.json

accessory/checkpoints/tokenizer
├── tokenizer.model
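
If you prefer to fetch the Hugging Face files from the command line, a sketch like the following can place them into the layout above. The repository IDs below are placeholders, not the real names; substitute the repositories linked in this section.

# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
# <stage2-repo-id> and <extras-repo-id> are placeholders for the repos linked above
huggingface-cli download <stage2-repo-id> --local-dir accessory/checkpoints/sphinx-v/stage2
huggingface-cli download <extras-repo-id> --local-dir accessory/checkpoints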

๐Ÿ“ MDVP-Dataset

<p align="center"> <img src="assets/figures/fig3.jpg" width="70%"> <br> </p>

🚀 Training

📈 Evaluation

See evaluation for details.

🛩️ Inference

We provide a simple inference example in inference.py.

You can launch this script with torchrun --master_port=1112 --nproc_per_node=1 inference.py
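
For reference, the torchrun flags in that command mean the following; the values shown are just the ones from the line above, so adjust the port if 1112 is already taken, and raise --nproc_per_node only if the script is written for multi-GPU launch.

# --master_port: TCP port used for the local process-group rendezvous
# --nproc_per_node: number of processes (typically one per GPU) to launch
torchrun --master_port=1112 --nproc_per_node=1 inference.py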

๐Ÿช Host Local Demo

💻 Requirements:

  1. Prepare the SPHINX-V stage-2 checkpoints and the ViT-H SAM model, and place them in the accessory/checkpoints/ directory (a download sketch for the SAM weights follows these steps).
  2. Make sure you have installed Segment Anything.
  3. Run:
cd accessory/demos
bash run.sh
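
If you still need the ViT-H SAM weights mentioned in step 1, they are available from the official Segment Anything release. The target directory below is an assumption; place the file wherever the demo script expects it under accessory/checkpoints/.

# Official ViT-H SAM checkpoint from the Segment Anything release
wget -P accessory/checkpoints https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth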

💌 Acknowledgement

🖊️ Citation

If you find our Draw-and-Understand project useful for your research and applications, please kindly cite using this BibTeX:

@misc{lin2024drawandunderstand,
      title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want}, 
      author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
      year={2024},
      eprint={2403.20271},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}