<img src="assets/som_logo.png" alt="Logo" width="40" height="40" align="left"> Set-of-Mark Visual Prompting for GPT-4V
:grapes: [Read our arXiv Paper] :apple: [Project Page]
Jianwei Yang*⚑, Hao Zhang*, Feng Li*, Xueyan Zou*, Chunyuan Li, Jianfeng Gao
* Core Contributors ⚑ Project Lead
## Introduction

We present **Set-of-Mark (SoM) prompting**: simply overlaying a set of spatial, speakable marks on an image to unleash the visual grounding abilities of the strongest LMM to date, GPT-4V. Let's use visual prompting for vision!

## GPT-4V + SoM Demo
https://github.com/microsoft/SoM/assets/3894247/8f827871-7ebd-4a5e-bef5-861516c4427b
## 🔥 News

- [11/21] Thanks to Roboflow and @SkalskiP, a Hugging Face demo for SoM + GPT-4V is online! Try it out!
- [11/07] We released the vision benchmark we used to evaluate GPT-4V with SoM prompting! Check out the benchmark page!
- [11/07] Now that the GPT-4V API has been released, we are releasing a demo integrating SoM into GPT-4V (a minimal API sketch follows this list)!

  ```bash
  export OPENAI_API_KEY=YOUR_API_KEY
  python demo_gpt4v_som.py
  ```
- [10/23] We released the SoM toolbox code for generating set-of-mark prompts for GPT-4V. Try it out!
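For context, here is a minimal sketch of what a demo like `demo_gpt4v_som.py` has to do under the hood: send the SoM-annotated image to GPT-4V together with a question. It uses the public OpenAI Python SDK; the helper name and model string are illustrative assumptions, not the repository's exact code.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4v(marked_image_path: str, question: str) -> str:
    # Encode the SoM-annotated image as a base64 data URL.
    with open(marked_image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name; use your available vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return resp.choices[0].message.content

# e.g. ask_gpt4v("marked.png", "What is the object labeled 3?")
```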
## 🔗 Fascinating Applications
Fascinating applications of SoM in GPT-4V:
- [11/13/2023] Smartphone GUI Navigation boosted by Set-of-Mark Prompting
- [11/05/2023] Zero-shot Anomaly Detection with GPT-4V and SoM prompting
- [10/21/2023] Web UI Navigation Agent inspired by Set-of-Mark Prompting
- [10/20/2023] Set-of-Mark Prompting Reimplementation by @SkalskiP from Roboflow
## 🔗 Related Works

Our method builds on the following models to generate the set of marks:
- Mask DINO: State-of-the-art closed-set image segmentation model
- OpenSeeD: State-of-the-art open-vocabulary image segmentation model
- GroundingDINO: State-of-the-art open-vocabulary object detection model
- SEEM: Versatile, promptable, interactive and semantic-aware segmentation model
- Semantic-SAM: Segment and recognize anything at any granularity
- Segment Anything: Segment anything
We are standing on the shoulders of the giant GPT-4V (playground)!
## :rocket: Quick Start

- Install segmentation packages

  ```bash
  # install SEEM
  pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package

  # install SAM
  pip install git+https://github.com/facebookresearch/segment-anything.git

  # install Semantic-SAM
  pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package

  # install Deformable Convolution for Semantic-SAM
  cd ops && sh make.sh && cd ..

  # common error fix:
  python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
  ```

- Download the pretrained models

  ```bash
  sh download_ckpt.sh
  ```

- Run the demo

  ```bash
  python demo_som.py
  ```
And you will see this interface:
## Deploy to AWS

To deploy SoM to EC2 on AWS via GitHub Actions:

- Fork this repository and clone your fork to your local machine.
- Follow the instructions at the top of `deploy.py`.
## :point_right: Comparing standard GPT-4V and its combination with SoM Prompting

## :round_pushpin: SoM Toolbox for image partition
Users can select which granularity of masks to generate and which mode to use, automatic (top) or interactive (bottom). A higher alpha-blending value (0.4) is used for better visualization.
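To make the automatic mode concrete, below is a minimal sketch of the core idea using vanilla SAM: generate masks, alpha-blend each region, and draw a numeric mark near its center. This is an illustrative approximation of the toolbox, not its actual implementation; the checkpoint path and centroid-based mark placement are assumptions.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (path assumed; see download_ckpt.sh).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("input.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts with a boolean "segmentation" mask

overlay = image.copy()
alpha = 0.4  # the alpha-blending value mentioned above
for i, m in enumerate(masks, start=1):
    seg = m["segmentation"]
    color = np.random.randint(0, 256, 3)
    overlay[seg] = (1 - alpha) * overlay[seg] + alpha * color
    # Draw the numeric mark at the mask centroid (a simplification;
    # the toolbox places marks more carefully).
    ys, xs = np.nonzero(seg)
    cv2.putText(overlay, str(i), (int(xs.mean()), int(ys.mean())),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)

cv2.imwrite("marked.png", cv2.cvtColor(overlay, cv2.COLOR_RGB2BGR))
```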
## :unicorn: Interleaved Prompt
SoM enables interleaved prompts that mix textual and visual content, where image regions can be referenced by their mark indices. <img width="975" alt="Screenshot 2023-10-18 at 10 06 18" src="https://github.com/microsoft/SoM/assets/34880758/859edfda-ab04-450c-bd28-93762460ac1d">
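As a concrete illustration (ours, not an example from the paper), a purely textual prompt can then point at image regions by their mark indices:

```python
# Hypothetical interleaved prompt: the bracketed numbers refer to the
# numeric marks overlaid on the image, interleaving text with visual regions.
question = (
    "The image has a numeric mark on each object. "
    "What is the object at mark [3], and how is it positioned "
    "relative to the object at mark [5]?"
)
# e.g. ask_gpt4v("marked.png", question) using the sketch from the News section
```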
## :medal_military: Mark types used in SoM

## :volcano: Evaluation tasks examples
<img width="946" alt="Screenshot 2023-10-18 at 10 12 18" src="https://github.com/microsoft/SoM/assets/34880758/f5e0c0b0-58de-4b60-bf01-4906dbcb229e">Use case
### :tulip: Grounded Reasoning and Cross-Image Reference

<img width="972" alt="Screenshot 2023-10-18 at 10 10 41" src="https://github.com/microsoft/SoM/assets/34880758/033cd16c-876c-4c03-961e-590a4189bc9e">In comparison to GPT-4V without SoM, adding marks enables GPT-4V to ground its reasoning on detailed contents of the image (left). Clear cross-image object references are observed on the right.
### :camping: Problem Solving

<img width="972" alt="Screenshot 2023-10-18 at 10 18 03" src="https://github.com/microsoft/SoM/assets/34880758/8b112126-d164-47d7-b18c-b4b51b903d57">Case study on solving a CAPTCHA. GPT-4V gives a wrong answer with a wrong number of squares, while it finds the correct squares with the corresponding marks after SoM prompting.
### :mountain_snow: Knowledge Sharing

<img width="733" alt="Screenshot 2023-10-18 at 10 18 44" src="https://github.com/microsoft/SoM/assets/34880758/dc753c3f-ada8-47a4-83f1-1576bcfb146a">Case study on an image of a dish for GPT-4V. GPT-4V does not produce a grounded answer with the original image. With SoM prompting, GPT-4V not only names the ingredients but also corresponds them to the marked regions.
### :mosque: Personalized Suggestion

<img width="733" alt="Screenshot 2023-10-18 at 10 19 12" src="https://github.com/microsoft/SoM/assets/34880758/88188c90-84f2-49c6-812e-44770b0c2ca5">SoM-prompted GPT-4V gives very precise suggestions, while the original one fails and even hallucinates foods, e.g., soft drinks.
### :blossom: Tool Usage Instruction

<img width="734" alt="Screenshot 2023-10-18 at 10 19 39" src="https://github.com/microsoft/SoM/assets/34880758/9b35b143-96af-41bd-ad83-9c1f1e0f322f"> Likewise, GPT-4V with SoM can provide thorough tool-usage instructions, teaching users the function of each button on a controller. Note that this image is not fully labeled, yet GPT-4V can also provide information about the unlabeled buttons.

### :sunflower: 2D Game Planning
<img width="730" alt="Screenshot 2023-10-18 at 10 20 03" src="https://github.com/microsoft/SoM/assets/34880758/0bc86109-5512-4dee-aac9-bab0ef96ed4c">GPT-4V with SoM gives a reasonable suggestion on how to achieve a goal in a gaming scenario.
### :mosque: Simulated Navigation

<img width="729" alt="Screenshot 2023-10-18 at 10 21 24" src="https://github.com/microsoft/SoM/assets/34880758/7f139250-5350-4790-a35c-444ec2ec883b">

## :deciduous_tree: Results
We conduct experiments on various vision tasks to verify the effectiveness of SoM. Results show that GPT-4V with SoM outperforms specialist models on most vision tasks and is comparable to MaskDINO on COCO panoptic segmentation.
## :black_nib: Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

```bibtex
@article{yang2023setofmark,
  title={Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V},
  author={Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chunyuan Li and Jianfeng Gao},
  journal={arXiv preprint arXiv:2310.11441},
  year={2023}
}
```