👀SEEM: Segment Everything Everywhere All at Once

We introduce SEEM, a model that can Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types, including visual prompts (points, marks, boxes, scribbles, and image segments) and language prompts (text and audio). It can also work with any combination of prompts or generalize to custom prompts!

:grapes: [Read our arXiv Paper]   :apple: [Try our Demo]

One-Line Getting Started with Linux:

git clone git@github.com:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && cd Segment-Everything-Everywhere-All-At-Once/demo_code && sh run_demo.sh

:point_right: [New] Latest Checkpoints and Numbers:

| Method | Checkpoint | Backbone | COCO PQ | COCO mAP | COCO mIoU | Ref-COCOg cIoU | Ref-COCOg mIoU | Ref-COCOg AP50 | VOC NoC85 | VOC NoC90 | SBD NoC85 | SBD NoC90 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X-Decoder | ckpt | Focal-T | 50.8 | 39.5 | 62.4 | 57.6 | 63.2 | 71.6 | - | - | - | - |
| X-Decoder-oq201 | ckpt | Focal-L | 56.5 | 46.7 | 67.2 | 62.8 | 67.5 | 76.3 | - | - | - | - |
| SEEM | ckpt | Focal-T | 50.6 | 39.4 | 60.9 | 58.5 | 63.5 | 71.6 | 3.54 | 4.59 | * | * |
| SEEM | - | Davit-d3 | 56.2 | 46.8 | 65.3 | 63.2 | 68.3 | 76.6 | 2.99 | 3.89 | 5.93 | 9.23 |
| SEEM-oq101 | ckpt | Focal-L | 56.2 | 46.4 | 65.5 | 62.8 | 67.7 | 76.2 | 3.04 | 3.85 | * | * |

:fire: Related projects:

:fire: Other projects you may find interesting:

:rocket: Updates

<p float="left"> <img src="https://user-images.githubusercontent.com/11957155/233255289-35c0c1e2-35f7-48e4-a7e9-68da50c839d3.gif" width="400" /> <img src="https://user-images.githubusercontent.com/11957155/233526415-a0a44963-19a3-4e56-965a-afaa598e6127.gif" width="400" /> </p>

:bulb: Highlights

Inspired by the appealing universal interface in LLMs, we are advocating a universal, interactive multi-modal interface for any type of segmentation with ONE SINGLE MODEL. We emphasize 4 important features of SEEM below.

  1. Versatility: work with various types of prompts, for example, clicks, boxes, polygons, scribbles, texts, and referring images;
  2. Compositionality: deal with any composition of prompts (see the sketch after this list);
  3. Interactivity: interact with the user over multiple rounds, thanks to the memory prompt of SEEM that stores the session history;
  4. Semantic awareness: give a semantic label to any predicted mask.
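
To make the compositionality point concrete, here is a minimal sketch of what a prompt-compositional interface could look like. The `Prompts` dataclass and the `segment` stub below are hypothetical placeholders for illustration only; they are not the actual SEEM API (see the demo code for the real entry points).

```python
# Hypothetical sketch of a prompt-compositional interface (NOT the actual SEEM API).
# The idea: every prompt type can be supplied alone or in any combination.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Prompts:
    """A bag of prompts that may be combined in a single query."""
    points: List[Tuple[int, int]] = field(default_factory=list)          # click coordinates (x, y)
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list) # (x0, y0, x1, y1)
    scribble_mask: Optional[object] = None    # binary mask drawn by the user
    text: Optional[str] = None                 # free-form language prompt
    referring_image: Optional[object] = None   # image region used as a visual reference

def segment(image, prompts: Prompts):
    """Placeholder for a model call: would return (masks, semantic_labels)."""
    raise NotImplementedError("illustrative stub only")

# Example: combine a click with a text prompt in one query.
# prompts = Prompts(points=[(320, 240)], text="the zebra on the left")
# masks, labels = segment(image, prompts)
```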

SEEM design: a brief introduction to all the generic and interactive segmentation tasks SEEM can do.

:unicorn: How to use the demo

:volcano: An interesting example

An example with Transformers: the referred image is the truck form of Optimus Prime. Our model can always segment Optimus Prime in target images no matter which form it takes. Thanks to Hongyang Li for this fun example.

<div align="center"> <img src="assets/transformers_gh.png" width = "700" alt="assets/transformers_gh.png" align=center /> </div>

:tulip: NERF Examples

<p float="left"> <img src="https://user-images.githubusercontent.com/11957155/234230320-2189056d-1c89-4f0c-88da-851d12e8323c.gif" width="400" /> <img src="https://user-images.githubusercontent.com/11957155/234231284-0adc4bae-ef90-41d3-9883-41f6407a883b.gif" width="400" /> </p>

:camping: Click, scribble to mask

With a simple click or stroke from the user, we can generate the masks and the corresponding category labels for them.

SEEM design

:mountain_snow: Text to mask

SEEM can generate the mask from a text input given by the user, providing multi-modal interaction with humans.


<!-- <div align="center"> <img src="assets/text.png" width = "700" alt="assets/text.png" align=center /> </div> -->

:mosque: Referring image to mask

With a simple click or stroke on the referring image, the model is able to segment objects with similar semantics in the target images.

SEEM understands spatial relationships very well. Look at the three zebras! The segmented zebras have positions similar to those of the referred zebras. For example, when the leftmost zebra in the upper row is referred to, the leftmost zebra in the bottom row is segmented.

:blossom: Referring image to video mask

No training on video data is needed; SEEM works perfectly for segmenting videos with whatever queries you specify!

:sunflower: Audio to mask

We use Whisper to turn audio into a text prompt, which is then used to segment the object. Try it in our demo!
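
As a rough sketch of this pipeline, the snippet below transcribes an audio file into a text prompt (assuming the `openai-whisper` package); the `segment_with_text` call is a hypothetical placeholder, not the actual SEEM entry point.

```python
# Sketch: speech -> text prompt -> segmentation (the segmentation call is a placeholder).
import whisper  # pip install openai-whisper

def audio_to_text_prompt(audio_path: str) -> str:
    """Transcribe an audio file into a text prompt using Whisper."""
    model = whisper.load_model("base")     # small general-purpose Whisper model
    result = model.transcribe(audio_path)  # returns a dict containing the transcript
    return result["text"].strip()

# text_prompt = audio_to_text_prompt("query.wav")   # e.g. "the dog on the sofa"
# masks = segment_with_text(image, text_prompt)     # hypothetical SEEM call
```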

<div align="center"> <img src="assets/audio.png" width = "900" alt="assets/audio.png" align=center /> </div> <!-- ## 🔥 Combination of different prompts to mask -->

:deciduous_tree: Examples of different styles

An example of segmenting a meme.

<div align="center"> <img src="assets/emoj.png" width = "500" alt="assets/emoj.png" align=center /> </div>

An example of segmenting trees in cartoon style.

<div align="center"> <img src="assets/trees_text.png" width = "700" alt="assets/trees_text.png" align=center /> </div>

An example of segmenting a Minecraft image.

<div align="center"> <img src="assets/minecraft.png" width = "700" alt="assets/minecraft.png" align=center /> </div> <!-- ![example](assets/minecraft.png?raw=true) --> An example of using a referring image on a popular teddy bear.


Model

SEEM design

Comparison with SAM

In the following figure, we compare the levels of interaction and semantics of three segmentation tasks (edge detection, open-set segmentation, and interactive segmentation). Open-set segmentation usually requires a high level of semantics and does not require interaction. Compared with SAM, SEEM covers a wider range of interaction and semantics levels. For example, SAM only supports limited interaction types such as points and boxes, and it misses high-semantic tasks since it does not output semantic labels itself. The reasons are twofold. First, SEEM has a unified prompt encoder that encodes all visual and language prompts into a joint representation space; as a consequence, SEEM supports more general usage and has the potential to extend to custom prompts. Second, SEEM works very well on text-to-mask (grounding segmentation) and outputs semantic-aware predictions.
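
To make the idea of a joint representation space concrete, the sketch below shows a conceptual prompt encoder that projects clicks, boxes, and text tokens into embeddings of the same dimension so they can be mixed freely. This is an illustration of the concept only, not the actual SEEM implementation.

```python
# Conceptual sketch of a unified prompt encoder (illustration only, not the SEEM code).
import torch
import torch.nn as nn

class UnifiedPromptEncoder(nn.Module):
    """Maps heterogeneous prompts (points, boxes, text) into one joint embedding space,
    so a single decoder can consume any combination of them through the same interface."""

    def __init__(self, dim: int = 256, vocab_size: int = 30522):
        super().__init__()
        self.point_proj = nn.Linear(2, dim)              # (x, y) click -> embedding
        self.box_proj = nn.Linear(4, dim)                # (x0, y0, x1, y1) box -> embedding
        self.text_embed = nn.Embedding(vocab_size, dim)  # token ids -> embeddings

    def forward(self, points=None, boxes=None, text_ids=None):
        tokens = []
        if points is not None:
            tokens.append(self.point_proj(points))    # (N_pts, dim)
        if boxes is not None:
            tokens.append(self.box_proj(boxes))       # (N_box, dim)
        if text_ids is not None:
            tokens.append(self.text_embed(text_ids))  # (N_tok, dim)
        # All prompts now live in the same space and can be concatenated freely.
        return torch.cat(tokens, dim=0)

# enc = UnifiedPromptEncoder()
# joint = enc(points=torch.tensor([[0.3, 0.7]]), text_ids=torch.tensor([101, 2088, 102]))
# joint.shape -> torch.Size([4, 256])
```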

<div align="center"> <img src="assets/compare.jpg" width = "500" alt="assets/compare.jpg" align=center /> </div> <!-- This figure shows a comparison of our model with concurrent work SAM on the level of interactions and semantics. The x-axis and y-axis denote the level of interaction and semantics, respectively. Three segmentation tasks are shown, including Open-set Segmentation, Edge detection, and Interactive Segmentation. These tasks have different levels of interactions and semantics. For example, Open-set Segmentation usually requires a high level of semantics and does not require interaction. Compared with SAM, our model covers a wider range of interaction and semantics levels. For example, SAM only supports limited interaction types like points and boxes, while misses high-semantic tasks since it does not output semantic labels itself. Note that although we do not report edge detection results, our model can support it by simply converting masks to edges. -->

:bookmark_tabs: Catalog

:cupid: Acknowledgements

<!-- ## Citation (update when paper is available on arxiv) If you find this project helpful for your research, please consider citing the following BibTeX entry. ```BibTex ``` -->