Semantic-SAM: Segment and Recognize Anything at Any Granularity

In this work, we introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity. We trained it on the whole SA-1B dataset, and our model can reproduce SAM and go beyond it.

:grapes: [Read our arXiv Paper]  

:apple: [Try Auto Generation with Controllable Granularity Demo]   :apple: [Try Interactive Multi-Granularity Demo]  

:rocket: Features

:fire: Reproduce SAM. SAM training is a sub-task of ours. We have released the training code to reproduce SAM training.

:fire: Beyond SAM. Our newly proposed model offers the following attributes from instance to part level:

:rocket: News

:fire: We release the training and inference code and demo link of DINOv, which can handle in-context visual prompts for open-set and referring detection & segmentation. Check it out!

:fire: We release the demo code for controllable mask auto-generation with different granularity prompts!

Segment everything for one image. Given different granularity prompts, we output masks at a controllable granularity, from semantic and instance level down to part level.

:fire: We release the demo code for mask auto-generation!

Segment everything for one image. We output more masks with more granularity.

:fire: We release the demo code for interactive segmentation! One click outputs up to 6 masks at different granularities. Try it in our demo!

:fire: We release the training and inference code and checkpoints (SwinT, SwinL) trained on SA-1B!

:fire: We release the training code to reproduce SAM!

[Figure: teaser_xyz]

Our model supports a wide range of segmentation tasks and their related applications, including:

👉: Related projects:

:unicorn: Getting Started

:hammer_and_wrench: Installation

pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu113
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
pip install git+https://github.com/cocodataset/panopticapi.git
git clone https://github.com/UX-Decoder/Semantic-SAM
cd Semantic-SAM
python -m pip install -r requirements.txt

export DATASET=/path/to/dataset  # path to your coco data

:star: A few lines to get generated results

First download a checkpoint from the model zoo.

# Interactive inference: one point prompt yields masks at multiple granularities
from semantic_sam import prepare_image, plot_multi_results, build_semantic_sam, SemanticSAMPredictor
original_image, input_image = prepare_image(image_pth='examples/dog.jpg')  # change the image path to your image
mask_generator = SemanticSAMPredictor(build_semantic_sam(model_type='<model_type>', ckpt='</your/ckpt/path>'))  # model_type: 'L' / 'T', depending on your checkpoint
iou_sort_masks, area_sort_masks = mask_generator.predict_masks(original_image, input_image, point='<your prompts>')  # input point [[w, h]] is a relative location, e.g., [[0.5, 0.5]] is the center of the image
plot_multi_results(iou_sort_masks, area_sort_masks, original_image, save_path='../vis/')  # results and original images will be saved at save_path

# Automatic mask generation: segment everything in the image
from semantic_sam import prepare_image, plot_results, build_semantic_sam, SemanticSamAutomaticMaskGenerator
original_image, input_image = prepare_image(image_pth='examples/dog.jpg')  # change the image path to your image
mask_generator = SemanticSamAutomaticMaskGenerator(build_semantic_sam(model_type='<model_type>', ckpt='</your/ckpt/path>'))  # model_type: 'L' / 'T', depending on your checkpoint
masks = mask_generator.generate(input_image)
plot_results(masks, original_image, save_path='../vis/')  # results and original images will be saved at save_path
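
The point prompt above is given in relative coordinates. If your click comes from a UI in pixel coordinates, a small conversion helper may be handy. The following is a minimal sketch; `pixel_to_relative_point` is a hypothetical helper (not part of `semantic_sam`), and it assumes `w` is the horizontal fraction and `h` the vertical fraction of the image size.

```python
def pixel_to_relative_point(x_px, y_px, image_width, image_height):
    """Convert a pixel click (x from the left, y from the top) into a
    relative [[w, h]] point prompt. Hypothetical helper, not a library API."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    if not (0 <= x_px <= image_width and 0 <= y_px <= image_height):
        raise ValueError("click lies outside the image")
    # Assumption: w is the horizontal fraction, h the vertical fraction.
    return [[x_px / image_width, y_px / image_height]]


# A click at the center of a 640x480 image.
point = pixel_to_relative_point(320, 240, image_width=640, image_height=480)
print(point)  # [[0.5, 0.5]]
```

The returned list can then be passed as `point=` to `predict_masks` in the interactive example.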

Advanced usage:

semantic_sam = build_semantic_sam(model_type='<model_type>', ckpt='</your/ckpt/path>')  # build the model once and reuse it
mask_generator = SemanticSamAutomaticMaskGenerator(semantic_sam, level=[1]) # [1] and [2] for semantic level.
mask_generator = SemanticSamAutomaticMaskGenerator(semantic_sam, level=[3]) # [3] for instance level.
mask_generator = SemanticSamAutomaticMaskGenerator(semantic_sam, level=[6]) # [4], [5], [6] for different part levels.
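
To avoid remembering the level numbers, the mapping in the comments above can be written down once. This is only a sketch: `GRANULARITY_LEVELS` and `levels_for` are hypothetical names, and combining several levels in one list is an assumption suggested by the list-typed `level` argument, not something stated here.

```python
# Level numbers as listed in the comments above (names are our own).
GRANULARITY_LEVELS = {
    "semantic": [1, 2],
    "instance": [3],
    "part": [4, 5, 6],
}


def levels_for(*names):
    """Return the sorted level numbers for the given granularity names."""
    levels = []
    for name in names:
        if name not in GRANULARITY_LEVELS:
            raise KeyError(f"unknown granularity: {name!r}")
        levels.extend(GRANULARITY_LEVELS[name])
    return sorted(set(levels))


print(levels_for("instance"))          # [3]
print(levels_for("instance", "part"))  # [3, 4, 5, 6]

# Possible use with the generator above (assumes multi-level lists are accepted):
# mask_generator = SemanticSamAutomaticMaskGenerator(semantic_sam, level=levels_for("part"))
```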

:mosque: Data preparation

Please refer to the instructions to prepare SA-1B data. Let us know if you need more instructions about it.

:volcano: Model Zoo

The currently released checkpoints are only trained with SA-1B data.

<table><tbody>
<!-- START TABLE -->
<!-- TABLE HEADER -->
<tr>
<th valign="bottom">Name</th>
<th valign="bottom">Training Dataset</th>
<th valign="bottom">Backbone</th>
<th valign="bottom">1-IoU@Multi-Granularity</th>
<th valign="bottom">1-IoU@COCO(Max|Oracle)</th>
<th valign="bottom">download</th>
</tr>
<tr>
<td align="left">Semantic-SAM | <a href="configs/semantic_sam_only_sa-1b_swinT.yaml">config</a></td>
<td align="center">SA-1B</td>
<td align="center">SwinT</td>
<td align="center">88.1</td>
<td align="center">54.5|73.8</td>
<td align="center"><a href="https://github.com/UX-Decoder/Semantic-SAM/releases/download/checkpoint/swint_only_sam_many2many.pth">model</a></td>
</tr>
<tr>
<td align="left">Semantic-SAM | <a href="configs/semantic_sam_only_sa-1b_swinL.yaml">config</a></td>
<td align="center">SA-1B</td>
<td align="center">SwinL</td>
<td align="center">89.0</td>
<td align="center">55.1|74.1</td>
<td align="center"><a href="https://github.com/UX-Decoder/Semantic-SAM/releases/download/checkpoint/swinl_only_sam_many2many.pth">model</a></td>
</tr>
</tbody></table>
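
If you prefer to fetch a checkpoint from a script, something like the following should work. It is a minimal sketch using only the Python standard library; the URLs are the ones in the table above, while `CHECKPOINT_URLS`, `checkpoint_path`, `download_checkpoint`, and the `checkpoints` directory are hypothetical names of our own.

```python
import os
import urllib.request

# Download links copied from the Model Zoo table, keyed by model_type.
CHECKPOINT_URLS = {
    "T": "https://github.com/UX-Decoder/Semantic-SAM/releases/download/checkpoint/swint_only_sam_many2many.pth",
    "L": "https://github.com/UX-Decoder/Semantic-SAM/releases/download/checkpoint/swinl_only_sam_many2many.pth",
}


def checkpoint_path(model_type, ckpt_dir="checkpoints"):
    """Local path where the checkpoint for model_type ('T' or 'L') is stored."""
    url = CHECKPOINT_URLS[model_type]
    return os.path.join(ckpt_dir, os.path.basename(url))


def download_checkpoint(model_type, ckpt_dir="checkpoints"):
    """Download the checkpoint if it is not already present; return its path."""
    target = checkpoint_path(model_type, ckpt_dir)
    os.makedirs(ckpt_dir, exist_ok=True)
    if not os.path.exists(target):
        urllib.request.urlretrieve(CHECKPOINT_URLS[model_type], target)
    return target


# Example (performs a large download, so it is left commented out):
# ckpt = download_checkpoint("T")
```

The returned path can be passed as `ckpt=` to `build_semantic_sam` or as `--ckpt` to the demo scripts.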

:arrow_forward: Demo

For interactive segmentation.

python demo.py --ckpt /your/ckpt/path

For mask auto-generation.

python demo_auto_generation.py --ckpt /your/ckpt/path

:sunflower: Evaluation

We do zero-shot evaluation on COCO val2017. $n is the number of GPUs you use.

For SwinL backbone

python train_net.py --eval_only --resume --num-gpus $n --config-file configs/semantic_sam_only_sa-1b_swinL.yaml COCO.TEST.BATCH_SIZE_TOTAL=$n  MODEL.WEIGHTS=/path/to/weights

For SwinT backbone

python train_net.py --eval_only --resume --num-gpus $n --config-file configs/semantic_sam_only_sa-1b_swinT.yaml COCO.TEST.BATCH_SIZE_TOTAL=$n  MODEL.WEIGHTS=/path/to/weights
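
For readers unfamiliar with the 1-IoU (Max|Oracle) columns in the Model Zoo table, the sketch below illustrates one way to compute them for a single click. It rests on an assumption about the two settings: "Max" takes the IoU of the output mask with the highest confidence score, and "Oracle" takes the best IoU among all output masks. The function names are hypothetical and this is not the evaluation code of this repo; masks are plain nested lists of 0/1 to keep the example dependency-free.

```python
def mask_iou(pred, target):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0


def one_click_iou(pred_masks, scores, target):
    """Return (max_iou, oracle_iou) for the masks predicted from one click.

    Assumed definitions: max_iou is the IoU of the highest-scored mask,
    oracle_iou is the highest IoU over all predicted masks.
    """
    ious = [mask_iou(mask, target) for mask in pred_masks]
    best_scored = max(range(len(scores)), key=scores.__getitem__)
    return ious[best_scored], max(ious)


target = [[1, 1], [0, 0]]
pred_a = [[1, 0], [0, 0]]  # highest score, but covers only half the target
pred_b = [[1, 1], [0, 0]]  # lower score, matches the target exactly
print(one_click_iou([pred_a, pred_b], [0.9, 0.6], target))  # (0.5, 1.0)
```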

:star: Training

We currently release only the code for training on SA-1B. Complete training with semantics will be released later. $n is the number of GPUs you use. Before running the training code, you need to specify your SA-1B training data.

export SAM_DATASETS=/path/to/dataset
export SAM_SUBSET_START=$start
export SAM_SUBSET_END=$end

We convert the SA-1B data into 100 tsv files. start (int, 0-99) is the start of your SA-1B data index range and end (int, 0-99) is the end of that range. If you are not using the tsv data format, you can refer to this json registration for SAM.
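
When launching training from a script, it can help to validate the subset range before exporting it. The following is a minimal sketch; `sam_subset_env` is a hypothetical helper, and it only checks that both indices fall in 0-99 with start not larger than end. Whether the end index is treated as inclusive is not stated here, so the helper makes no claim about it.

```python
NUM_TSV_FILES = 100  # SA-1B is converted into 100 tsv files, indexed 0-99


def sam_subset_env(start, end, dataset_root):
    """Build the three environment variables used by the training code."""
    if not (0 <= start <= end <= NUM_TSV_FILES - 1):
        raise ValueError("expected 0 <= start <= end <= 99")
    return {
        "SAM_DATASETS": dataset_root,
        "SAM_SUBSET_START": str(start),
        "SAM_SUBSET_END": str(end),
    }


# Print shell export lines for a subset covering indices 0 to 9.
for key, value in sam_subset_env(0, 9, "/path/to/dataset").items():
    print(f"export {key}={value}")
```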

For SwinL backbone

python train_net.py --resume --num-gpus $n  --config-file configs/semantic_sam_only_sa-1b_swinL.yaml COCO.TEST.BATCH_SIZE_TOTAL=$n  SAM.TEST.BATCH_SIZE_TOTAL=$n  SAM.TRAIN.BATCH_SIZE_TOTAL=$n

For SwinT backbone

python train_net.py --resume --num-gpus $n  --config-file configs/semantic_sam_only_sa-1b_swinT.yaml COCO.TEST.BATCH_SIZE_TOTAL=$n  SAM.TEST.BATCH_SIZE_TOTAL=$n  SAM.TRAIN.BATCH_SIZE_TOTAL=$n

**We also support training to reproduce SAM**
```shell
python train_net.py --resume --num-gpus $n  --config-file configs/semantic_sam_reproduce_sam_swinL.yaml COCO.TEST.BATCH_SIZE_TOTAL=$n  SAM.TEST.BATCH_SIZE_TOTAL=$n  SAM.TRAIN.BATCH_SIZE_TOTAL=$n
```

This config uses a SwinL backbone. The only difference from the training scripts above is that it uses many-to-one matching and 3 prompts, as in SAM.

👀 Comparison with SAM and SA-1B Ground-truth

[Figure: compare_sam_v3]

(a) and (b) are the output masks of our model and SAM, respectively. The red points on the left-most image of each row are the user clicks. (c) shows the GT masks that contain the user clicks. The outputs of our model have been processed to remove duplicates.

:deciduous_tree: Learned prompt semantics

[Figure: levels]

We visualize the prediction of each content prompt embedding of a point, in a fixed order, for our model. We find that the output masks are consistently ordered from small to large, which indicates that each prompt embedding represents a semantic level. The red point in the first column is the click.

:sauropod: Method

[Figure: method_xyz]

:medal_military: Experiments

We also show that jointly training on SA-1B interactive segmentation and generic segmentation can improve generic segmentation performance.

We also outperform SAM on both mask quality and granularity completeness. Please refer to our paper for more experimental details.

<details open> <summary> <font size=8><strong>:bookmark_tabs: Todo list</strong></font> </summary> </details>

:hearts: Acknowledgement

Our model is related to Mask DINO and OpenSeeD. We also thank Segment Anything for the SA-1B data.

:black_nib: Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{li2023semantic,
  title={Semantic-SAM: Segment and Recognize Anything at Any Granularity},
  author={Li, Feng and Zhang, Hao and Sun, Peize and Zou, Xueyan and Liu, Shilong and Yang, Jianwei and Li, Chunyuan and Zhang, Lei and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2307.04767},
  year={2023}
}