

<p align="center"> <img src="./figures/SSA_title.png" alt="SSA Icon"/> </p>

Official repo, Web Demo

Semantic Segment Anything
Jiaqi Chen, Zeyu Yang, and Li Zhang
Zhang Vision Group, Fudan University

SAM is a powerful model for arbitrary object segmentation, and SA-1B is the largest segmentation dataset to date. However, SAM cannot predict a semantic category for each mask. (I) To address this limitation, we propose Semantic Segment Anything (SSA), a pipeline built on top of SAM that predicts a semantic category for each mask. (II) SSA can also serve as an automated dense open-vocabulary annotation engine, the Semantic Segment Anything labeling engine (SSA-engine), which provides rich semantic category annotations for SA-1B or any other dataset. This engine significantly reduces the need for manual annotation and its associated costs.

Web demo and API

🤔 Why do we need the SSA project?

👍 What can the SSA project do?

✈️ SSA: Semantic segment anything

Before the introduction of SAM, most semantic segmentation applications already had their own models. These models could classify regions into rough categories, but their predictions were blurry and imprecise at the edges, so they lacked accurate masks. To address this issue, we propose an open framework called SSA that leverages SAM to enhance the performance of existing models. Specifically, the original semantic segmentation model provides the category predictions while the powerful SAM provides the masks.

If you have already trained a semantic segmentation model on your dataset, you don't need to train a new SAM-based model to get more accurate segmentation. Instead, you can keep using the existing model as the Semantic branch. SAM's strong generalization and image segmentation abilities can improve the performance of the original model. Note that SSA is best suited to scenarios where the mask boundaries predicted by the original segmentor are not highly accurate. If the original model's segmentation is already very accurate, SSA may not provide a significant improvement.

SSA consists of two branches, Mask branch and Semantic branch, as well as a voting module that determines the category for each mask.
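The voting step can be illustrated with a small sketch. It assumes a simple majority vote over the Semantic branch's per-pixel class IDs inside each mask; the function names (`vote_mask_labels`, `paint_semantic_map`) and the tie-breaking and overlap handling are illustrative assumptions, not necessarily what the SSA scripts do.

```python
import numpy as np

def vote_mask_labels(masks, semantic_map, num_classes):
    """Give each boolean mask the most frequent class ID among its pixels.

    masks:        list of HxW boolean arrays (stand-in for Mask branch output)
    semantic_map: HxW integer array of class IDs (stand-in for Semantic branch output)
    """
    labels = []
    for mask in masks:
        pixels = semantic_map[mask]  # class IDs covered by this mask
        if pixels.size == 0:
            labels.append(-1)  # empty mask: nothing to vote on
            continue
        counts = np.bincount(pixels, minlength=num_classes)
        labels.append(int(counts.argmax()))  # ties resolve to the lowest class ID
    return labels

def paint_semantic_map(masks, labels, shape, ignore_index=255):
    """Build a refined semantic map by filling each mask with its voted class."""
    out = np.full(shape, ignore_index, dtype=np.int64)
    # Paint larger masks first so that smaller masks remain visible on top.
    order = sorted(range(len(masks)), key=lambda i: masks[i].sum(), reverse=True)
    for i in order:
        if labels[i] >= 0:
            out[masks[i]] = labels[i]
    return out

# Toy example: the coarse semantic map leaks class 1 into the left region.
semantic_map = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 2, 2],
])
mask_a = np.zeros((4, 4), dtype=bool); mask_a[0:3, 0:2] = True
mask_b = np.zeros((4, 4), dtype=bool); mask_b[0:3, 2:4] = True
mask_c = np.zeros((4, 4), dtype=bool); mask_c[3, :] = True
masks = [mask_a, mask_b, mask_c]

labels = vote_mask_labels(masks, semantic_map, num_classes=3)
refined = paint_semantic_map(masks, labels, semantic_map.shape)
print(labels)
print(refined)
```

In the toy example the stray class-1 pixel inside the left mask is outvoted, so the refined map follows the mask boundary instead of the coarse prediction.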

🚄 SSA-engine: Semantic segment anything labeling engine

SSA-engine is an automated annotation engine that provides initial semantic labels for the SA-1B dataset, although human review and refinement may still be required for more accurate labeling. Thanks to its combined architecture of close-set and open-vocabulary segmentation, SSA-engine produces satisfactory labels for most samples and can provide more detailed annotations using an image captioning method.

This tool fills the gap in SA-1B's limited fine-grained semantic labeling, while also significantly reducing the need for manual annotation and associated costs. It has the potential to serve as a foundation for training large-scale visual perception models and more fine-grained CLIP models.

The SSA-engine consists of three components:

📖 News

🔥 2023/04/14: SSA benchmarks semantic segmentation on ADE20K and Cityscapes.
🔥 2023/04/10: Semantic Segment Anything (SSA and SSA-engine) is released.
🔥 2023/04/05: SAM and SA-1B are released.


All results were tested on a single NVIDIA A6000 GPU.

1. Inference time

| Dataset | Model | Inference time per image (s) | Inference time per mask (s) |
| --- | --- | --- | --- |
| SA-1B | SSA (Close set) | 1.149 | 0.012 |
| SA-1B | SSA-engine (Open-vocabulary) | 33.333 | 0.334 |
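As a rough reading of these timings, the sketch below derives the implied number of masks per image and an images-per-hour estimate. The helper names are hypothetical, and the multi-GPU figure assumes perfect linear scaling, which is optimistic and not something the benchmark above measured.

```python
# Timings copied from the table above (seconds, single NVIDIA A6000).
TIMINGS = {
    "SSA (Close set)": {"per_image": 1.149, "per_mask": 0.012},
    "SSA-engine (Open-vocabulary)": {"per_image": 33.333, "per_mask": 0.334},
}

def implied_masks_per_image(per_image, per_mask):
    """Average number of masks per image implied by the two timings."""
    return per_image / per_mask

def images_per_hour(per_image, num_gpus=1):
    """Throughput estimate; assumes perfect linear scaling across GPUs."""
    return 3600.0 / per_image * num_gpus

for name, t in TIMINGS.items():
    masks = implied_masks_per_image(t["per_image"], t["per_mask"])
    single = images_per_hour(t["per_image"])
    eight = images_per_hour(t["per_image"], num_gpus=8)
    print(f"{name}: ~{masks:.0f} masks/image, "
          f"~{single:.0f} images/hour on 1 GPU, ~{eight:.0f} on 8 GPUs (idealized)")
```

Both rows imply roughly one hundred masks per image, so the per-mask cost, not the mask count, accounts for the gap between the two modes.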

2. Memory usage

SSA (with SAM)

| Dataset | Model | GPU Memory (MB) |
| --- | --- | --- |


| Dataset | Model | GPU Memory without SAM (MB) | GPU Memory with SAM (MB) |
| --- | --- | --- | --- |

3. Close-set semantic segmentation on the ADE20K and Cityscapes datasets

For convenience, we used several versions of SegFormer from Hugging Face (B0, B2, and B5), which differ in parameter count and accuracy, to simulate Semantic branches with less accurate masks. The results show that when the accuracy of the original Semantic branch is NOT very high, SSA can improve mIoU.


| Model | Semantic branch | mIoU of Semantic branch | mIoU of SSA |
| --- | --- | --- | --- |


| Model | Semantic branch | mIoU of Semantic branch | mIoU of SSA |
| --- | --- | --- | --- |

Note that all SegFormer checkpoints and data pipelines are sourced from Hugging Face (released by NVIDIA), and they show lower mIoU than those in the official repository.
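For readers unfamiliar with the metric, here is a minimal sketch of mIoU computed from a confusion matrix. It follows the standard definition and assumes predictions contain only valid class IDs and that ground-truth pixels equal to `ignore_index` are skipped; `scripts/evaluation.py` may differ in details.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = gt != ignore_index  # drop unlabeled ground-truth pixels
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    """Average per-class IoU over classes that appear in pred or gt."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp  # TP + FP + FN
    present = union > 0
    return float((tp[present] / union[present]).mean())

# Toy 2x2 image with two classes and one ignored pixel.
gt = np.array([[0, 0], [1, 255]])
pred = np.array([[0, 1], [1, 1]])
conf = confusion_matrix(pred, gt, num_classes=2)
print(conf)
print(mean_iou(conf))
```

In practice the confusion matrix is accumulated over the whole validation set before the IoU is taken, rather than averaging per-image scores.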

4. Cross-domain segmentation on Foggy Driving

We also evaluate the performance of SSA on the Foggy Driving dataset, with OneFormer as the Semantic branch. The weights and data pipeline of OneFormer are sourced from Hugging Face.

| Model | Training dataset | Validation dataset | mIoU |
| --- | --- | --- | --- |
| SSA | Cityscapes | Foggy Driving | 55.61 |


Open-vocabulary prediction on SA-1B

Close-set semantic segmentation on Cityscapes

Close-set semantic segmentation on ADE20K

Cross-domain segmentation on Foggy Driving

💻 Requirements

🛠️ Installation

git clone git@github.com:fudan-zvg/Semantic-Segment-Anything.git
cd Semantic-Segment-Anything
conda env create -f environment.yaml
conda activate ssa
python -m spacy download en_core_web_sm
# install segment-anything
cd ..
git clone git@github.com:facebookresearch/segment-anything.git
cd segment-anything; pip install -e .; cd ../Semantic-Segment-Anything

🚀 Quick Start

1. SSA

1.1 Preparation

Download the ADE20K or Cityscapes dataset and unzip it into the data folder.

Folder structure:

├── Semantic-Segment-Anything
├── data
│   ├── ade
│   │   ├── ADEChallengeData2016
│   │   │   ├── images
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   │   │   ├── ADE_val_00002000.jpg
│   │   │   │   │   ├── ...
│   │   │   │   ├── test
│   │   │   ├── annotations
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   │   │   ├── ADE_val_00002000.png
│   │   │   │   │   ├── ...
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   │   │   ├── frankfurt
│   │   │   │   ├── lindau
│   │   │   │   ├── munster
│   │   │   │   │   ├── munster_000173_000019_leftImg8bit.png
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   │   │   │   ├── frankfurt
│   │   │   │   ├── lindau
│   │   │   │   ├── munster
│   │   │   │   │   ├── munster_000173_000019_gtFine_labelTrainIds.png
│   │   ├── ...
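The layout above pairs every image with a ground-truth file by naming convention. The helpers below are a hypothetical sketch (not part of the repository) that derives the expected ground-truth path from an image path, which can help verify that a dataset was unpacked correctly. The default roots mirror the tree shown here and are assumptions.

```python
from pathlib import Path

def cityscapes_gt_path(image_path, data_root="data/cityscapes"):
    """Map .../leftImg8bit/<split>/<city>/<stem>_leftImg8bit.png to its gtFine label."""
    image_path = Path(image_path)
    city = image_path.parent.name
    split = image_path.parent.parent.name
    stem = image_path.name.replace("_leftImg8bit.png", "")
    return Path(data_root) / "gtFine" / split / city / f"{stem}_gtFine_labelTrainIds.png"

def ade20k_gt_path(image_path, data_root="data/ade/ADEChallengeData2016"):
    """Map .../images/<split>/<name>.jpg to .../annotations/<split>/<name>.png."""
    image_path = Path(image_path)
    split = image_path.parent.name
    return Path(data_root) / "annotations" / split / f"{image_path.stem}.png"

city_img = "data/cityscapes/leftImg8bit/val/munster/munster_000173_000019_leftImg8bit.png"
ade_img = "data/ade/ADEChallengeData2016/images/validation/ADE_val_00002000.jpg"
print(cityscapes_gt_path(city_img).as_posix())
print(ade20k_gt_path(ade_img).as_posix())
```

Calling `.is_file()` on the returned paths for every image is a quick way to spot missing annotations before starting a long inference run.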

Download the SAM checkpoint and put it in the ckp folder.

mkdir ckp && cd ckp
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
cd ..
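To check that the checkpoint loads outside the SSA scripts, a minimal sketch using the `segment_anything` package installed earlier might look as follows. `sam_model_registry` and `SamAutomaticMaskGenerator` come from that package; the wrapper name `build_mask_generator` and its defaults are assumptions for illustration, and the SSA scripts may configure SAM differently.

```python
from pathlib import Path

def build_mask_generator(ckpt_path="ckp/sam_vit_h_4b8939.pth", device="cuda"):
    """Load the ViT-H SAM checkpoint and wrap it in an automatic mask generator."""
    ckpt = Path(ckpt_path)
    if not ckpt.is_file():
        raise FileNotFoundError(f"SAM checkpoint not found: {ckpt}")
    # Imported lazily so the path check above works even without the package.
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint=str(ckpt))
    sam.to(device=device)
    return SamAutomaticMaskGenerator(sam)
```

With a generator in hand, `generator.generate(image)` takes an HxWx3 uint8 RGB array and returns a list of mask dictionaries with keys such as `segmentation`, `area`, `bbox`, `predicted_iou`, and `stability_score`.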

1.2 SSA inference

Run our SSA on ADE20K with 8 GPUs:

python scripts/main_ssa.py --ckpt_path ./ckp/sam_vit_h_4b8939.pth --save_img --world_size 8 --dataset ade20k --data_dir data/ade20k/ADEChallengeData2016/images/validation/ --gt_path data/ade20k/ADEChallengeData2016/annotations/validation/ --out_dir output_ade20k

Run our SSA on Cityscapes with 8 GPUs:

python scripts/main_ssa.py --ckpt_path ./ckp/sam_vit_h_4b8939.pth --save_img --world_size 8 --dataset cityscapes --data_dir data/cityscapes/leftImg8bit/val/ --gt_path data/cityscapes/gtFine/val/ --out_dir output_cityscapes

Run our SSA on Foggy Driving with 8 GPUs:

python scripts/main_ssa.py --data_dir data/Foggy_Driving/leftImg8bit/test/ --ckpt_path ckp/sam_vit_h_4b8939.pth --out_dir output_foggy_driving --save_img --world_size 8 --dataset foggy_driving --eval --gt_path data/Foggy_Driving/gtFine/test/ --model oneformer

1.3 SSA evaluation (after inference)

Get the evaluation results for ADE20K:

python scripts/evaluation.py --gt_path data/ade20k/ADEChallengeData2016/annotations/validation --result_path output_ade20k/ --dataset ade20k

Get the evaluation results for Cityscapes:

python scripts/evaluation.py --gt_path data/cityscapes/gtFine/val/ --result_path output_cityscapes/ --dataset cityscapes

Get the evaluation results for Foggy Driving:

# If you haven't downloaded the Foggy Driving dataset yet, run the following command to download and unzip it.
wget -P data https://data.vision.ee.ethz.ch/csakarid/shared/SFSU_synthetic/Downloads/Foggy_Driving.zip && unzip data/Foggy_Driving.zip -d data/

python scripts/evaluation.py --gt_path data/Foggy_Driving/gtFine/test/ --result_path output_foggy_driving/ --dataset foggy_driving

2. SSA-engine

Automatic annotation for your own dataset

Organize your dataset as follows:

├── Semantic-Segment-Anything
├── data
│   ├── <The name of your dataset>
│   │   ├── img_name_1.jpg
│   │   ├── img_name_2.jpg
│   │   ├── ...

Run our SSA-engine-base with 8 GPUs (the GPU memory required depends on the size of the input images):

python scripts/main_ssa_engine.py --data_dir=data/<The name of your dataset> --out_dir=output --world_size=8 --save_img --sam --ckpt_path=ckp/sam_vit_h_4b8939.pth

To run SSA-engine-small instead, add the --light_mode flag:

python scripts/main_ssa_engine.py --data_dir=data/<The name of your dataset> --out_dir=output --world_size=8 --save_img --sam --ckpt_path=ckp/sam_vit_h_4b8939.pth --light_mode

Automatic annotation for SA-1B

Download the SA-1B dataset and unzip it to the data/sa_1b folder.
Alternatively, you can use your own dataset.

Folder structure:

├── Semantic-Segment-Anything
├── data
│   ├── sa_1b
│   │   ├── sa_223775.jpg
│   │   ├── sa_223775.json
│   │   ├── ...

Run our SSA-engine-base with 8 GPUs:

python scripts/main_ssa_engine.py --data_dir=data/sa_1b --out_dir=output --world_size=8 --save_img

Run the SSA-engine-small with 8 GPUs (add the --light_mode flag):

python scripts/main_ssa_engine.py --data_dir=data/sa_1b --out_dir=output --world_size=8 --save_img --light_mode

For each mask, we add two new fields (e.g. 'class_name': 'face' and 'class_proposals': ['face', 'person', 'sun glasses']). The class name is the most likely category for the mask, and the class proposals are the top-k most likely categories from the Class proposal filter. k is set to 3 by default.

{
    'bbox': [81, 21, 434, 666],
    'area': 128047,
    'segmentation': {
        'size': [1500, 2250],
        'counts': 'kYg38l[18oeN8mY14aeN5\\Z1>'
    },
    'predicted_iou': 0.9704002737998962,
    'point_coords': [[474.71875, 597.3125]],
    'crop_box': [0, 0, 1381, 1006],
    'id': 1229599471,
    'stability_score': 0.9598413705825806,
    'class_name': 'face',
    'class_proposals': ['face', 'person', 'sun glasses']
}
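The relationship between the two new fields can be sketched as a top-k selection. The `class_scores` dictionary and the `label_mask` helper below are hypothetical; how SSA-engine actually scores candidate classes is not described here, so treat this only as an illustration of how `class_name` and `class_proposals` relate.

```python
def label_mask(annotation, class_scores, k=3):
    """Return a copy of a mask annotation with the two semantic fields added.

    class_scores: hypothetical mapping from candidate class name to a score,
                  where a higher score means a more likely class.
    """
    ranked = sorted(class_scores.items(), key=lambda item: item[1], reverse=True)
    proposals = [name for name, _ in ranked[:k]]  # top-k candidates
    labeled = dict(annotation)  # leave the input annotation untouched
    labeled["class_name"] = proposals[0] if proposals else None  # top-1 candidate
    labeled["class_proposals"] = proposals
    return labeled

annotation = {"id": 1229599471, "area": 128047}
scores = {"person": 0.20, "face": 0.70, "hat": 0.02, "sun glasses": 0.08}
labeled = label_mask(annotation, scores)
print(labeled["class_name"])
print(labeled["class_proposals"])
```

Keeping the full proposal list alongside the top-1 label lets a human reviewer correct a wrong `class_name` by picking from a short list instead of relabeling from scratch.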

📈 Future work

We hope that excellent researchers in the community can come up with new improvements and ideas to do more work based on SSA. Some of our ideas are as follows:

😄 Acknowledgement

📜 Citation

If you find this work useful for your research, please cite our GitHub repo:

@misc{chen2023semantic,
    title = {Semantic Segment Anything},
    author = {Chen, Jiaqi and Yang, Zeyu and Zhang, Li},
    howpublished = {\url{https://github.com/fudan-zvg/Semantic-Segment-Anything}},
    year = {2023}
}