SOS: Segment Object System for Open-World Instance Segmentation With Object Priors

Christian Wilms, Tim Rolff, Maris Hillemann, Robert Johanson, Simone Frintrop

This repository contains the code of our ECCV'24 paper SOS: Segment Object System for Open-World Instance Segmentation With Object Priors, including the SOS system and the study on object-focused SAM. For the results and pre-trained models, check the tables below.

[Paper], [Supplementary Material], [Video]

The Segment Object System (SOS) is an open-world instance segmentation system capable of segmenting arbitrary objects in scenes. It utilizes rich pre-trained DINO self-attention maps as object priors to roughly localize unannotated objects in a training dataset. Subsequently, it applies the modern Segment Anything Model (SAM) to produce pseudo annotations from these rough localizations. Finally, a vanilla Mask R-CNN system is trained on original and pseudo annotations to provide strong generalization ability to unannotated objects. Note that a key difference to vanilla SAM is SOS's focus on objects rather than on all coherent regions.

Object segmentation results of SAM and SOS

Overall, SOS produces new state-of-the-art results on several open-world instance segmentation setups, showing strong generalization from annotated objects in training to unannotated objects during testing.

OWIS results of Mask R-CNN, GGN, and SOS

Installation

First, clone this repository with the --recursive option

git clone --recursive https://github.com/chwilms/SOS.git
cd SOS
git config -f .gitmodules submodule.SOS_SAM.branch main
git config -f .gitmodules submodule.SOS_MASKRCNN.branch main
git submodule update --recursive --remote

Depending on which parts of SOS are needed, different installation requirements exist. If only the final Mask R-CNN in SOS is trained or tested with pre-trained weights, follow the installation instructions in the linked detectron2 repo. If SOS's Pseudo Annotation Creator is of interest, install the linked SAM repo and the requirements.txt in this repo. The packages in requirements.txt are also needed to generate the prompts from the object priors; note that only generate_CAM_prompts.py needs GPU support as well as torch, torchvision, and captum. Additionally, further repositories are needed to create object priors like Contour, VOCUS2, DeepGaze, or DINO.

Usage

The entire SOS pipeline consists of five steps. Given the intermediate results for most steps (see below), it's possible to start at an almost arbitrary step. The pipeline starts with the object prior generation (step 1), followed by the creation of the object-focused point prompts (step 2), known as the Object Localization Module in the paper. Next are the segment generation with SAM based on the prompts (step 3) and the filtering of the segments and creation of the final annotations combining pseudo annotations and original annotations (step 4). The final step is to train and test Mask R-CNN based on the new annotations or pre-trained weights (step 5).

In the subsequent steps, we assume the COCO dataset with VOC classes as original annotations and enrich these annotations with pseudo annotations based on the DINO object prior.

Step 1: Object Priors

Generate the respective object priors, e.g., Contour, VOCUS2, DeepGaze, or DINO. For the object priors Dist, Spx, and CAM, the step is part of the prompt generation scripts (see step 2). For the Grid prior, directly move to step 3.

The result of this step is the object prior per training image in a suitable format.
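
For the DINO prior, for instance, the expected result is one self-attention map per training image. Below is a minimal, hedged sketch of how such maps can be extracted with the official DINO ViT-S/8 model from torch.hub; the fixed input resolution, the .npy output format, and the directory layout are assumptions and should be checked against what generate_DINO_prompts.py expects.

```python
# Sketch: extract DINO ViT-S/8 self-attention maps, one .npy file per training image.
# The 480x480 input size and the .npy output format are assumptions, not the exact
# preprocessing of this repo; check generate_DINO_prompts.py for the expected format.
import os

import numpy as np
import torch
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.hub.load("facebookresearch/dino:main", "dino_vits8").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((480, 480)),  # side lengths divisible by the patch size (8)
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

image_dir, out_dir = "/data/SOS/coco/train2017", "/data/SOS/dino_attMaps"
os.makedirs(out_dir, exist_ok=True)

for name in os.listdir(image_dir):
    if not name.endswith(".jpg"):
        continue
    img = Image.open(os.path.join(image_dir, name)).convert("RGB")
    x = preprocess(img).unsqueeze(0).to(device)
    with torch.no_grad():
        # attention of the last transformer block: (1, heads, tokens, tokens)
        att = model.get_last_selfattention(x)
    n_heads, grid = att.shape[1], 480 // 8
    # keep the [CLS] token's attention over all image patches, per head
    cls_att = att[0, :, 0, 1:].reshape(n_heads, grid, grid).cpu().numpy()
    np.save(os.path.join(out_dir, os.path.splitext(name)[0] + ".npy"), cls_att)
```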

Step 2: Prompts

Once the object priors are generated, pick the respective script from the ./prompt_generation directory and set the parameters accordingly. For instance, to generate the DINO-based prompts with default parameters given DINO self-attention maps in /data/SOS/dino_attMaps and the images in /data/SOS/coco/train2017, call

python generate_DINO_prompts.py /data/SOS/coco/train2017 /data/SOS/dino_attMaps /data/SOS/prompts/prompts_DINO.json

The result of this step is a file with the object-focused point prompts based on a given object prior.
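
Purely as an illustration of the idea behind the Object Localization Module, the sketch below thresholds a 2D prior map and uses the centroids of the resulting regions as point prompts. The relative threshold, the scipy dependency, and the JSON layout are assumptions; the scripts in ./prompt_generation are the actual implementation and may select points differently.

```python
# Illustration only: derive object-focused point prompts from a 2D object prior map
# by thresholding it and taking the centroid of every connected high-prior region.
import json

import numpy as np
from scipy import ndimage

def prior_to_point_prompts(prior_map, rel_threshold=0.6):
    """Return a list of [x, y] point prompts for one image from a 2D prior map."""
    prior = prior_map - prior_map.min()
    prior = prior / (prior.max() + 1e-8)
    labels, num_regions = ndimage.label(prior > rel_threshold)
    centroids = ndimage.center_of_mass(prior, labels, range(1, num_regions + 1))
    # note: these coordinates live in the prior map's resolution and still need to
    # be rescaled to the image resolution before prompting SAM
    return [[float(col), float(row)] for row, col in centroids]

# example: average the per-head DINO attention maps into a single prior map
att = np.load("/data/SOS/dino_attMaps/000000000009.npy")  # assumed shape: (heads, h, w)
prompts = {"000000000009.jpg": prior_to_point_prompts(att.mean(axis=0))}
with open("/data/SOS/prompts/prompts_example.json", "w") as f:
    json.dump(prompts, f)
```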

Step 3: Segments

Based on the generated prompts, apply SAM from the linked sub-repo to the training images by calling applySAM.py with an appropriate SAM checkpoint to generate the output segments.

python applySAM.py /data/SOS/coco/train2017 /data/SOS/SAM_checkpoints/sam_vit_h_4b8939.pth /data/SOS/prompts/prompts_DINO.json /data/SOS/segments/segments_DINO.json

If the Grid object prior is used, directly call applySAM_grid.py without providing a prompt file

python applySAM_grid.py /data/SOS/coco/train2017 /data/SOS/SAM_checkpoints/sam_vit_h_4b8939.pth /data/SOS/segments/segments_Grid.json

The result of this step is a file with the object segments based on a given object prior.
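
For reference, the core of this step boils down to the official segment-anything API. The following minimal sketch segments a single, hypothetical point prompt in one image; applySAM.py iterates over all training images and prompts and writes the segments to a JSON file.

```python
# Sketch of the core SAM call: one image, one foreground point prompt.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](
    checkpoint="/data/SOS/SAM_checkpoints/sam_vit_h_4b8939.pth"
)
sam.to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(
    cv2.imread("/data/SOS/coco/train2017/000000000009.jpg"), cv2.COLOR_BGR2RGB
)
predictor.set_image(image)

# hypothetical point prompt (x, y) in image coordinates, taken from the prompt file
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320.0, 240.0]]),
    point_labels=np.array([1]),  # 1 = foreground point
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # boolean HxW mask of the most confident proposal
```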

Step 4: Annotations

Given the segments generated by SAM and the original annotations of the known classes, e.g., the VOC classes from the COCO train2017 dataset, this step creates the merged annotations by filtering the segments, yielding pseudo annotations. Call combineAnnotations.py with paths to the original annotations, the SAM segments, and the output annotation file, as well as optional parameters

python combineAnnotations.py /data/SOS/coco/annotations/instances_train2017_voc.json /data/SOS/segments/segments_DINO.json /data/SOS/coco/annotations/instances_train2017_voc_SOS_DINO.json

The result of this step is a file with the merged annotations.
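
As a rough sketch of the idea behind combineAnnotations.py (not its exact logic): SAM segments that strongly overlap an already annotated object are discarded, and the remaining ones are appended as pseudo annotations. The layout of the segments file, the box-level overlap check, the 0.5 IoU cutoff, and the category handling below are all assumptions.

```python
# Simplified merging sketch: drop SAM segments that overlap original annotations and
# append the rest as pseudo annotations to a COCO-style annotation file.
import json

def box_iou(a, b):
    """IoU of two COCO-style boxes [x, y, w, h]."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

with open("/data/SOS/coco/annotations/instances_train2017_voc.json") as f:
    coco = json.load(f)
with open("/data/SOS/segments/segments_DINO.json") as f:  # assumed: COCO-style dicts
    segments = json.load(f)

boxes_per_image = {}
for ann in coco["annotations"]:
    boxes_per_image.setdefault(ann["image_id"], []).append(ann["bbox"])

next_id = max(a["id"] for a in coco["annotations"]) + 1
for seg in segments:
    known_boxes = boxes_per_image.get(seg["image_id"], [])
    if all(box_iou(seg["bbox"], b) < 0.5 for b in known_boxes):
        # category_id is irrelevant for class-agnostic training; 1 is a placeholder
        coco["annotations"].append({**seg, "id": next_id, "category_id": 1, "iscrowd": 0})
        next_id += 1

with open("/data/SOS/coco/annotations/instances_train2017_voc_SOS_DINO.json", "w") as f:
    json.dump(coco, f)
```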

Step 5: Training/Testing Mask R-CNN

Using the merged annotations, this step trains a class-agnostic Mask R-CNN, resulting in the final SOS open-world instance segmentation system. To train Mask R-CNN in a class-agnostic manner on the merged annotations in /data/SOS/coco/annotations/instances_train2017_voc_SOS_DINO.json, use the linked detectron2 sub-repo: first provide the base directory of the data, then call the training script.

export DETECTRON2_DATASETS=/data/SOS/
./tools/train_net.py --config-file ./configs/COCO-OpenWorldInstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml --num-gpus 8

If necessary, change the annotation file in Mask R-CNN's configuration file or through the command line call. As described in the sub-repo's readme, annotation files following the above naming convention will be registered automatically.
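
If an annotation file with a custom name is not picked up automatically, it can also be registered manually via detectron2's standard API; the dataset name below is only an example.

```python
# Manually register a merged annotation file as a COCO-style dataset in detectron2.
# The dataset name "coco_2017_train_voc_sos_dino" is an example, not a fixed convention.
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    "coco_2017_train_voc_sos_dino",
    {},
    "/data/SOS/coco/annotations/instances_train2017_voc_SOS_DINO.json",
    "/data/SOS/coco/train2017",
)
```

The registered name can then be referenced as DATASETS.TRAIN in the configuration file or on the command line.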

To test SOS's Mask R-CNN with pre-trained weights, call the training script with the --eval-only flag and a respective file with the pre-trained weights. Note that this will default to a test on the COCO val2017 dataset. For evaluation in this setup (cross-category, see paper), we use the code provided by Saito et al.

./tools/train_net.py --config-file ./configs/COCO-OpenWorldInstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml --num-gpus 8 --eval-only MODEL.WEIGHTS /data/SOS/maskrcnn_weights/SOS_DINO_coco_voc.pth

Prompts, Models, and Results

This section provides the intermediate results of SOS and our object prior study, including pre-trained models for the final Mask R-CNN system in SOS based on pseudo annotations and original annotations.
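
In all tables below, F$_1$ denotes the harmonic mean of AP and AR:

$$F_1 = \frac{2 \cdot \mathrm{AP} \cdot \mathrm{AR}}{\mathrm{AP} + \mathrm{AR}}$$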

Study Results

The training dataset is the COCO train2017 dataset with original annotations for the VOC classes (20 classes); the test dataset is the COCO val2017 dataset with original annotations of the non-VOC classes (60 classes). Note that we only use 1/4 of the full training schedule for Mask R-CNN here.

| Object Prior | AP | AR | F$_1$ | Downloads |
|---|---|---|---|---|
| SOS+Grid | 3.8 | 36.5 | 6.9 | final merged annotations, pre-trained Mask R-CNN, OWIS detections |
| SOS+Dist | 3.4 | 27.4 | 6.0 | prompts, final merged annotations, pre-trained Mask R-CNN, OWIS detections |
| SOS+Spx | 5.6 | 34.8 | 9.6 | prompts, final merged annotations, pre-trained Mask R-CNN, OWIS detections |
| SOS+Contour | 5.6 | 36.6 | 9.7 | prompts, final merged annotations, pre-trained Mask R-CNN, OWIS detections |
| SOS+VOCUS2 | 6.1 | 37.7 | 10.5 | prompts, final merged annotations, pre-trained Mask R-CNN, OWIS detections |
| SOS+DeepGaze | 5.4 | 35.9 | 9.4 | prompts, final merged annotations, pre-trained Mask R-CNN, OWIS detections |
| SOS+CAM | 5.4 | 36.7 | 9.4 | prompts, final merged annotations, pre-trained Mask R-CNN, OWIS detections |
| SOS+DINO | 8.9 | 38.1 | 14.4 | prompts, final merged annotations, pre-trained Mask R-CNN, OWIS detections |
| SOS+U-Net | 7.3 | 37.3 | 12.2 | prompts, final merged annotations, pre-trained Mask R-CNN, OWIS detections |

OWIS: Cross-category COCO (VOC) -> COCO (non-VOC)

The training dataset is the COCO train2017 dataset with original annotations for the VOC classes (20 classes); the test dataset is the COCO val2017 dataset with original annotations of the non-VOC classes (60 classes).

| Method | AP | AR | F$_1$ | Downloads |
|---|---|---|---|---|
| Mask R-CNN | 1.0 | 8.2 | 1.8 | code |
| SAM | 3.6 | 48.1 | 6.7 | code |
| OLN | 4.2 | 28.4 | 7.3 | code |
| LDET | 4.3 | 24.8 | 7.3 | code |
| GGN | 4.9 | 28.3 | 8.4 | code |
| SWORD | 4.8 | 30.2 | 8.3 | |
| UDOS | 2.9 | 34.3 | 5.3 | code |
| SOS (ours) | 8.9 | 29.3 | 14.5 | final merged annotations, pre-trained Mask R-CNN, OWIS detections |

OWIS: Cross-dataset COCO -> LVIS

The training dataset is the COCO train2017 dataset with all original annotations (80 classes); the test dataset is the LVIS validation dataset with all original annotations.

| Method | AP | AR | F$_1$ | Downloads |
|---|---|---|---|---|
| Mask R-CNN | 7.5 | 23.6 | 11.4 | code |
| SAM | 6.8 | 45.1 | 11.8 | code |
| LDET | 6.7 | 24.8 | 10.5 | code |
| GGN | 6.5 | 27.0 | 10.5 | code |
| SOIS | - | 25.2 | - | |
| OpenInst | - | 29.3 | - | |
| UDOS | 3.9 | 24.9 | 6.7 | code |
| SOS (ours) | 8.1 | 33.3 | 13.3 | final merged annotations, pre-trained Mask R-CNN, OWIS detections |

OWIS: Cross-dataset COCO -> ADE20k

The training dataset is the COCO train2017 dataset with all original annotations (80 classes); the test dataset is the ADE20k validation dataset with all original annotations.

| Method | AP | AR | F$_1$ | Downloads |
|---|---|---|---|---|
| Mask R-CNN | 6.9 | 11.9 | 8.7 | code, OWIS detections |
| OLN | - | 20.4 | - | code |
| LDET | 9.5 | 18.5 | 12.6 | code |
| GGN | 9.7 | 21.0 | 13.3 | code |
| UDOS | 7.6 | 22.9 | 11.4 | code |
| SOS (ours) | 12.5 | 26.5 | 17.0 | final merged annotations, pre-trained Mask R-CNN, OWIS detections |

OWIS: Cross-dataset COCO -> UVO

The training dataset is the COCO train2017 dataset with all original annotations (80 classes); the test dataset is the UVO sparse dataset with all original annotations.

| Method | AP | AR | F$_1$ | Downloads |
|---|---|---|---|---|
| Mask R-CNN | 20.7 | 36.7 | 26.5 | code |
| SAM | 11.3 | 50.1 | 18.4 | code |
| OLN | - | 41.4 | - | code |
| LDET | 22.0 | 40.4 | 28.5 | code |
| GGN | 20.3 | 43.4 | 27.7 | code |
| UDOS | 10.6 | 43.1 | 17.0 | code |
| SOS (ours) | 20.9 | 42.3 | 28.0 | final merged annotations, pre-trained Mask R-CNN, OWIS detections |

Cite SOS

If you use SOS or the study on object priors for focusing prompts in SAM, please cite our paper:

@inproceedings{WilmsEtAlECCV2024,
  title = {{SOS}: Segment Object System for Open-World Instance Segmentation With Object Priors},
  author = {Christian Wilms and Tim Rolff and Maris Hillemann and Robert Johanson and Simone Frintrop},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2024}
}