SPINO: Few-Shot Panoptic Segmentation With Foundation Models
arXiv | IEEE Xplore | Website | Video
<p align="center"> <img src="./assets/spino_overview.png" alt="Overview of SPINO approach" width="800" /> </p>
This repository is the official implementation of the paper:
Few-Shot Panoptic Segmentation With Foundation Models
Markus Käppeler*, Kürsat Petek*, Niclas Vödisch*, Wolfram Burgard, and Abhinav Valada. <br> *Equal contribution. <br>
IEEE International Conference on Robotics and Automation (ICRA), 2024
If you find our work useful, please consider citing our paper:
@inproceedings{kaeppeler2024spino,
title={Few-Shot Panoptic Segmentation With Foundation Models},
author={Käppeler, Markus and Petek, Kürsat and Vödisch, Niclas and Burgard, Wolfram and Valada, Abhinav},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2024},
pages={7718-7724}
}
📔 Abstract
Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain, posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, albeit being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments.
👩💻 Code
🏗 Setup
⚙️ Installation
- Create conda environment:
conda create --name spino python=3.8
- Activate environment:
conda activate spino
- Install dependencies:
pip install -r requirements.txt
- Install torch, torchvision and cuda:
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
- Compile deformable attention:
cd panoptic_segmentation_model/external/ms_deformable_attention && sh make.sh
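As a quick sanity check (run inside the activated spino environment), you can confirm that the CUDA 11.1 builds of torch and torchvision were picked up:
```bash
# Print the installed torch/torchvision versions and whether CUDA is available
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"
```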
💻 Development
- Install pre-commit githook scripts:
pre-commit install
- Upgrade isort to 5.12.0:
pip install isort==5.12.0
- Update pre-commit:
pre-commit autoupdate
- Linter (pylint) and formatter (yapf, isort) settings can be configured in pyproject.toml.
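To check the entire code base at once (not only the staged files), the hooks can also be run manually:
```bash
# Run all configured hooks (pylint, yapf, isort) over the whole repository
pre-commit run --all-files
```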
🏃 Running the Code
🎨 Pseudo-label generation
To generate pseudo-labels for the Cityscapes dataset, please set the path to the dataset in the configuration files (see list below).
Then execute run_cityscapes.sh from the root of the panoptic_label_generator folder.
This script will perform the following steps:
- Train the semantic segmentation module using the configuration file configs/semantic_cityscapes.yaml.
- Train the boundary estimation module using the configuration file configs/boundary_cityscapes.yaml.
- Generate the panoptic pseudo-labels using the configuration file configs/instance_cityscapes.yaml.
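Assuming the dataset paths in the three configuration files above have been set, the whole pipeline reduces to:
```bash
# Train the semantic and boundary modules, then generate the panoptic pseudo-labels
cd panoptic_label_generator
sh run_cityscapes.sh
```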
We also support the KITTI-360 dataset. To generate pseudo-labels for KITTI-360, please adapt the corresponding configuration files.
Instead of training the modules from scratch, you can also use the pretrained weights provided at these links:
- Cityscapes: https://drive.google.com/file/d/1FjJYpkEO9enpsahevD8PMn3nP_O0sNnT/view?usp=sharing
- KITTI-360: https://drive.google.com/file/d/1Eod444VoRLKw6dOeDSLuvfUQlJ5FAwM_/view?usp=sharing
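As an alternative to downloading via the browser, the weights can also be fetched from the command line, e.g. with the gdown utility (not part of requirements.txt; shown here only as a convenience sketch):
```bash
# Optional: fetch the pretrained weights from the command line
pip install gdown
gdown "https://drive.google.com/uc?id=1FjJYpkEO9enpsahevD8PMn3nP_O0sNnT"   # Cityscapes
gdown "https://drive.google.com/uc?id=1Eod444VoRLKw6dOeDSLuvfUQlJ5FAwM_"   # KITTI-360
```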
🧠 Panoptic segmentation model
To train a panoptic segmentation model on a given dataset, e.g., the generated pseudo-labels, execute train.sh.
Before running the code, specify all settings:
- python_env: Set the name of the conda environment (e.g. "spino")
- alias_python: Set the path of the python binary to be used
- WANDB_API_KEY: Set the wandb API key of your account
- CUDA_VISIBLE_DEVICES: Specifies the device IDs of the available GPUs
- Set all remaining arguments:
- nproc_per_node: Number of processes per node (usually one node = one GPU server); this should equal the number of devices specified in CUDA_VISIBLE_DEVICES
- master_addr: IP address of GPU server to run the code on
- master_port: Port to be used for server access
- run_name: Name of the current run. A folder with this name will be created containing all generated files (pretrained weights, config file, etc.), and the name will also appear on wandb
- project_root_dir: Path to where the folder with the run name will be created
- mode: Mode of the training, can be "train" or "eval"
- resume: If specified, the training will be resumed from the specified checkpoint
- pre_train: Only load the specified modules from the checkpoint
- freeze_modules: Freeze the specified modules during training
- filename_defaults_config: Filename of the default configuration file with all configuration parameters
- filename_config: Filename of the configuration file that acts relative to the default configuration file
- comment: Arbitrary comment string attached to the run
- seed: Seed to initialize "torch", "random", and "numpy"
- Set available flags:
- eval: Only evaluate the model specified by resume
- debug: Start the training in debug mode
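A sketch of how these settings typically come together; the script name (train.py), the defaults file name, and the exact launcher invocation are assumptions here, so adapt them to the provided train.sh:
```bash
# Illustrative fragment only; adapt names and paths to the actual train.sh
export WANDB_API_KEY="<your wandb API key>"
export CUDA_VISIBLE_DEVICES="0,1"

# Path of the python binary of the "spino" conda environment
alias_python="/path/to/conda/envs/spino/bin/python"

# nproc_per_node must match the number of GPUs listed in CUDA_VISIBLE_DEVICES
"$alias_python" -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=127.0.0.1 \
    --master_port=29500 \
    train.py \
    --run_name=spino_panoptic \
    --project_root_dir=/path/to/experiments \
    --mode=train \
    --filename_defaults_config=defaults.yaml \
    --filename_config=train_cityscapes_dino_adapter.yaml \
    --comment="trained on SPINO pseudo-labels" \
    --seed=42
```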
Additionally,
- ensure that the dataset path is set correctly in the corresponding config file, e.g., train_cityscapes_dino_adapter.yaml.
- set the entity and project parameters for wandb.init(...) in misc/train_utils.py.
💾 Datasets
Cityscapes
Download the following files:
- leftImg8bit_sequence_trainvaltest.zip (324GB)
- gtFine_trainvaltest.zip (241MB)
- camera_trainvaltest.zip (2MB)
After extraction, one should obtain the following file structure:
── cityscapes
├── camera
│ └── ...
├── gtFine
│ └── ...
└── leftImg8bit_sequence
└── ...
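This layout can be obtained, for example, by extracting the three archives into an empty cityscapes directory:
```bash
# Extract the downloaded Cityscapes archives in place
cd cityscapes
unzip leftImg8bit_sequence_trainvaltest.zip
unzip gtFine_trainvaltest.zip
unzip camera_trainvaltest.zip
```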
KITTI-360
Download the following files:
- Perspective Images for Train & Val (128G): You can remove "01" in line 12 of download_2d_perspective.sh to only download the relevant images.
- Test Semantic (1.5G)
- Semantics (1.8G)
- Calibrations (3K)
After extraction and copying of the perspective images, one should obtain the following file structure:
── kitti_360
├── calibration
│ ├── calib_cam_to_pose.txt
│ └── ...
├── data_2d_raw
│ ├── 2013_05_28_drive_0000_sync
│ └── ...
├── data_2d_semantics
│ └── train
│ ├── 2013_05_28_drive_0000_sync
│ └── ...
└── data_2d_test
├── 2013_05_28_drive_0008_sync
└── 2013_05_28_drive_0018_sync
👩⚖️ License
For academic usage, the code is released under the GPLv3 license. For any commercial purpose, please contact the authors.
🙏 Acknowledgment
This work was funded by the German Research Foundation (DFG) Emmy Noether Program grant No 468878300 and the European Union’s Horizon 2020 research and innovation program grant No 871449-OpenDR. <br><br>
<p float="left"> <a href="https://www.dfg.de/en/research_funding/programmes/individual/emmy_noether/index.html"><img src="./assets/dfg_logo.png" alt="DFG logo" height="100"/></a> <a href="https://opendr.eu/"><img src="./assets/opendr_logo.png" alt="OpenDR logo" height="100"/></a> </p>