<div align='center'> <!-- Paper Title --> <h1><strong> (ECCV 2024) ProLab</strong>: <strong>Pro</strong>perty-level <strong>Lab</strong>el Space</h1> <h3><a href="https://arxiv.org/abs/2312.13764">A Semantic Space is Worth 256 Language Descriptions: <br>Make Stronger Segmentation Models with Descriptive Properties</a></h3> <!-- Authors --> <p> <a href="https://lambert-x.github.io/">Junfei Xiao</a><sup>1</sup>, <a href="https://zzzqzhou.github.io/">Ziqi Zhou</a><sup>2</sup>, <a href="https://scholar.google.com/citations?user=tpNZM2YAAAAJ&hl=en">Wenxuan Li</a><sup>1</sup>, <a href="https://voidrank.github.io/">Shiyi Lan</a><sup>3</sup>, <a href="https://meijieru.com/">Jieru Mei</a><sup>1</sup>, <a href="https://chrisding.github.io/">Zhiding Yu</a><sup>3</sup>, <br> <a href="https://bzhao.me/">Bingchen Zhao</a><sup>4</sup>, <a href="https://www.cs.jhu.edu/~ayuille/">Alan Yuille</a><sup>1</sup>, <a href="https://yuyinzhou.github.io/">Yuyin Zhou</a><sup>2</sup>, <a href="https://cihangxie.github.io/">Cihang Xie</a><sup>2</sup> </p> <!-- Institutions --> <p> <sup>1</sup><a href="https://www.jhu.edu/">Johns Hopkins University</a>, <sup>2</sup><a href="https://www.ucsc.edu/">UCSC</a>, <sup>3</sup><a href="https://www.nvidia.com/">NVIDIA</a>, <sup>4</sup><a href="https://www.ed.ac.uk/">University of Edinburgh</a> </p> <!-- Teaser Image --> <img src="images/github_teaser.png" alt="Teaser Image"> </div> <div align="center">Paper | Property-level Label Space | Model Zoo | Training & Evaluation
</div>

News
- [07/07/24] 🔥 ProLab: Property-level Label Space has been accepted to ECCV 2024. The camera-ready version of the paper is coming in the next 1-2 weeks. Stay tuned.
- [12/21/23] 🔥 ProLab: Property-level Label Space is released. We propose retrieving descriptive properties grounded in common-sense knowledge to build a property-level label space, which yields strong, interpretable segmentation models. Please check out the paper.
Method
Emergent Generalization Ability
ProLab models show an emergent ability to generalize to out-of-domain categories and even unknown categories.
Getting Started
Our segmentation code is developed on top of MMSegmentation and ViT-Adapter.
Setup
We have tested two environments: torch 1.9 + CUDA 11.1 + MMSegmentation v0.20.2, and torch 1.13.1 + CUDA 11.7 + MMSegmentation v0.27.0.
Environment 1 (torch 1.9 + CUDA 11.1 + MMSegmentation v0.20.2)
conda create -n prolab python=3.8
conda activate prolab
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.2 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install timm==0.4.12
pip install mmdet==2.22.0 # for Mask2Former
pip install mmsegmentation==0.20.2
pip install -r requirements.txt
cd ops && sh make.sh # compile deformable attention
Environment 2 (torch 1.13.1 + CUDA 11.7 + MMSegmentation v0.27.0)
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install mmcv-full==1.7.0 -f https://download.openmmlab.com/mmcv/dist/cu117/torch1.13.0/index.html
pip install timm==0.4.12
pip install mmdet==2.22.0 # may require relaxing mmdet's pinned mmcv version
pip install mmsegmentation==0.27.0
pip install -r requirements.txt
cd ops && sh make.sh # compile deformable attention
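After either setup, a quick sanity check helps confirm that PyTorch, CUDA, and MMSegmentation are wired together correctly. The snippet below is a minimal sketch (not part of this repo); the expected versions depend on which environment you installed:

```python
# Minimal environment sanity check (illustrative; not part of this repo).
import torch
import mmcv
import mmseg

print("torch:", torch.__version__)   # expect 1.9.0+cu111 or 1.13.1+cu117
print("mmcv:", mmcv.__version__)     # expect 1.4.2 or 1.7.0
print("mmseg:", mmseg.__version__)   # expect 0.20.2 or 0.27.0
print("CUDA available:", torch.cuda.is_available())
```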
Data Preparation
ADE20K/Cityscapes/COCO Stuff/Pascal Context
Please follow the data preparation guidelines in MMSegmentation to download ADE20K, Cityscapes, COCO Stuff, and Pascal Context.
BDD
Please visit the official website to download the BDD dataset.
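Once downloaded, the datasets should sit where the configs expect them. The sketch below checks MMSegmentation's default layout; these paths may differ from this repo's configs (the BDD location in particular depends on your config), so adjust the `data_root` values accordingly:

```python
# Verify the dataset layout (paths follow MMSegmentation's defaults; adjust
# to match the data_root values in this repo's configs).
from pathlib import Path

expected = [
    "data/ade/ADEChallengeData2016",  # ADE20K
    "data/cityscapes",                # Cityscapes
    "data/coco_stuff164k",            # COCO Stuff
    "data/VOCdevkit/VOC2010",         # Pascal Context
]
for d in expected:
    print("ok" if Path(d).is_dir() else "MISSING", d)
```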
Property-level Label Space
Descriptive Properties and Clustered Embeddings (Ready-to-use)
We provide the retrieved descriptive properties (generated with GPT-3.5) and the property-level labels (clustered language embeddings).
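As a rough sketch of how the ready-to-use embeddings can be consumed (the file name below is a hypothetical placeholder; substitute the actual download):

```python
import torch

# Hypothetical file name -- replace with the provided embedding file.
embeddings = torch.load("property_embeddings/ade20k_bge_base.pth", map_location="cpu")
# Each category is represented by its property-level embedding rather than a
# one-hot class index; inspect the loaded object to see the exact layout.
print(type(embeddings))
```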
Descriptive Properties Retrieval (Optional)
We provide generate_descrtiptions.ipynb, which uses GPT-3.5 (via API) and Llama-2 (deployed locally) to retrieve descriptive properties.
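For reference, a minimal sketch of the GPT-3.5 retrieval step is below. It uses the pre-1.0 `openai` package interface, and the prompt wording is illustrative rather than the exact prompt used in the notebook:

```python
# pip install "openai<1.0"; reads OPENAI_API_KEY from the environment.
import openai

def retrieve_properties(category):
    """Ask GPT-3.5 for descriptive visual properties of a category (illustrative prompt)."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"List common visual properties of a '{category}' in a scene, "
                       "one short phrase per line.",
        }],
    )
    text = response["choices"][0]["message"]["content"]
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

print(retrieve_properties("traffic light"))
```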
Encode Descriptions into Embeddings (Optional)
We also provide generate_embeddings.ipynb to encode and cluster the descriptive properties into embeddings step by step, using Sentence Transformer (huggingface, paper) and BAAI-BGE models (huggingface, paper).
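A condensed sketch of this encode-and-cluster step is shown below. The model name `BAAI/bge-base-en-v1.5` and the tiny description pool are assumptions for illustration; the notebook's full pipeline clusters the descriptions of all categories (the paper's title points to 256 clusters for a full label space):

```python
# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Toy pool of retrieved property descriptions; the real pipeline flattens the
# descriptions of every category in the dataset.
descriptions = [
    "has four wheels", "made of metal", "used for transportation",
    "has green leaves", "grows in soil", "found outdoors",
    "emits light", "mounted on a pole",
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # assumed BGE variant
embeddings = model.encode(descriptions, normalize_embeddings=True)

# Cluster the description embeddings; a small K suits this toy pool
# (a full label space would use a much larger K, e.g. 256).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(embeddings)
for cluster_id, desc in zip(kmeans.labels_, descriptions):
    print(cluster_id, desc)
```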
Model Zoo
ADE20K
Framework | Backbone | Pretrain | Lr schd | Crop Size | mIoU | Config | Checkpoint |
---|---|---|---|---|---|---|---|
UperNet | ViT-Adapter-B | DeiT-B | 320K | 512 | 49.0 | config | Google Drive |
UperNet | ViT-Adapter-L | BEiT-L | 160k | 640 | 58.2 | config | Google Drive |
UperNet | ViT-Adapter-L | BEiTv2-L | 80K | 896 | 58.7 | config | Google Drive |
COCO-Stuff-164K
Framework | Backbone | Pretrain | Lr schd | Crop Size | mIoU | Config | Checkpoint |
---|---|---|---|---|---|---|---|
UperNet | ViT-Adapter-B | DeiT-B | 160K | 512 | 45.4 | config | Google Drive |
Pascal Context
Framework | Backbone | Pretrain | Lr schd | Crop Size | mIoU | Config | Checkpoint |
---|---|---|---|---|---|---|---|
UperNet | ViT-Adapter-B | DeiT-B | 160K | 512 | 58.2 | config | Google Drive |
Cityscapes
Framework | Backbone | Pretrain | Lr schd | Crop Size | mIoU | Config | Checkpoint |
---|---|---|---|---|---|---|---|
UperNet | ViT-Adapter-B | DeiT-B | 160K | 768 | 81.4 | config | Google Drive |
BDD
Framework | Backbone | Pretrain | Lr schd | Crop Size | mIoU | Config | Checkpoint |
---|---|---|---|---|---|---|---|
UperNet | ViT-Adapter-B | DeiT-B | 160K | 768 | 65.7 | config | Google Drive |
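To try a downloaded checkpoint on a single image, MMSegmentation's 0.x inference API can be used as sketched below. The checkpoint and image paths are placeholders, and depending on the repo layout the custom ViT-Adapter modules may need to be imported first so their ops are registered:

```python
# Single-image inference sketch (checkpoint and image paths are placeholders).
from mmseg.apis import init_segmentor, inference_segmentor

config = "configs/ADE20K/upernet_deit_adapter_base_512_320k_ade20k_bge_base.py"
checkpoint = "checkpoints/ade20k_vit_adapter_b.pth"  # hypothetical file name

model = init_segmentor(config, checkpoint, device="cuda:0")
result = inference_segmentor(model, "demo.jpg")  # list with one per-pixel label map
print(result[0].shape)
```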
Training & Evaluation
Training
The following example script trains ViT-Adapter-B + UperNet on ADE20K on a single node with 8 GPUs:
sh dist_train.sh configs/ADE20K/upernet_deit_adapter_base_512_320k_ade20k_bge_base.py 8
Evaluation
The following example script evaluates ViT-Adapter-B + UperNet on the COCO-Stuff val split on a single node with 8 GPUs:
sh dist_test.sh configs/COCO_Stuff/upernet_deit_adapter_base_512_160k_coco_stuff_bge_base.py 8 --eval mIoU
Citation
If you find this paper useful for your work, please cite:
@article{xiao2023semantic,
author = {Xiao, Junfei and Zhou, Ziqi and Li, Wenxuan and Lan, Shiyi and Mei, Jieru and Yu, Zhiding and Yuille, Alan and Zhou, Yuyin and Xie, Cihang},
title = {A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties},
journal = {arXiv preprint arXiv:2312.13764},
year = {2023},
}
Acknowledgement
GPT-3.5 and Llama-2 are used for retrieving descriptive properties.
Sentence Transformer and BAAI-BGE are used as description embedding models.
MMSegmentation and ViT-Adapter are used as the segmentation codebase.
Many thanks to all these great projects.