<div align="center"> <h1> In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation </h1> </div> <div align="center"> <h3><a href=http://dahyun-kang.github.io>Dahyun Kang</a> &nbsp;&nbsp;&nbsp;&nbsp; <a href=http://cvlab.postech.ac.kr/~mcho/>Minsu Cho</a></h3> </div> <br /> <div align="center"> <a href="https://arxiv.org/abs/2408.04961"><img src="https://img.shields.io/badge/arXiv-2408.04961-b31b1b.svg"/></a> <a href="http://cvlab.postech.ac.kr/research/lazygrounding"><img src="https://img.shields.io/static/v1?label=project homepage&message=LaVG&color=9cf"/></a> </div> <br /> <div align="center"> <img src="datasets/main.png" alt="result"/> </div>

This repo is the official implementation of the ECCV 2024 paper *In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation*.

Conda installation command

conda env create -f environment.yml --prefix $YOURPREFIX

$YOURPREFIX is typically /home/$USER/anaconda3.
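
For example, a concrete invocation, assuming Anaconda is installed at /home/$USER/anaconda3 and the environment is created under its envs/ directory (both the path and the environment name lavg are placeholders; substitute your own):

    # Create the environment at an explicit prefix and activate it by path
    conda env create -f environment.yml --prefix /home/$USER/anaconda3/envs/lavg
    conda activate /home/$USER/anaconda3/envs/lavg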

Dependencies

This repo is built on CLIP, SCLIP, and MMSegmentation.

mim install mmcv==2.0.1 mmengine==0.8.4 mmsegmentation==1.1.1
pip install ftfy regex yapf==0.40.1
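
A quick, optional sanity check that the pinned MM* packages import correctly (just an import test, not part of the official setup):

    # Print the installed versions of mmcv, mmengine, and mmsegmentation
    python -c "import mmcv, mmengine, mmseg; print(mmcv.__version__, mmengine.__version__, mmseg.__version__)"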

Dataset preparation

Please prepare Pascal VOC 2012, Pascal Context, COCO-Stuff 164K, COCO-Object, ADEChallengeData2016, and Cityscapes following the MMSeg data preparation guide. The COCO-Object dataset can be converted from COCO-Stuff 164K by running the following command:

python datasets/cvt_coco_object.py PATH_TO_COCO_STUFF164K -o PATH_TO_COCO164K
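
For instance, assuming COCO-Stuff 164K is already extracted under $yourdatasetroot/coco_stuff164k (both paths below are placeholders; pick whatever output location matches your setup):

    # Convert COCO-Stuff164k annotations into COCO-Object annotations (output path is an example)
    python datasets/cvt_coco_object.py $yourdatasetroot/coco_stuff164k -o $yourdatasetroot/coco_stuff164k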

Place them under the $yourdatasetroot/ directory such that:

    $yourdatasetroot/
    ├── ADEChallengeData2016/
    │   ├── annotations/
    │   ├── images/
    │   ├── ...
    ├── VOC2012/
    │   ├── Annotations/
    │   ├── JPEGImages/
    │   ├── ...
    ├── coco_stuff164k/
    │   ├── annotations/
    │   ├── images/
    │   ├── ...
    ├── Cityscapes/
    │   ├── gtFine/
    │   ├── leftImg8bit/
    │   ├── ...
    ├── ...
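
Optionally, a quick check that the top-level dataset folders are where the configs expect them (the root variable is a placeholder):

    # Confirm the expected dataset roots exist
    ls $yourdatasetroot
    ls $yourdatasetroot/VOC2012 $yourdatasetroot/coco_stuff164k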

1) Panoptic Cut for unsupervised object mask discovery

cd panoptic_cut
python predict.py \
    --logs panoptic_cut \
    --dataset {coco_object, coco_stuff, ade20k, voc21, voc20, context60, context59, cityscapes} \
    --datasetroot $yourdatasetroot
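
As a concrete example, a single run on Pascal VOC with the background class (benchmark id voc21), assuming the datasets are prepared under $yourdatasetroot as above; the resulting per-image predictions should end up under pred/panoptic_cut/ as in the layout below:

    # Discover class-agnostic object masks on Pascal VOC (voc21)
    cd panoptic_cut
    python predict.py --logs panoptic_cut --dataset voc21 --datasetroot $yourdatasetroot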

The precomputed mask predictions from the panoptic mask discovery stage can be downloaded from the Google Drive links below:

| Mask prediction root after stage 1 | Benchmark id | Google Drive link |
|---|---|---|
| coco_stuff164k | coco_object, coco_stuff164k | link to download (84.5 MB) |
| VOC2012 | context59, context60, voc20, voc21 | link to download (66.7 MB) |
| ADEChallengeData2016 | ade20k | link to download (29.4 MB) |
| Cityscapes | cityscapes | link to download (23.1 MB) |

Place them under the lavg/panoptic_cut/pred/ directory such that:

    lavg/panoptic_cut/pred/panoptic_cut/
    ├── ADEChallengeData2016/
    │   ├── ADE_val_00000001.pth
    │   ├── ADE_val_00000002.pth
    │   ├── ...
    ├── VOC2012/
    │   ├── 2007_000033.pth
    │   ├── 2007_000042.pth
    │   ├── ...
    ├── coco_stuff164k/
    │   ├── 000000000139.pth
    │   ├── 000000000285.pth
    │   ├── ...
    ├── Cityscapes/
    │   ├── frankfurt_000000_000294_leftImg8bit.pth
    │   ├── ...
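
A small optional sanity check that the per-image prediction files are in place; the directory names follow the layout above:

    # Count stage-1 mask prediction files per benchmark root
    for d in lavg/panoptic_cut/pred/panoptic_cut/*/; do
        echo "$d: $(ls "$d" | wc -l) files"
    done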

2) Visual grounding & Segmentation evaluation

Update $yourdatasetroot in configs/cfg_*.py, then run:

cd lavg
python eval.py --config ./configs/{cfg_context59/cfg_context60/cfg_voc20/cfg_voc21}.py --maskpred_root VOC2012/panoptic_cut
python eval.py --config ./configs/cfg_ade20k.py --maskpred_root ADEChallengeData2016/panoptic_cut
python eval.py --config ./configs/{cfg_coco_object/cfg_coco_stuff164k}.py --maskpred_root coco_stuff164k/panoptic_cut
python eval.py --config ./configs/cfg_city_scapes.py --maskpred_root Cityscapes/panoptic_cut

The run is single-GPU compatible.
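
For example, to sweep the four VOC/Context benchmarks that share the VOC2012 mask predictions (the config names and the --maskpred_root value are taken from the commands above):

    # Evaluate LaVG on VOC20, VOC21, Context59, and Context60 with the stage-1 masks
    cd lavg
    for cfg in cfg_voc20 cfg_voc21 cfg_context59 cfg_context60; do
        python eval.py --config ./configs/${cfg}.py --maskpred_root VOC2012/panoptic_cut
    done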

Quantitative performance (mIoU, %) on open-vocabulary semantic segmentation benchmarks

<table> <tr> <td></td> <td colspan="3" align="center"><strong>With background category</strong></td> <td colspan="5" align="center"><strong>Without background category</strong></td> </tr> <tr> <td align="left">Method</td> <td align="center">VOC21</td> <td align="center">Context60</td> <td align="center">COCO-obj</td> <td align="center">VOC20</td> <td align="center">Context59</td> <td align="center">ADE</td> <td align="center">COCO-stuff</td> <td align="center">Cityscapes</td> </tr> <tr> <td align="left">LaVG</td> <td align="center">62.1</td> <td align="center">31.6</td> <td align="center">34.2</td> <td align="center">82.5</td> <td align="center">34.7</td> <td align="center">15.8</td> <td align="center">23.2</td> <td align="center">26.2</td> </tr> </table>

Related repos

Our project refers to and heavily borrows code from CLIP, SCLIP, and MMSegmentation.

Acknowledgements

This work was supported by Samsung Electronics (IO201208-07822-01), the NRF grant (NRF-2021R1A2C3012728 (45%)), and the IITP grants (RS-2022-II220959: Few-Shot Learning of Causal Inference in Vision and Language for Decision Making (50%), RS-2019-II191906: AI Graduate School Program at POSTECH (5%)) funded by the Ministry of Science and ICT, Korea. We also thank Sua Choi for her helpful discussion.

BibTeX source

If you find our code or paper useful, please consider citing our paper:

@inproceedings{kang2024lazy,
  title={In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation},
  author={Kang, Dahyun and Cho, Minsu},
  booktitle={European Conference on Computer Vision},
  year={2024}
}