<div align="center"> <h1> MAFT+ (ECCV 2024 oral) </h1> <h3>Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation</h3>

Siyu Jiao<sup>1,2</sup>, Hongguang Zhu<sup>1,2</sup>, Jiannan Huang<sup>1,3</sup>, Yao Zhao<sup>1,2</sup>, Yunchao Wei<sup>1,2</sup>, Humphrey Shi<sup>3,4</sup>

<sup>1</sup> Beijing Jiaotong University, <sup>2</sup> Pengcheng Lab, <sup>3</sup> Georgia Institute of Technology, <sup>4</sup> Picsart AI Research (PAIR)

[Paper]


</div> <div align="center"> <img src="resources/vis1.gif" width="48%"> <img src="resources/vis2.gif" width="48%"> </div>

Introduction

This work is an enhanced version of our NeurIPS paper MAFT.
Pre-trained vision-language models, e.g., CLIP, have been increasingly used to tackle the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions either freeze CLIP during training to unilaterally preserve its zero-shot capability, or fine-tune the CLIP vision encoder to gain perceptual sensitivity to local regions; few of them incorporate vision-text collaborative optimization. Motivated by this, we propose Content-Dependent Transfer, which adaptively enhances each text embedding by interacting with the input image, providing a parameter-efficient way to optimize the text representation. In addition, we introduce a Representation Compensation strategy that reviews the original CLIP-V representation as compensation to maintain the zero-shot capability of CLIP. In this way, the vision and text representations of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimization mechanism within the OVS field. Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, it outperforms previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU on A-847, A-150, PC-459, PC-59 and PAS-20, respectively. Furthermore, in the panoptic setting on ADE20K, we achieve 27.1 PQ, 73.5 SQ and 32.9 RQ.

Installation

  1. Clone the repository
    git clone https://github.com/jiaosiyu1999/MAFT_Plus.git
    
  2. Navigate to the project directory
    cd MAFT_Plus
    
  3. Install the dependencies (an optional environment check is sketched after these steps)
    bash install.sh
    cd maft/modeling/pixel_decoder/ops
    sh make.sh
    
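After installation, you can optionally sanity-check the environment. The snippet below is a minimal sketch: it assumes install.sh set up PyTorch and Detectron2, and that make.sh built the deformable-attention extension (the module name is assumed from the Mask2Former-style ops build; adjust it if yours differs).

    # check that PyTorch and Detectron2 import, and that CUDA is visible
    python -c "import torch, detectron2; print('torch', torch.__version__, '| cuda', torch.cuda.is_available())"
    # check the compiled deformable-attention op (module name assumed from the ops/ build)
    python -c "import MultiScaleDeformableAttention; print('MSDeformAttn extension OK')"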

<span id="2"></span>

Data Preparation

See MAFT for reference (Preparing Datasets for MAFT). The data should be organized as follows (a quick layout check is sketched after the tree):

datasets/
  ade/
    ADEChallengeData2016/
      images/
      annotations_detectron2/
    ADE20K_2021_17_01/
      images/
      annotations_detectron2/
  coco/
    train2017/
    val2017/
    stuffthingmaps_detectron2/
  VOCdevkit/
    VOC2012/
      images_detectron2/
      annotations_ovs/
    VOC2010/
      images/
      annotations_detectron2_ovs/
        pc59_val/
        pc459_val/
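
Before training, you can verify that the expected folders exist. This is a minimal sketch assuming the tree above is rooted at ./datasets; it only checks for directories and does not validate file contents.

    # report any missing dataset directory from the layout above
    for d in \
        datasets/ade/ADEChallengeData2016/annotations_detectron2 \
        datasets/ade/ADE20K_2021_17_01/annotations_detectron2 \
        datasets/coco/stuffthingmaps_detectron2 \
        datasets/VOCdevkit/VOC2012/annotations_ovs \
        datasets/VOCdevkit/VOC2010/annotations_detectron2_ovs/pc59_val \
        datasets/VOCdevkit/VOC2010/annotations_detectron2_ovs/pc459_val; do
        [ -d "$d" ] && echo "ok      $d" || echo "MISSING $d"
    done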

<span id="3"></span>

Usage

<span id="5"></span>

    # MAFT-Plus-Large (maftp-l)
    python train_net.py --config-file configs/semantic/train_semantic_large.yaml  --num-gpus 8

    # MAFT-Plus-Base (maftp-b)
    python train_net.py --config-file configs/semantic/train_semantic_base.yaml  --num-gpus 8
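
To evaluate a trained model, the sketch below reuses the training config and assumes the standard Detectron2 `--eval-only` / `MODEL.WEIGHTS` convention inherited from Mask2Former; the checkpoint path is a placeholder.

    # evaluate MAFT-Plus-Large with an existing checkpoint (path is a placeholder)
    python train_net.py --config-file configs/semantic/train_semantic_large.yaml \
        --num-gpus 8 --eval-only MODEL.WEIGHTS /path/to/maftp_l.pth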

<span id="6"></span>

Cite

If this codebase is useful to you, please consider citing:

@inproceedings{jiao2024collaborative,
  title={Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation},
  author={Jiao, Siyu and Zhu, Hongguang and Huang, Jiannan and Zhao, Yao and Wei, Yunchao and Shi, Humphrey},
  booktitle={European Conference on Computer Vision},
  year={2024},
}

Acknowledgement

Mask2Former

FC-CLIP