ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

:fire::fire:[CVPR 2024] The official implementation of the paper "ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions"

:fire::fire: | Paper | ViT-CoMer interpretation on Zhihu (Chinese) | Third-party ViT-CoMer interpretation on a WeChat official account (Chinese)

<div align=center> <img title='vit-comer' src="img/vit_comer.jpg" width = 95% > </div>

The overall architecture of ViT-CoMer. ViT-CoMer is a two-branch architecture consisting of three components: (a) a plain ViT with L layers, evenly divided into N stages for feature interaction; (b) a CNN branch that employs the proposed Multi-Receptive Field Feature Pyramid (MRFP) module to provide multi-scale spatial features; and (c) a simple and efficient CNN-Transformer Bidirectional Fusion Interaction (CTI) module that integrates the features of the two branches at different stages, enhancing semantic information.
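
To make the interaction pattern concrete, here is a minimal, hypothetical PyTorch sketch of the two-branch forward pass: a plain ViT split into N stages, a CNN branch producing multi-scale features, and one CTI fusion per stage. The class and argument names below are illustrative assumptions, not the repository's actual modules.

```python
import torch.nn as nn


class ViTCoMerSketch(nn.Module):
    """Schematic two-branch forward pass; MRFP/CTI internals are omitted."""

    def __init__(self, vit_blocks, mrfp, cti_blocks, num_stages=4):
        super().__init__()
        assert len(vit_blocks) % num_stages == 0
        self.num_stages = num_stages
        self.vit_blocks = nn.ModuleList(vit_blocks)    # (a) L plain ViT layers
        self.mrfp = mrfp                               # (b) CNN branch (MRFP)
        self.cti_blocks = nn.ModuleList(cti_blocks)    # (c) one CTI fusion per stage

    def forward(self, vit_tokens, image):
        cnn_feats = self.mrfp(image)                   # multi-scale spatial features
        per_stage = len(self.vit_blocks) // self.num_stages
        for s in range(self.num_stages):
            # run one stage of plain ViT layers
            for blk in self.vit_blocks[s * per_stage:(s + 1) * per_stage]:
                vit_tokens = blk(vit_tokens)
            # bidirectional fusion between ViT tokens and the CNN pyramid
            vit_tokens, cnn_feats = self.cti_blocks[s](vit_tokens, cnn_feats)
        return vit_tokens, cnn_feats
```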

Highlights

Introduction

We present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid multi-receptive field convolutional features into the ViT architecture, which effectively alleviates the problems of limited local information interaction and single-feature representation in ViT. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features, which is beneficial for handling dense prediction tasks.

<div align=center> <img src="img/vit-comer-0.jpg" width = 45%> </div>
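As a rough illustration of the multi-receptive-field idea, the sketch below applies parallel depthwise convolutions with different kernel sizes and fuses them with a pointwise convolution. It is a simplified assumption-based example; the actual MRFP design is the one specified in the paper and the released code.

```python
import torch
import torch.nn as nn


class MultiReceptiveFieldBlock(nn.Module):
    """Parallel depthwise convs with different kernel sizes, fused pointwise."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels, channels, 1)  # mix channels across branches

    def forward(self, x):
        # Sum features from each receptive field, then fuse; keep a residual path.
        out = sum(branch(x) for branch in self.branches)
        return self.fuse(out) + x


if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 32)
    print(MultiReceptiveFieldBlock(64)(feat).shape)  # torch.Size([1, 64, 32, 32])
```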

Main Results

Comparisons with different backbones and frameworks. Under similar model sizes, ViT-CoMer outperforms other backbones on two typical dense prediction tasks: COCO object detection and instance segmentation.

<div align=center> <img src="img/exp_0.jpg" width = 70% > </div>

Comparisons with the state of the art. We conduct experiments based on Co-DETR, using ViT-CoMer as the backbone and initializing the model with multi-modal BEiTv2 pre-training. As shown in Table 4, our approach outperforms existing SOTA algorithms on COCO val2017 without extra training data, which strongly demonstrates the effectiveness of ViT-CoMer.

<div align=center> <img src="img/sota.jpg" width = 50% > </div>

For segmentation, we conduct experiments based on Mask2Former, using ViT-CoMer as the backbone and initializing the model with multi-modal BEiTv2 pre-training. As shown in Table 7, our method achieves comparable performance to SOTA methods on ADE20K with fewer parameters.

<div align=center> <img src="img/seg_sota.jpg" width = 50% > </div>
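
Both experiments above follow the same recipe: plug ViT-CoMer in as the detector's or segmenter's backbone and initialize it from BEiTv2 weights. The fragment below is a schematic, mmdetection-style config illustrating that recipe; the backbone type name, its arguments, and the checkpoint path are placeholders, so please refer to the repository's configs for the real keys and values.

```python
# Schematic, hypothetical config fragment (mmdetection-style). All field names
# and values below are illustrative assumptions, not the repository's actual config.
model = dict(
    backbone=dict(
        type='ViTCoMer',               # assumed registry name of the backbone
        embed_dim=1024,                # ViT-Large-like width (illustrative)
        depth=24,                      # L ViT layers, split into N interaction stages
        init_cfg=dict(
            type='Pretrained',
            checkpoint='path/to/beitv2_pretrained.pth',  # placeholder path
        ),
    ),
    # detector / segmenter head settings (e.g. Co-DETR, Mask2Former) omitted for brevity
)
```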

News

Quick Start

Citation

If you find ViT-CoMer useful in your research, please consider giving a star ⭐ and citing:

@inproceedings{xia2024vit,
  title={ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions},
  author={Xia, Chunlong and Wang, Xinliang and Lv, Feng and Hao, Xin and Shi, Yifeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5493--5502},
  year={2024}
}

Acknowledgements

Many thanks to the following codebases, which helped us a lot in building this repository:

Contact

If you have any questions while using ViT-CoMer, or would like to discuss implementation details with us, please open an issue or contact us directly via email at xiachunlong@baidu.com. We will reply as soon as possible.