

M3I Pre-training





This repository is the official implementation of the CVPR 2023 paper Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information.

By Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao Huang, Yu Qiao, Xiaogang Wang, Jie Zhou, Jifeng Dai.

Code will be released.


Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training), initially described on arXiv, is a simple yet effective one-stage pre-training paradigm. It integrates existing pre-training methods (supervised, weakly-supervised, and self-supervised pre-training) under a unified mutual information perspective and maintains all of their desired properties within a single pre-training stage. Notably, we successfully pre-train a 1B-parameter model (InternImage-H) with M3I Pre-training and achieve new records of 65.4 mAP on COCO detection test-dev, 62.5 mAP on LVIS detection minival, and 62.9 mIoU on ADE20k.

<p align="center"> <img src="./figs/fig1-comparison.png" alt="m3i pre-training" width="600"/> </p> <!-- ## Main Results **Results of InternImage-H** | Method | Model | #param | ImageNet | COCO | LVIS | ADE20k | |:---------------:|:-------------:|:------:|:--------:|:----:|:----:|:------:| | M3I Pre-training| InternImage-H | 1B | | | | | **Results of ViT-B/16** -->
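For readers unfamiliar with mutual information objectives, the sketch below shows one widely used way to estimate a lower bound on the mutual information between two sets of paired embeddings: the InfoNCE loss from contrastive learning. This is a generic illustration only and is not the M3I Pre-training objective or code from this repository. The function names `info_nce` and `mi_lower_bound`, the temperature value, and the use of NumPy are all assumptions made for the example.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss for paired embeddings (hypothetical helper).

    Row i of z_a is assumed to be the positive match for row i of z_b;
    every other row in the batch acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # (N, N) matrix of temperature-scaled similarities.
    logits = z_a @ z_b.T / temperature
    # Subtract the row-wise max for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)

    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


def mi_lower_bound(z_a, z_b, temperature=0.1):
    """log(N) - InfoNCE, a lower bound (in nats) on the mutual information."""
    return np.log(len(z_a)) - info_nce(z_a, z_b, temperature)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
aligned = mi_lower_bound(x, x)                            # identical views
unrelated = mi_lower_bound(x, rng.normal(size=(8, 16)))   # independent views
print(f"aligned: {aligned:.3f}  unrelated: {unrelated:.3f}")
```

Minimizing `info_nce` pushes the bound up, and the bound can never exceed `log(N)` for a batch of `N` pairs, which is one reason contrastive methods tend to benefit from large batches.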


If this work is helpful for your research, please consider citing the following BibTeX entry.

    author    = {Su, Weijie and Zhu, Xizhou and Tao, Chenxin and Lu, Lewei and Li, Bin and Huang, Gao and Qiao, Yu and Wang, Xiaogang and Zhou, Jie and Dai, Jifeng},
    title     = {Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {15888-15899}