MultiMAE: Multi-modal Multi-task Masked Autoencoders

Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir

Website | arXiv | BibTeX

Open in Colab | Hugging Face Spaces

Official PyTorch implementation and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders.

<p align="center"> <img src="assets/multimae_fig.png" width=100% height=100% class="center"> </p>

We introduce Multi-modal Multi-task Masked Autoencoders (MultiMAE), an efficient and effective pre-training strategy for Vision Transformers. Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions. Once pre-trained, a single MultiMAE encoder can be used for both single-modal and multi-modal downstream transfer, yielding results that are competitive with or significantly better than the baselines.
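
To make the objective concrete, below is a minimal, self-contained PyTorch sketch of the idea: tokens from several modalities are pooled, a small random subset is kept visible, and the reconstruction loss is computed on the masked tokens only. All module names, sizes, and the simplified shared encoder/decoder here are illustrative assumptions and do not reflect the actual MultiMAE implementation; see multimae/multimae.py for the real architecture.

```python
import torch
import torch.nn as nn

class ToyMultiModalMAE(nn.Module):
    """Illustrative sketch of multi-modal masked autoencoding (not the real MultiMAE)."""

    def __init__(self, dim=256, patch_dim=16 * 16 * 3, modalities=('rgb', 'depth', 'semseg')):
        super().__init__()
        self.modalities = modalities
        # One linear patch projection per modality (stand-in for the real input adapters).
        self.proj = nn.ModuleDict({m: nn.Linear(patch_dim, dim) for m in modalities})
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, patch_dim)  # stand-in for the per-task decoders

    def forward(self, patches, num_visible=49):
        # patches: dict of modality -> (B, N_patches, patch_dim) flattened patches
        tokens = torch.cat([self.proj[m](patches[m]) for m in self.modalities], dim=1)
        B, N, D = tokens.shape
        # Draw a small random subset of visible tokens jointly across all modalities.
        order = torch.rand(B, N).argsort(dim=1)
        vis_ids, mask_ids = order[:, :num_visible], order[:, num_visible:]
        visible = torch.gather(tokens, 1, vis_ids.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)
        # Simplified decoding: append mask tokens and run them through the same encoder.
        mask_tokens = self.mask_token.expand(B, mask_ids.shape[1], D)
        decoded = self.encoder(torch.cat([latent, mask_tokens], dim=1))[:, num_visible:]
        pred = self.decoder(decoded)
        targets = torch.cat([patches[m] for m in self.modalities], dim=1)
        target = torch.gather(targets, 1, mask_ids.unsqueeze(-1).expand(-1, -1, pred.shape[-1]))
        # Reconstruction loss is computed on the masked-out patches only.
        return ((pred - target) ** 2).mean()


# Dummy usage with random "patches" for three modalities of 196 tokens each.
model = ToyMultiModalMAE()
batch = {m: torch.randn(2, 196, 16 * 16 * 3) for m in ('rgb', 'depth', 'semseg')}
loss = model(batch)
loss.backward()
```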

Catalog

Pre-trained models

We provide the weights of our pre-trained MultiMAE ViT-B model, in MultiViT (multi-modal) format and timm (RGB-only) format.

For comparison, we also provide the weights of a MAE ViT-B model that we pre-trained using the official MAE codebase following the recommended settings.

| Method   | Arch. | Pre-training<br>modalities | Pre-training<br>epochs | Weights<br>(MultiViT) | Weights<br>(timm) | Config  |
|----------|-------|----------------------------|------------------------|-----------------------|-------------------|---------|
| MAE      | ViT-B | RGB                        | 1600                   | download              | download          | See MAE |
| MultiMAE | ViT-B | RGB+D+S                    | 1600                   | download              | download          | link    |
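
After downloading, a quick way to sanity-check a checkpoint is to inspect it with plain PyTorch. This is only a sketch: the file name below is a placeholder for wherever you saved the download, and the optional `'model'` wrapper key is an assumption about the checkpoint layout rather than a documented guarantee.

```python
import torch

# Placeholder path: point this at the downloaded MultiViT-format checkpoint.
ckpt = torch.load('multimae-b_pretrained.pth', map_location='cpu')

# Checkpoints are sometimes wrapped in a dict under a 'model' key; fall back otherwise.
state_dict = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt
print(f'{len(state_dict)} parameter tensors')
print(list(state_dict)[:5])  # first few parameter names
```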

These pre-trained models can then be fine-tuned using this codebase to reach the following performance:

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Classif. (@1)</th>
      <th colspan="7">Semantic Segmentation (mIoU)</th>
      <th>Depth (δ1)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td></td>
      <td align="center">ImageNet-1K<br>(RGB)</td>
      <td align="center">ADE20K<br>(RGB)</td>
      <td align="center" colspan="3">Hypersim<br>(RGB / D / RGB + D)</td>
      <td align="center" colspan="3">NYUv2<br>(RGB / D / RGB + D)</td>
      <td align="center">NYUv2<br>(RGB)</td>
    </tr>
    <tr>
      <td>Sup. (DeiT)</td>
      <td align="center">81.8</td>
      <td align="center">45.8</td>
      <td align="center">33.9</td>
      <td align="center">-</td>
      <td align="center">-</td>
      <td align="center">50.1</td>
      <td align="center">-</td>
      <td align="center">-</td>
      <td align="center">80.7</td>
    </tr>
    <tr>
      <td>MAE</td>
      <td align="center"><b>83.3</b></td>
      <td align="center"><b>46.2</b></td>
      <td align="center">36.5</td>
      <td align="center">-</td>
      <td align="center">-</td>
      <td align="center">50.8</td>
      <td align="center">-</td>
      <td align="center">-</td>
      <td align="center">85.1</td>
    </tr>
    <tr>
      <td><b>MultiMAE</b></td>
      <td align="center"><b>83.3</b></td>
      <td align="center"><b>46.2</b></td>
      <td align="center"><b>37.0</b></td>
      <td align="center"><b>38.5</b></td>
      <td align="center"><b>47.6</b></td>
      <td align="center"><b>52.0</b></td>
      <td align="center"><b>41.4</b></td>
      <td align="center"><b>56.0</b></td>
      <td align="center"><b>86.4</b></td>
    </tr>
  </tbody>
</table>

Model formats

We provide pre-trained weights in two different formats: the single-modal ViT / timm format, which is compatible with other popular ViT repositories (e.g., timm, DINO, MAE), and the multi-modal MultiMAE / MultiViT format, which is used throughout this codebase for multi-modal pre-training and fine-tuning. See multimae/multimae.py for the documentation and implementation of MultiMAE / MultiViT.

You can convert between these formats using the provided vit2multimae_converter.py and multimae2vit_converter.py scripts.
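For example, the timm-format weights can be loaded into a standard timm ViT-B/16 as an RGB-only backbone. The sketch below assumes the checkpoint stores either a plain state dict or one wrapped under a `'model'` key; the file name is a placeholder for the downloaded file.

```python
import timm
import torch

# Standard ViT-B/16 from timm, without a classification head.
model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=0)

# Placeholder path: point this at the downloaded timm-format checkpoint.
ckpt = torch.load('multimae-b_timm.pth', map_location='cpu')
state_dict = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt

# strict=False tolerates keys (e.g. heads or adapters) that differ between formats.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print('missing keys:', missing)
print('unexpected keys:', unexpected)
```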

Usage

Set-up

See SETUP.md for set-up instructions.

Pre-training

See PRETRAINING.md for pre-training instructions.

Fine-tuning

See FINETUNING.md for fine-tuning instructions.

Demo & visualizations

For interactive demos, please see our website. Open our Colab notebook to play around with the visualization code, or simply upload an image to our Hugging Face Spaces demo.

Acknowledgement

This repository is built using the timm, DeiT, DINO, MoCo v3, BEiT, MAE-priv, and MAE repositories.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

If you find this repository helpful, please consider citing our work:

@inproceedings{bachmann2022multimae,
  author    = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
  title     = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
  booktitle = {European Conference on Computer Vision},
  year      = {2022},
}