MAWS

[Paper] [Colab] [BibTex] [Website]

Repository for the strong foundational MAWS + MAE models at all sizes, ranging from <100M parameters to >6.5B parameters, from the paper The effectiveness of MAE pre-pretraining for billion-scale pretraining. Models are available both for MAE pre-pretraining and for the follow-up WSP pretraining, MAE→WSP a.k.a. MAWS (Masked Autoencoding → Weakly Supervised pretraining).

<p align="center"> <img width="539" alt="image" src="https://github.com/facebookresearch/maws/assets/13458796/69afa2ca-9976-4c64-9814-1f906be05e36"> </p>

Getting started

To start playing with our models immediately, we have a notebook that you can run on Colab or locally, demonstrating our models in zero-shot mode.

To build any of our models, first select the model type you would like. We have models available for:

  1. model_type="maws": MAWS (MAE→WSP) pretraining, i.e. MAE pre-pretraining followed by WSP pretraining. We also have ImageNet-1k finetuned weights for MAWS models using the same model type.
  2. model_type="maws_clip": MAWS pretrained models along with LiT-aligned text encoders for CLIP-style zero-shot classification
  3. model_type="mae": MAE pretrained models
  4. model_type="mae_in1k": MAE models pretrained on ImageNet-1k

To access a model, specify the model architecture and the model type:

from maws.model_builder import build_model

# build a MAWS model with CLIP capabilities (via an aligned text encoder)
clip_model = build_model("vit_b16_xlmr_b", "maws_clip")

# build a MAWS model
maws_model = build_model("vit_b16", "maws")

# build a MAWS model finetuned on IN1k
maws_in1k_model = build_model("vit_b16_ft_in1k", "maws")

# build an MAE model
mae_model = build_model("vit_b16", "mae")

The models are also available via torch.hub:

import torch

# build a MAWS model with CLIP capabilities (via an aligned text encoder)
clip_model = torch.hub.load("facebookresearch/maws", model="vit_b16_xlmr_b_maws_clip")

# build a MAWS model
maws_model = torch.hub.load("facebookresearch/maws", model="vit_b16_maws")

# build a MAWS model finetuned on IN1k
maws_in1k_model = torch.hub.load("facebookresearch/maws", model="vit_b16_ft_in1k_maws")

# build an MAE model
mae_model = torch.hub.load("facebookresearch/maws", model="vit_b16_mae")
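
Once built, the models can be used like any other PyTorch image model. The snippet below is a minimal sketch of single-image inference with the IN1k-finetuned checkpoint; it assumes the model's forward pass takes a normalized image batch and returns ImageNet-1k logits, and the image path and preprocessing values are illustrative rather than taken from this repository (see the demo notebook for the exact transforms).

import torch
from PIL import Image
from torchvision import transforms

from maws.model_builder import build_model

# Illustrative preprocessing at the 512px finetuning resolution, with standard
# ImageNet normalization (an assumption; check the demo notebook for the exact
# values used with each checkpoint).
preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = build_model("vit_b16_ft_in1k", "maws").eval()
image = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # "cat.jpg" is a placeholder path

with torch.no_grad():
    logits = model(image)  # assumed to be ImageNet-1k class logits
    top5 = logits.softmax(dim=-1).topk(5)
print(top5.indices, top5.values)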

We list all the available models and direct download links in the following section.

Installation instructions

conda create --name maws python=3.10
conda activate maws
pip install torch torchvision torchtext
pip install timm==0.9.7
# for demo
pip install jupyter ipywidgets matplotlib
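
As an optional smoke test (not part of the repository's instructions), you can verify the environment by building the smallest MAE model in Python and printing its parameter count; this assumes build_model fetches the pretrained checkpoint on first use.

from maws.model_builder import build_model

# Build the smallest MAE encoder and count its parameters to confirm the install works.
model = build_model("vit_b16", "mae")
num_params = sum(p.numel() for p in model.parameters())
print(f"vit_b16 (mae): {num_params / 1e6:.1f}M parameters")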

Available models

MAWS pretrained models

| Model | Pretrained name + weights | IN1k 224px linear top-1 | IN1k 512/518px finetuned name + weights | IN1k 512/518px finetuned top-1 | Text encoder | 0-Shot name + weights | IN1k 224px 0-shot top-1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B | vit_b16 | 83.3 | vit_b16_ft_in1k | 86.8 | XLMR-B | vit_b16_xlmr_b | 74.9 |
| ViT-L | vit_l16 | 86.1 | vit_l16_ft_in1k | 88.8 | XLMR-L | vit_l16_xlmr_l | 79.7 |
| ViT-H | vit_h14 | 87.5 | vit_h14_ft_in1k | 89.5 | XLMR-L | vit_h14_xlmr_l | 81.1 |
| ViT-2B | vit_2b14 | 88.1 | vit_2b14_ft_in1k | 89.8 | XLMR-L | vit_2b14_xlmr_l | 82.1 |
| ViT-6.5B | vit_6.5b14 | 88.6 | vit_6.5b14_ft_in1k | 90.1 | - | - | - |

MAE pretrained models

| Model | Pretrained name + weights | IN1k 224px finetuned top-1 |
| --- | --- | --- |
| ViT-B | vit_b16 | 83.5 |
| ViT-L | vit_l16 | 86.1 |
| ViT-H | vit_h14 | 87.4 |
| ViT-2B | vit_2b14 | 87.8 |
| ViT-6.5B | vit_6.5b14 | 88.3 |

MAE pretrained on ImageNet-1k

| Model | Pretrained name + weights | IN1k 224px finetuned top-1 |
| --- | --- | --- |
| ViT-2B | vit_2b14 | 87.4 |

MAE pretrained on ImageNet-21k

| Model | Model name + weights | IN1k 512px finetuned top-1 |
| --- | --- | --- |
| ViT-L | vit_l16 | 86.9 |

Evaluation on ImageNet-1k

Finetuned

We share weights for the MAWS models finetuned on ImageNet-1k at high resolution (512px for ViT-B and ViT-L, 518px for ViT-H, ViT-2B and ViT-6.5B). $IN1K_VAL_PATH should be the path to the ImageNet-1k val root folder.

python eval_finetuned.py -m vit_b16_ft_in1k -i 512 -b 25 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 86.832

python eval_finetuned.py -m vit_l16_ft_in1k -i 512 -b 10 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 88.796

python eval_finetuned.py -m vit_h14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 89.502

python eval_finetuned.py -m vit_2b14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 89.752

python eval_finetuned.py -m vit_6.5b14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 90.064
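
For reference, what these commands compute is standard top-1 accuracy over the ImageNet-1k val set. The sketch below is not eval_finetuned.py itself: it assumes a torchvision ImageFolder layout for the val folder, illustrative 512px preprocessing, and that the model's class indices match the ImageFolder ordering.

import torch
from torchvision import datasets, transforms

from maws.model_builder import build_model

IN1K_VAL_PATH = "/path/to/imagenet-1k/val"  # standard ImageFolder layout

# Illustrative preprocessing for the 512px vit_b16_ft_in1k checkpoint; the
# repository's script may use different resize/crop settings.
preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

dataset = datasets.ImageFolder(IN1K_VAL_PATH, transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=25, num_workers=8)

model = build_model("vit_b16_ft_in1k", "maws").cuda().eval()

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        logits = model(images.cuda())  # assumed ImageNet-1k class logits
        correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
        total += labels.numel()
print(f"ImageNet-1k top-1 accuracy: {100 * correct / total:.3f}")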

Zero-shot

Please refer to the MAWS pretrained models section above for all the available model names. $IN1K_VAL_PATH should be the path to the ImageNet-1k val root folder.

python eval_zeroshot.py -m vit_b16_xlmr_b -b 25 -p $IN1K_VAL_PATH
# Zero shot ImageNet-1k top-1 accuracy: 74.888

# Trying French instead, with a larger model, on a 32GB V100
python eval_zeroshot.py -m vit_2b14_xlmr_l --language french -b 5 -p $IN1K_VAL_PATH
# Zero shot ImageNet-1k top-1 accuracy: 62.622
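
For context, the zero-shot evaluation follows the standard CLIP recipe: each ImageNet class name is turned into a prompt in the chosen language, embedded with the aligned text encoder, and each image is assigned the class whose prompt embedding has the highest cosine similarity with its image embedding. The sketch below shows only that scoring step, on random placeholder embeddings; it is not the repository's eval_zeroshot.py and does not use the model's actual encoding API.

import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the image encoder outputs and the
# aligned text encoder outputs (one row per class prompt); shapes are illustrative.
image_features = torch.randn(8, 768)     # 8 images, 768-dim embeddings
text_features = torch.randn(1000, 768)   # 1000 class prompts

# CLIP-style zero-shot scoring: cosine similarity between each image embedding
# and every class-prompt embedding; the prediction is the most similar prompt.
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
similarity = image_features @ text_features.T   # (8, 1000)
predictions = similarity.argmax(dim=-1)         # predicted class index per image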

Citation

If you use our models or find this work useful in your research, please give us a star and cite:

@inproceedings{singh2023effectiveness,
    title={The effectiveness of MAE pre-pretraining for billion-scale pretraining},
    author={Singh, Mannat and Duval, Quentin and Alwala, Kalyan Vasudev and Fan, Haoqi and Aggarwal, Vaibhav and Adcock, Aaron and Joulin, Armand and Doll{\'a}r, Piotr and Feichtenhofer, Christoph and Girshick, Ross and Girdhar, Rohit and Misra, Ishan},
    booktitle={ICCV},
    year={2023}
}

License

Our models are released under the CC-BY-NC 4.0 license. See LICENSE for additional details.