MAWS
[Paper] [Colab] [BibTex] [Website]
Repository for the strong foundational MAWS + MAE models at all sizes, ranging from <100M parameters to >6.5B parameters, from the paper The effectiveness of MAE pre-pretraining for billion-scale pretraining. Models are available both for MAE pre-pretraining and for the follow-up WSP pretraining, i.e. MAE→WSP, a.k.a. MAWS (Masked Autoencoding → Weakly Supervised pretraining).
<p align="center"> <img width="539" alt="image" src="https://github.com/facebookresearch/maws/assets/13458796/69afa2ca-9976-4c64-9814-1f906be05e36"> </p>

Getting started
To get started with our models immediately, we have a notebook which you can run on Colab, or locally, for trying out our models in zero-shot mode.
For building any of our models, select which model type you would like to build. We have models available for:

- model_type="maws": MAWS (MAE→WSP) pretraining, i.e. MAE pre-pretraining followed by WSP pretraining. We also have ImageNet-1k finetuned weights for MAWS models under the same model type.
- model_type="maws_clip": MAWS pretrained models along with LiT-aligned text encoders for CLIP-style zero-shot classification.
- model_type="mae": MAE pretrained models.
- model_type="mae_in1k": MAE models pretrained on ImageNet-1k.
To access a model, specify the model architecture and the model type:
from maws.model_builder import build_model
# build a MAWS model with CLIP capabilities (via an aligned text encoder)
clip_model = build_model("vit_b16_xlmr_b", "maws_clip")
# build a MAWS model
maws_model = build_model("vit_b16", "maws")
# build a MAWS model finetuned on IN1k
maws_in1k_model = build_model("vit_b16_ft_in1k", "maws")
# build an MAE model
mae_model = build_model("vit_b16", "mae")
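As a quick sketch of running a built model on an image (the preprocessing below, a 224px center crop with ImageNet mean/std normalization, and the exact output semantics are assumptions on our part; the Colab notebook is the reference for the real transforms and usage):
import torch
from PIL import Image
from torchvision import transforms

from maws.model_builder import build_model

# Assumed preprocessing: resize + 224px center crop + ImageNet normalization
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

model = build_model("vit_b16", "maws").eval()
# "example.jpg" is a placeholder for any local image
image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    # Pretrained encoders should return image features; the *_ft_in1k variants
    # return ImageNet-1k class logits instead (assumption)
    output = model(image)
print(output.shape)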
The models are also available via torch.hub:
import torch

# build a MAWS model with CLIP capabilities (via an aligned text encoder)
clip_model = torch.hub.load("facebookresearch/maws", model="vit_b16_xlmr_b_maws_clip")
# build a MAWS model
maws_model = torch.hub.load("facebookresearch/maws", model="vit_b16_maws")
# build a MAWS model finetuned on IN1k
maws_model = torch.hub.load("facebookresearch/maws", model="vit_b16_ft_in1k_maws")
# build an MAE model
mae_model = torch.hub.load("facebookresearch/maws", model="vit_b16_mae")
We list all the available models and direct download links in the following section.
Installation instructions
conda create --name maws python=3.10
conda activate maws
pip install torch torchvision torchtext
pip install timm==0.9.7
# for demo
pip install jupyter ipywidgets matplotlib
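To verify the environment, a minimal smoke test (an optional suggestion, not part of the official instructions) is to load one of the torch.hub models listed below; this downloads the ViT-B weights and therefore needs network access:
import torch

# Loads the ViT-B MAWS model via torch.hub and prints its parameter count
model = torch.hub.load("facebookresearch/maws", model="vit_b16_maws")
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")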
Available models
MAWS pretrained models
Model | Pretrained name + weights | IN1k 224px linear top-1 | IN1k 512/518px finetuned name + weights | IN1k 512/518px finetuned top-1 | Text encoder | 0-shot name + weights | IN1k 224px 0-shot top-1 |
---|---|---|---|---|---|---|---|
ViT-B | vit_b16 | 83.3 | vit_b16_ft_in1k | 86.8 | XLMR-B | vit_b16_xlmr_b | 74.9 |
ViT-L | vit_l16 | 86.1 | vit_l16_ft_in1k | 88.8 | XLMR-L | vit_l16_xlmr_l | 79.7 |
ViT-H | vit_h14 | 87.5 | vit_h14_ft_in1k | 89.5 | XLMR-L | vit_h14_xlmr_l | 81.1 |
ViT-2B | vit_2b14 | 88.1 | vit_2b14_ft_in1k | 89.8 | XLMR-L | vit_2b14_xlmr_l | 82.1 |
ViT-6.5B | vit_6.5b14 | 88.6 | vit_6.5b14_ft_in1k | 90.1 | - | - | - |
MAE pretrained models
Model | Pretrained name + weights | IN1k 224px finetuned top-1 |
---|---|---|
ViT-B | vit_b16 | 83.5 |
ViT-L | vit_l16 | 86.1 |
ViT-H | vit_h14 | 87.4 |
ViT-2B | vit_2b14 | 87.8 |
ViT-6.5B | vit_6.5b14 | 88.3 |
MAE pretrained on ImageNet-1k
Model | Pretrained name + weights | IN1k 224px finetuned top-1 |
---|---|---|
ViT-2B | vit_2b14 | 87.4 |
MAE pretrained on ImageNet-21k
Model | Pretrained name + weights | IN1k 512px finetuned top-1 |
---|---|---|
ViT-L | vit_l16 | 86.9 |
Evaluation on ImageNet-1k
Finetuned
We share weights for the MAWS models finetuned on ImageNet-1k at high resolution (512px for ViT-B and ViT-L, 518px for ViT-H, ViT-2B and ViT-6.5B). $IN1K_VAL_PATH should be the path to the ImageNet-1k val root folder.
python eval_finetuned.py -m vit_b16_ft_in1k -i 512 -b 25 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 86.832
python eval_finetuned.py -m vit_l16_ft_in1k -i 512 -b 10 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 88.796
python eval_finetuned.py -m vit_h14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 89.502
python eval_finetuned.py -m vit_2b14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 89.752
python eval_finetuned.py -m vit_6.5b14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 90.064
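The numbers printed above are standard ImageNet-1k validation top-1 accuracy. For reference, a minimal, repo-independent sketch of that metric (eval_finetuned.py additionally handles the resolution-specific preprocessing, which is omitted here; the dataset is assumed to be an ImageFolder-style val set with matching class indices):
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def top1_accuracy(model, dataset, batch_size=25, device="cuda"):
    # Fraction of images whose highest-scoring logit matches the ground-truth label
    model = model.to(device).eval()
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=8)
    correct, total = 0, 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total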
Zero-shot
Please refer to the MAWS pretrained models section above for all the available model names. $IN1K_VAL_PATH should be the path to the ImageNet-1k val root folder.
python eval_zeroshot.py -m vit_b16_xlmr_b -b 25 -p $IN1K_VAL_PATH
# Zero shot ImageNet-1k top-1 accuracy: 74.888
# Trying French instead with a larger model on a 32GB V100
python eval_zeroshot.py -m vit_2b14_xlmr_l --language french -b 5 -p $IN1K_VAL_PATH
# Zero shot ImageNet-1k top-1 accuracy: 62.622
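Under the hood, CLIP-style zero-shot classification embeds the class prompts (in whichever language) with the text encoder, embeds each image with the image encoder, and picks the prompt with the highest cosine similarity. A minimal, repo-independent sketch of that scoring step, assuming the two encoders produce embeddings in a shared space:
import torch
import torch.nn.functional as F

def zero_shot_predict(image_embeddings: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between every image and every class prompt;
    # the prediction is the prompt with the highest similarity
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    similarity = image_embeddings @ text_embeddings.T  # (num_images, num_classes)
    return similarity.argmax(dim=-1)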
Citation
If you use our models or find this work useful in your research, please give us a star and cite:
@inproceedings{singh2023effectiveness,
title={The effectiveness of MAE pre-pretraining for billion-scale pretraining},
author={Singh, Mannat and Duval, Quentin and Alwala, Kalyan Vasudev and Fan, Haoqi and Aggarwal, Vaibhav and Adcock, Aaron and Joulin, Armand and Doll{\'a}r, Piotr and Feichtenhofer, Christoph and Girshick, Ross and Girdhar, Rohit and Misra, Ishan},
booktitle={ICCV},
year={2023}
}
License
Our models are released under the CC-BY-NC 4.0 license. See LICENSE for additional details.