MAWS
[Paper] [Colab] [BibTex] [Website]
Repository for the strong foundational MAWS + MAE models at all sizes, ranging from <100M parameters to >6.5B parameters, from the paper The effectiveness of MAE pre-pretraining for billion-scale pretraining. Models are available both for MAE pre-pretraining and for the follow-up WSP pretraining, i.e. MAE→WSP, a.k.a. MAWS (Masked Autoencoding → Weakly Supervised pretraining).
<p align="center"> <img width="539" alt="image" src="https://github.com/facebookresearch/maws/assets/13458796/69afa2ca-9976-4c64-9814-1f906be05e36"> </p>

Getting started
To get started with our models immediately, we have a notebook which you can run on Colab, or locally, for trying out our models in zero-shot mode.
For building any of our models, select which model type you would like to build. We have models available for:

- model_type="maws": MAWS (MAE→WSP) pretraining, i.e. MAE pre-pretraining followed by WSP pretraining. We also have ImageNet-1k finetuned weights for MAWS models under the same model type.
- model_type="maws_clip": MAWS pretrained models along with LiT-aligned text encoders for CLIP-style zero-shot classification.
- model_type="mae": MAE pretrained models.
- model_type="mae_in1k": MAE models pretrained on ImageNet-1k.
To access a model, specify the model architecture and the model type:
from maws.model_builder import build_model
# build a MAWS model with CLIP capabilities (via an aligned text encoder)
clip_model = build_model("vit_b16_xlmr_b", "maws_clip")
# build a MAWS model
maws_model = build_model("vit_b16", "maws")
# build a MAWS model finetuned on IN1k
maws_in1k_model = build_model("vit_b16_ft_in1k", "maws")
# build an MAE model
mae_model = build_model("vit_b16", "mae")
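As a quick sketch of running a built model on an image (the preprocessing below, a 224px center crop with ImageNet mean/std normalization, and the exact output semantics are assumptions on our part; the Colab notebook is the reference for the real transforms and usage):
import torch
from PIL import Image
from torchvision import transforms

from maws.model_builder import build_model

# Assumed preprocessing: resize + 224px center crop + ImageNet normalization
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

model = build_model("vit_b16", "maws").eval()
# "example.jpg" is a placeholder for any local image
image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    # Pretrained encoders should return image features; the *_ft_in1k variants
    # return ImageNet-1k class logits instead (assumption)
    output = model(image)
print(output.shape)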
The models are also available via torch.hub:
import torch

# build a MAWS model with CLIP capabilities (via an aligned text encoder)
clip_model = torch.hub.load("facebookresearch/maws", model="vit_b16_xlmr_b_maws_clip")
# build a MAWS model
maws_model = torch.hub.load("facebookresearch/maws", model="vit_b16_maws")
# build a MAWS model finetuned on IN1k
maws_model = torch.hub.load("facebookresearch/maws", model="vit_b16_ft_in1k_maws")
# build an MAE model
mae_model = torch.hub.load("facebookresearch/maws", model="vit_b16_mae")
We list all the available models and direct download links in the following section.
Installation instructions
conda create --name maws python=3.10
conda activate maws
pip install torch torchvision torchtext
pip install timm==0.9.7
# for demo
pip install jupyter ipywidgets matplotlib
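To verify the environment, a minimal smoke test (an optional suggestion, not part of the official instructions) is to load one of the torch.hub models listed below; this downloads the ViT-B weights and therefore needs network access:
import torch

# Loads the ViT-B MAWS model via torch.hub and prints its parameter count
model = torch.hub.load("facebookresearch/maws", model="vit_b16_maws")
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")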
Available models
MAWS pretrained models
Model | Pretrained name + weights | IN1k 224px linear top-1 | IN1k 512/518px finetuned name + weights | IN1k 512/518px finetuned top-1 | Text encoder | 0-shot name + weights | IN1k 224px 0-shot top-1 |
---|---|---|---|---|---|---|---|
ViT-B | vit_b16 | 83.3 | vit_b16_ft_in1k | 86.8 | XLMR-B | vit_b16_xlmr_b | 74.9 |
ViT-L | vit_l16 | 86.1 | vit_l16_ft_in1k | 88.8 | XLMR-L | vit_l16_xlmr_l | 79.7 |
ViT-H | vit_h14 | 87.5 | vit_h14_ft_in1k | 89.5 | XLMR-L | vit_h14_xlmr_l | 81.1 |
ViT-2B | vit_2b14 | 88.1 | vit_2b14_ft_in1k | 89.8 | XLMR-L | vit_2b14_xlmr_l | 82.1 |
ViT-6.5B | vit_6.5b14 | 88.6 | vit_6.5b14_ft_in1k | 90.1 | - | - | - |
MAE pretrained models
Model | Pretrained name + weights | IN1k 224px finetuned top-1 |
---|---|---|
ViT-B | vit_b16 | 83.5 |
ViT-L | vit_l16 | 86.1 |
ViT-H | vit_h14 | 87.4 |
ViT-2B | vit_2b14 | 87.8 |
ViT-6.5B | vit_6.5b14 | 88.3 |
MAE pretrained on ImageNet-1k
Model | Pretrained name + weights | IN1k 224px finetuned top-1 |
---|---|---|
ViT-2B | vit_2b14 | 87.4 |
MAE pretrained on ImageNet-21k
Model | Pretrained name + weights | IN1k 512px finetuned top-1 |
---|---|---|
ViT-L | vit_l16 | 86.9 |
Evaluation on ImageNet-1k
Finetuned
We share weights for the MAWS models finetuned on ImageNet-1k at high resolution (512px for ViT-B and ViT-L, 518px for ViT-H, ViT-2B and ViT-6.5B). $IN1K_VAL_PATH should be the path to the ImageNet-1k val root folder.
python eval_finetuned.py -m vit_b16_ft_in1k -i 512 -b 25 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 86.832
python eval_finetuned.py -m vit_l16_ft_in1k -i 512 -b 10 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 88.796
python eval_finetuned.py -m vit_h14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 89.502
python eval_finetuned.py -m vit_2b14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 89.752
python eval_finetuned.py -m vit_6.5b14_ft_in1k -i 518 -b 5 -p $IN1K_VAL_PATH
# ImageNet-1k top-1 accuracy: 90.064
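The numbers printed above are standard ImageNet-1k validation top-1 accuracy. For reference, a minimal, repo-independent sketch of that metric (eval_finetuned.py additionally handles the resolution-specific preprocessing, which is omitted here; the dataset is assumed to be an ImageFolder-style val set with matching class indices):
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def top1_accuracy(model, dataset, batch_size=25, device="cuda"):
    # Fraction of images whose highest-scoring logit matches the ground-truth label
    model = model.to(device).eval()
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=8)
    correct, total = 0, 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total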
Zero-shot
Please refer to the MAWS pretrained models section above for all the available model names. $IN1K_VAL_PATH should be the path to the ImageNet-1k val root folder.
python eval_zeroshot.py -m vit_b16_xlmr_b -b 25 -p $IN1K_VAL_PATH
# Zero shot ImageNet-1k top-1 accuracy: 74.888
# Trying French instead with a larger model on a 32GB V100
python eval_zeroshot.py -m vit_2b14_xlmr_l --language french -b 5 -p $IN1K_VAL_PATH
# Zero shot ImageNet-1k top-1 accuracy: 62.622
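Under the hood, CLIP-style zero-shot classification embeds the class prompts (in whichever language) with the text encoder, embeds each image with the image encoder, and picks the prompt with the highest cosine similarity. A minimal, repo-independent sketch of that scoring step, assuming the two encoders produce embeddings in a shared space:
import torch
import torch.nn.functional as F

def zero_shot_predict(image_embeddings: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between every image and every class prompt;
    # the prediction is the prompt with the highest similarity
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    similarity = image_embeddings @ text_embeddings.T  # (num_images, num_classes)
    return similarity.argmax(dim=-1)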
Citation
If you use our models or find this work useful in your research, please give us a star and cite:
@inproceedings{singh2023effectiveness,
title={The effectiveness of MAE pre-pretraining for billion-scale pretraining},
author={Singh, Mannat and Duval, Quentin and Alwala, Kalyan Vasudev and Fan, Haoqi and Aggarwal, Vaibhav and Adcock, Aaron and Joulin, Armand and Doll{\'a}r, Piotr and Feichtenhofer, Christoph and Girshick, Ross and Girdhar, Rohit and Misra, Ishan},
booktitle={ICCV},
year={2023}
}
License
Our models are released under the CC-BY-NC 4.0 license. See LICENSE for additional details.