Home

Awesome

Masked Image Modeling with Denoising Contrast

Official PyTorch implementation and pretrained models of "Masked Image Modeling with Denoising Contrast" in International Conference on Learning Representations (ICLR) 2023.


Overview

Model Zoo

ArchEpochsResolutionAcc@1Fine-tuned model
ViT-S/16300224x22482.0model
ViT-B/16800224x22483.7model
ViT-L/16800224x22485.2model
ViT-L/161600224x22485.5model

Results on ImageNet1K

Result

Visualization

Visualize the self-attention map between [CLS] token and local tokens of the pre-trained ViT-B/16 model on ImageNet-1K, where (a) indicates ConMIM pretraining and (b) indicates the vanilla instance-level contrastive pre-training. Self-attention maps out of 12 attention heads are averaged. It can be observed that ConMIM-pretrained models are much more locally discriminative and aware of the visual context. Vis

Setup

Clone the github repo and install the required packages.

git clone https://github.com/TencentARC/ConMIM.git
pip install -r requirements.txt

For mixed-precision training, please install apex

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Data Preparation

/dataset
└── imagenet1k
    ├── train
    ├── val
    ├── train_map.txt
    └── val_map.txt

Pre-training on ImageNet-1K

OUTPUT_DIR="./output/conmim_pretrained"
DATA_PATH="./dataset/imagenet1k"
mkdir -p $OUTPUT_DIR

python -m torch.distributed.launch $@ run_conmim_pretraining.py \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --mask_ratio 0.75 \
        --model conmim_large_patch16_224 \
        --batch_size 64 --lr 7.5e-4 --warmup_epochs 10 --epochs 1600 \
        --clip_grad 1.0 --drop_path 0 --layer_scale_init_value 1e-5 \
        --mask_type 'random_mps32' \
        --imagenet_default_mean_and_std \
        --save_ckpt_freq 20

Fine-tuning on ImageNet-1K Classification

CKP="./output/conmim_pretrained/checkpoint_copy-799.pth"
OUTPUT_DIR="./output/conmim_finetuned/"
DATA_PATH="/dataset/imagenet1k/"
mkdir -p $OUTPUT_DIR

python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model beit_base_patch16_224 --data_path ${DATA_PATH}\
    --finetune ${CKP} \
    --output_dir ${OUTPUT_DIR} --batch_size 128 --lr 4e-3 --update_freq 1 \
    --warmup_epochs 20 --epochs 100 --layer_decay 0.65 --drop_path 0.1 \
    --weight_decay 0.05 --mixup 0.8 --cutmix 1.0 --nb_classes 1000 --enable_deepspeed \
    --imagenet_default_mean_and_std

Fine-tuning on ADE20K Semantic Segmentation

We follow the BEiT to complete our experiments

Fine-tuning on COCO Detection and Segmentation

We follow the MIMDet to complete our experiments

Acknowledgement

This repository is built using the BEiT repository, the mc-BEiT repository, the timm library, the DeiT repository, and the MIMDet repository.

Citation

If you find our work is useful for your research, please kindly cite our paper.

@article{yi2022masked,
  title={Masked image modeling with denoising contrast},
  author={Yi, Kun and Ge, Yixiao and Li, Xiaotong and Yang, Shusheng and Li, Dian and Wu, Jianping and Shan, Ying and Qie, Xiaohu},
  journal={International Conference on Learning Representations},
  year={2023}
}

Contact

If you have any questions, you can contact me from the email: kunyi@tencent.com or laneyikun@foxmail.com