CAE: Context AutoEncoder for Self-Supervised Representation Learning

<p align="center"> <img src='furnace/CAE.png'> </p>

This is a PyTorch implementation of CAE: Context AutoEncoder for Self-Supervised Representation Learning.

Highlights

Installation

Clone the repo and install the required packages.

pip install -r requirements.txt

# install apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
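
If the apex build succeeds, a quick import check (just an optional sanity check, not part of the official instructions) confirms that apex is visible to your Python environment:

python -c "import apex; print('apex imported from', apex.__file__)"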

Data Preparation

First, download ImageNet-1k from http://image-net.org/.

The directory structure is the standard layout of torchvision's datasets.ImageFolder. The training and validation data are expected to be in the train/ and val/ folders, respectively:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
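
With the data in place, an optional sanity check helps catch layout mistakes early. The expected counts below assume the standard ImageNet-1k split (1000 classes, 50,000 validation images); adjust the path to your own setup.

# count class folders and validation images (case-insensitive match on .jpeg)
ls /path/to/imagenet/train | wc -l                    # expect 1000
find /path/to/imagenet/val -iname '*.jpeg' | wc -l    # expect 50000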

Second, download the pretrained DALL-E tokenizer.

TOKENIZER_PATH=/path/to/save/dall_e_tokenizer_weight
mkdir -p $TOKENIZER_PATH
wget -O $TOKENIZER_PATH/encoder.pkl https://cdn.openai.com/dall-e/encoder.pkl
wget -O $TOKENIZER_PATH/decoder.pkl https://cdn.openai.com/dall-e/decoder.pkl
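
A quick optional check that both tokenizer files were downloaded and are non-empty:

ls -lh $TOKENIZER_PATH/encoder.pkl $TOKENIZER_PATH/decoder.pkl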

Pretraining

Here is an example that pretrains CAE-base on ImageNet-1K with 32 GPUs. Please see scripts/cae_base_800e.sh for the complete script.

OMP_NUM_THREADS=1 $PYTHON -m torch.distributed.launch \
  --nproc_per_node=8 \
  tools/run_pretraining.py \
  --data_path ${DATA_PATH} \
  --output_dir ${OUTPUT_DIR} \
  --model cae_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \
  --batch_size 64 --lr 1.5e-3 --warmup_epochs 20 --epochs 800 \
  --clip_grad 3.0 --layer_scale_init_value 0.1 \
  --imagenet_default_mean_and_std \
  --color_jitter 0 \
  --drop_path 0.1 \
  --sincos_pos_emb \
  --mask_generator block \
  --num_mask_patches 98 \
  --decoder_layer_scale_init_value 0.1 \
  --no_auto_resume \
  --save_ckpt_freq 100 \
  --exp_name $my_name \
  --regressor_depth 4 \
  --decoder_depth 4 \
  --align_loss_weight 2

The warmup epochs for 300/800/1600-epoch pretraining are 10, 20, and 40, respectively.
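
Note that --nproc_per_node=8 launches 8 processes per node, so the 32-GPU setting corresponds to 4 nodes and an effective batch size of 64 x 32 = 2048. A minimal multi-node sketch is shown below; the --nnodes/--node_rank/--master_addr/--master_port flags are standard torch.distributed.launch options, and the concrete values are placeholders rather than something prescribed by this repo.

# run on each of the 4 nodes; NODE_RANK is 0..3 and MASTER_ADDR points at node 0
OMP_NUM_THREADS=1 $PYTHON -m torch.distributed.launch \
  --nproc_per_node=8 --nnodes=4 --node_rank=${NODE_RANK} \
  --master_addr=${MASTER_ADDR} --master_port=29500 \
  tools/run_pretraining.py \
  --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} \
  ... # remaining flags identical to the single-node command above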

For CAE-large, please refer to scripts/cae_large_1600e.sh.

Results

Below are the results of CAE-base and CAE-large on the following evaluation tasks:

Pretrained weights and logs are available (Google Drive, Baidu Cloud [Code: 4kil]). *: results from the CAE paper.

| Model | Pretraining data | #Epoch | Linear | Attentive | Fine-tuning | ADE Seg | COCO Det | COCO InstSeg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MAE-base* | ImageNet-1K | 1600 | 67.8 | 74.2 | 83.6 | 48.1 | 48.4 | 42.6 |
| MAE-large* | ImageNet-1K | 1600 | 76.0 | 78.8 | 86.0 | 53.6 | 54.0 | 47.1 |
| CAE-base | ImageNet-1K | 300 | 64.5 | 74.0 | 83.6 | 48.1 | 48.3 | 42.7 |
| CAE-base | ImageNet-1K | 800 | 68.9 | 75.9 | 83.8 | 49.7 | 49.9 | 43.9 |
| CAE-base | ImageNet-1K | 1600 | 70.3 | 77.2 | 83.9 | 50.3 | 50.3 | 44.2 |
| CAE-large | ImageNet-1K | 1600 | 77.8 | 81.2 | 86.2 | 54.9 | 54.5 | 47.5 |

Linear Probing

Attentive Probing

Fine-tuning

Segmentation & Detection

Acknowledgement

This repository is built on BEiT and MMSelfSup; thanks for their open-source code! Thanks also to the CAE authors for their excellent work!

Citation

@article{ContextAutoencoder2022,
  title={Context Autoencoder for Self-Supervised Representation Learning},
  author={Chen, Xiaokang and Ding, Mingyu and Wang, Xiaodi and Xin, Ying and Mo, Shentong and Wang, Yunhao and Han, Shumin and Luo, Ping and Zeng, Gang and Wang, Jingdong},
  journal={arXiv preprint arXiv:2202.03026},
  year={2022}
}