


This repository is a PyTorch implementation of <i>On Equivariant and Invariant Learning of Object Landmark Representations</i> by Zezhou Cheng, Jong-Chyi Su, and Subhransu Maji (ICCV 2021).

[arXiv] [Project page] [Poster] [Supplementary material]


The implementation is based on DVE [Thewlis et al. ICCV 2019] and CMC [Tian et al. 2019]. (Dependencies: tensorboard-logger, pytorch=1.4.0, torchfile)

To install:

conda env create -f environment.yml
conda activate ContrastLandmark


Human faces




Stage 1: invariant representation learning

CUDA_VISIBLE_DEVICES=0,1,2,3 python train_moco.py --batch_size 256 --num_workers 12 --nce_k 4096 --cosine  --epochs 800 --model resnet50 --image_crop 20 --image_size 136 --model_name moco_CelebA --model_path /path/to/save/model --dataset CelebA --data_folder datasets/celeba
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_moco.py --batch_size 256 --num_workers 12 --nce_k 4096 --cosine  --epochs 800 --model resnet50 --image_crop 0 --image_size 96 --model_name moco_InatAve --model_path /path/to/save/model --dataset InatAve --imagelist /path/to/imagelist/inat_train_100K.txt
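
The two Stage 1 commands above share most flags and differ only in the crop and image size, the model name, and how the data is located (`--data_folder` for CelebA, `--imagelist` for iNat Aves). The sketch below makes that difference explicit by assembling the argument list from two dictionaries. `build_moco_command` is a hypothetical helper written for this README, not part of the repository; all flag names and values are copied from the commands above.

```python
import shlex

# Flags shared by both Stage 1 commands above.
COMMON = {
    "batch_size": 256,
    "num_workers": 12,
    "nce_k": 4096,
    "epochs": 800,
    "model": "resnet50",
}

# Flags that differ between the two datasets (values copied from the commands above).
PER_DATASET = {
    "CelebA": {
        "image_crop": 20,
        "image_size": 136,
        "model_name": "moco_CelebA",
        "data_folder": "datasets/celeba",
    },
    "InatAve": {
        "image_crop": 0,
        "image_size": 96,
        "model_name": "moco_InatAve",
        "imagelist": "/path/to/imagelist/inat_train_100K.txt",
    },
}


def build_moco_command(dataset, model_path="/path/to/save/model"):
    """Hypothetical helper: return the train_moco.py argv list for one dataset."""
    flags = {**COMMON, **PER_DATASET[dataset],
             "dataset": dataset, "model_path": model_path}
    argv = ["python", "train_moco.py", "--cosine"]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


# Print both commands in a shell-safe form (prefix with CUDA_VISIBLE_DEVICES as needed).
for name in PER_DATASET:
    print(shlex.join(build_moco_command(name)))
```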

Stage 2: equivariant representation projection

CUDA_VISIBLE_DEVICES=0,1 python train_feature_projector.py --model resnet50 --feat_distill --image_crop 20 --image_size 136 --train_layer 4 --val_layer 4 --trained_model_path /path/to/pretrained_moco --adam --epochs 10 --cosine --batch_size 32 --log_path /path/to/logfile.log --model_name feature_projector --model_path /path/to/save/checkpoint --train_use_hypercol --val_use_hypercol --vis_path /path/to/save/visualization --train_out_size 24 --val_out_size 96 --distill_mode softmax --kernel_size 1 --out_dim 128 --softargmax_mul 7. --temperature 7. 



1. Landmark regression

Face benchmarks (CelebA → AFLW)
CUDA_VISIBLE_DEVICES=0,1 python eval_face.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path /path/to/pretrainedMoCo --learning_rate 0.001 --weight_decay 0.0005 --adam --epochs 200 --cosine --batch_size 32 --log_path /path/to/logfile --dataset AFLW_MTFL --model_name AFLW_M_regressor --model_path /path/to/save/regressor --image_crop 20 --image_size 136 --use_hypercol
CUDA_VISIBLE_DEVICES=0 python eval_face.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path /path/to/pretrainedMoCo --learning_rate 0.01 --weight_decay 0.05 --adam --epochs 1000 --cosine --batch_size 32 --log_path /path/to/logfile --dataset AFLW_MTFL --model_name AFLW_M_regressor --model_path /path/to/save/regressor --image_crop 20 --image_size 136 --restrict_annos 50  --repeat --TPS_aug --use_hypercol

Note: the number of GPUs used to train the linear regressor affects the convergence rate, possibly because batch normalization is computed separately on each GPU. On 2 GPUs, we stop training at the 120th, 45th, and 80th epoch on the MAFL, AFLW, and 300W benchmarks respectively (these stopping points were determined from our initial results and kept fixed in our experiments). They may be suboptimal when you train the regressor on a different number of GPUs.
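
To illustrate the possible cause named in the note, the toy sketch below compares the batch statistics seen on one device with those seen when the same batch is split across two devices. It assumes data-parallel training splits each batch into equal contiguous shards and that each device normalizes with its own shard's mean and variance; the activation values are made up, and the helper names are hypothetical rather than taken from this repository.

```python
from statistics import fmean, pvariance


def batch_stats(values):
    """Mean and biased variance, the statistics batch norm uses to normalize a batch."""
    return fmean(values), pvariance(values)


def per_device_stats(values, num_devices):
    """Split a batch into equal contiguous shards and compute statistics per shard."""
    shard = len(values) // num_devices
    return [batch_stats(values[i * shard:(i + 1) * shard])
            for i in range(num_devices)]


# A toy batch of 8 scalar activations (made-up numbers).
batch = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]

print(batch_stats(batch))          # statistics seen when the whole batch is on 1 GPU
print(per_device_stats(batch, 2))  # statistics seen by each of 2 GPUs
```

Each shard sees a different mean and a much smaller variance than the full batch, so the normalized activations, and hence the optimization trajectory, depend on how many devices the batch is split over.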

Bird benchmarks (iNat → CUB)
CUDA_VISIBLE_DEVICES=0,1 python eval_animal.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path /path/to/pretrainedMoCo --learning_rate 0.01 --weight_decay 0.005 --adam --epochs 2000 --cosine --batch_size 32 --log_path /path/to/logfile --dataset CUB --model_name CUB_regressor --model_path /path/to/save/regressor --image_crop 0 --image_size 96 --imagelist /path/to/trainlist/train.txt --use_hypercol

Note: see data_loaders_animal.py, and place the annotation files (train.dat, val.data) and the train/val/test text files under ./datasets/CUB-200-2011. Hyperparameter settings on the bird benchmarks: if the number of annotations is less than or equal to 100 (e.g. 10, 50, 100), use lr=0.01 and weight decay=0.05 for ResNet18, ResNet50, and DVE; if more annotations are available (e.g. 250, 500, 1241), use lr=0.01 and weight decay=0.005 for ResNet18 and ResNet50, but lr=0.01 and weight decay=0.0005 for DVE (DVE performs much better with WD=0.0005 than with WD=0.05 or 0.005).
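
The hyperparameter rule in the note can be written as a small lookup. This is a minimal sketch: `bird_regressor_hparams` and the backbone labels are hypothetical names chosen here, and only the learning rate and weight decay values come from the note.

```python
def bird_regressor_hparams(num_annotations, backbone):
    """Hypothetical helper: return (learning_rate, weight_decay) per the note above.

    backbone is one of "resnet18", "resnet50", "dve" (labels chosen for this sketch).
    """
    backbone = backbone.lower()
    if backbone not in {"resnet18", "resnet50", "dve"}:
        raise ValueError(f"unknown backbone: {backbone}")
    lr = 0.01  # the learning rate is the same in every setting
    if num_annotations <= 100:
        # Few annotations: strong weight decay for all backbones.
        return lr, 0.05
    # More annotations: DVE uses a smaller weight decay than the ResNets.
    return lr, (0.0005 if backbone == "dve" else 0.005)


print(bird_regressor_hparams(50, "resnet50"))
print(bird_regressor_hparams(1241, "dve"))
```

The returned values map onto the `--learning_rate` and `--weight_decay` flags of eval_animal.py shown in the command above.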

2. Landmark matching

CUDA_VISIBLE_DEVICES=0,1 python train_feature_projector.py --model resnet50 --feat_distill --image_crop 20 --image_size 136 --train_layer 4 --val_layer 4 --trained_model_path /path/to/pretrained_moco  --log_path /path/to/logfile.log --model_name feature_projector --model_path /path/to/save/tmpfile --train_use_hypercol --val_use_hypercol  --train_out_size 24 --val_out_size 96 --distill_mode softmax --kernel_size 1 --out_dim 128 --softargmax_mul 7. --temperature 7. --evaluation_mode --trained_feat_model_path /path/to/pretrained-feature-projector --visualize_matching --vis_path /path/to/save/visualization


Pretrained models

Download the pretrained models

  1. CelebA: [MoCo-ResNet18-CelebA] [MoCo-ResNet50-CelebA] [MoCo-ResNet50-CelebA-In-the-Wild]
  2. iNat Aves: [MoCo-ResNet18-iNat] [MoCo-ResNet50-iNat] [DVE-Hourglass-iNat]

Note: on the face benchmarks, the numbers in Table 1 of the main text are reported at the 120th, 45th, and 80th epoch for MAFL, AFLW, and 300W, with epochs indexed from 0. However, the epoch index started from 1 when we saved the models. As a result, the saved models give scores that differ from those in Table 1 (either slightly better or slightly worse).

The feature projectors are trained under different network architectures (e.g. ResNet18, ResNet50, ResNet50-half) and pretraining methods (e.g. MoCo, ImageNet, random initialization). These settings correspond to Table 4 and Table 5 in the supplementary material.

Run pretrained landmark detectors

After downloading the pretrained models, run the following commands to evaluate and visualize the pretrained models.

  1. Face benchmarks:
CUDA_VISIBLE_DEVICES=0 python vis_face.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path $pretrained_MOCO_FACE --batch_size 32 --log_path $log_file --dataset AFLW --image_crop 20 --image_size 136 --ckpt_path $pretrained_AFLW_R  --vis_path $visdir --use_hypercol --vis_keypoints
  2. Bird benchmarks:
CUDA_VISIBLE_DEVICES=0 python vis_animal.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path $pretrained_MOCO_inat --batch_size 32 --log_path $log_file --dataset CUB --image_crop 0 --image_size 96 --ckpt_path $pretrained_CUB  --vis_path $visdir --use_hypercol --vis_keypoints

Visualize the PCA projection of hypercolumn representation

  1. Face benchmarks:
CUDA_VISIBLE_DEVICES=0 python vis_face.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path $pretrained_MOCO_FACE --batch_size 32 --log_path $log_file --dataset MAFLAligned --image_crop 20 --image_size 136 --vis_path $visdir --use_hypercol --vis_PCA
  2. Bird benchmarks:
CUDA_VISIBLE_DEVICES=0 python vis_animal.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path $pretrained_MOCO_inat --batch_size 32 --log_path $log_file --dataset CUB --image_crop 0 --image_size 96   --vis_path $visdir --use_hypercol --vis_PCA


If you use this code for your research, please cite the following papers.

title={On Equivariant and Invariant Learning of Object Landmark Representations},
author={Cheng, Zezhou and Su, Jong-Chyi and Maji, Subhransu},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},