Feature Pyramid Transformer

Implementation for the paper: Feature Pyramid Transformer.

Contents

  1. Overview
  2. Requirements
  3. Data Preparation
  4. Pretrained Model
  5. Model Training
  6. Inference
  7. Citation
  8. Questions

Overview

Feature interactions across space and scales underpin modern visual recognition systems because they introduce beneficial visual contexts. Conventionally, spatial contexts are passively hidden in the CNN's increasing receptive fields or actively encoded by non-local convolution. Yet, the non-local spatial interactions are not across scales, and thus they fail to capture the non-local contexts of objects (or parts) residing in different scales. To this end, we propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT). It transforms any feature pyramid into another feature pyramid of the same size but with richer contexts, by using three specially designed transformers in self-level, top-down, and bottom-up interaction fashion. FPT serves as a generic visual backbone with fair computational overhead. We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks, using various backbones and head networks, and observe consistent improvement over all the baselines and the state-of-the-art methods.

<div align="center"> <img src="demos/screenshot_20200731170229.png" width="700px"/> <p> Overall structure of our proposed FPT. Different texture patterns indicate different feature transformers, and different colors represent feature maps of different scales. "Conv" denotes a 3 × 3 convolution with an output dimension of 256. Without loss of generality, the top/bottom layer feature maps have no rendering/grounding transformer.</p> </div>
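
The following is a minimal, self-contained PyTorch sketch intended only to make the three interaction directions concrete. It is not the official FPT implementation: the module and function names are ours, plain scaled dot-product attention stands in for the paper's self/grounding/rendering transformers, and the boundary handling is simplified.

import torch
import torch.nn as nn


def attend(query_map, context_map):
    # Scaled dot-product attention between two feature maps with equal channels.
    # query_map:   (B, C, Hq, Wq) -- the level being enriched
    # context_map: (B, C, Hc, Wc) -- the level providing context
    b, c, hq, wq = query_map.shape
    q = query_map.flatten(2).transpose(1, 2)        # (B, Hq*Wq, C)
    k = context_map.flatten(2)                      # (B, C, Hc*Wc)
    v = context_map.flatten(2).transpose(1, 2)      # (B, Hc*Wc, C)
    attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # (B, Hq*Wq, Hc*Wc)
    return (attn @ v).transpose(1, 2).reshape(b, c, hq, wq)


class ToyFPT(nn.Module):
    # Maps a feature pyramid to a pyramid of the same sizes but with richer contexts.
    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        # One 3x3 fusion conv per level (output dimension 256, as in the figure);
        # its input is the original map concatenated with the three interaction results.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(4 * channels, channels, kernel_size=3, padding=1)
             for _ in range(num_levels)]
        )

    def forward(self, pyramid):
        # pyramid: list of (B, C, H_i, W_i) tensors, ordered fine -> coarse
        outputs = []
        for i, x in enumerate(pyramid):
            self_ctx = attend(x, x)                                  # self-level interaction
            coarser = pyramid[i + 1] if i + 1 < len(pyramid) else x  # top-down context source
            finer = pyramid[i - 1] if i > 0 else x                   # bottom-up context source
            # Boundary levels simply reuse x; the paper instead skips the missing direction.
            top_down = attend(x, coarser)
            bottom_up = attend(x, finer)
            fused = torch.cat([x, self_ctx, top_down, bottom_up], dim=1)
            outputs.append(self.fuse[i](fused))
        return outputs


if __name__ == "__main__":
    feats = [torch.randn(1, 256, s, s) for s in (32, 16, 8, 4)]
    print([f.shape for f in ToyFPT()(feats)])  # spatial sizes are preserved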

Requirements

Data Preparation

Create a data folder under the repo root:

cd {repo_root}
mkdir data
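
The training and evaluation commands below assume COCO-style data under data/. A typical layout is shown below; the exact subfolder names are an assumption based on common Detectron-style repositories, so check config.py and the dataset catalog for the paths this repo actually expects.

data/
  coco/
    annotations/
      instances_train2017.json
      instances_val2017.json
    images/
      train2017/
      val2017/
  pretrained_model/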

Pretrained Model

ImageNet Pretrained Model from Caffe

Download them and put them into {repo_root}/data/pretrained_model.

If you want to use PyTorch pretrained models, remember to convert images from BGR to RGB, and use the same data preprocessing as the PyTorch pretrained models.
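
As a minimal sketch of that difference (the helper below is illustrative, not part of this repo): torchvision pretrained models expect RGB inputs scaled to [0, 1] and normalized with the ImageNet statistics below, whereas the Caffe weights expect BGR images with per-channel mean subtraction.

import numpy as np
import torch

# ImageNet statistics used by torchvision pretrained models (RGB order).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def bgr_to_pytorch_input(img_bgr: np.ndarray) -> torch.Tensor:
    # img_bgr: HxWx3 uint8 image in BGR order (e.g. loaded with OpenCV).
    img_rgb = img_bgr[:, :, ::-1].copy()                            # BGR -> RGB
    x = torch.from_numpy(img_rgb).permute(2, 0, 1).float() / 255.0  # CxHxW in [0, 1]
    return (x - IMAGENET_MEAN) / IMAGENET_STD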

ImageNet Pretrained Model from Detectron

NOTE: Caffe pretrained weights have slightly better performance than the PyTorch pretrained weights.

Model Training

Train from scratch

Take Mask R-CNN with a ResNet-50 backbone as an example.

python tools/train_net_step.py --dataset coco2017 --cfg configs/e2e_fptnet_R-50_mask.yaml --use_tfboard --bs {batch_size} --nw {num_workers}

Use --bs to overwrite the default batch size with a value that fits into your GPUs. Similarly, use --nw to overwrite the number of data loader threads, which defaults to 4 in config.py.

Specify --use_tfboard to log the losses to TensorBoard.
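
For example, a (hypothetical) run with a batch size of 8 and 4 data loader workers:

python tools/train_net_step.py --dataset coco2017 --cfg configs/e2e_fptnet_R-50_mask.yaml --use_tfboard --bs 8 --nw 4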

Finetune from a checkpoint

python tools/train_net_step.py ... --load_ckpt {path/to/the/checkpoint}

or using Detectron's checkpoint file

python tools/train_net_step.py ... --load_detectron {path/to/the/checkpoint}

Resume training with the same dataset and batch size

python tools/train_net_step.py ... --load_ckpt {path/to/the/checkpoint} --resume

When resuming training, the step count and optimizer state are also restored from the checkpoint. For the SGD optimizer, the optimizer state contains the momentum of each trainable parameter.

NOTE: --resume is not yet supported for --load_detectron

Set config options in command line

  python tools/train_net_step.py ... --no_save --set {config.name1} {value1} {config.name2} {value2} ...
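
For example, to lower the base learning rate from the command line (SOLVER.BASE_LR is a typical Detectron-style config key; check config.py for the names this repo actually defines):

python tools/train_net_step.py ... --no_save --set SOLVER.BASE_LR 0.005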

Show command line help messages

python tools/train_net_step.py --help

Inference

Evaluate the training results

For example, on the coco2017 val set:

python tools/test_net.py --dataset coco2017 --cfg configs/e2e_fptnet_R-50_mask.yaml --load_ckpt {path/to/your/checkpoint}

Results visualization

python tools/infer_simple.py --dataset coco --cfg configs/e2e_fptnet_R-50_mask.yaml --load_ckpt {path/to/your/checkpoint} --image_dir {dir/of/input/images}  --output_dir {dir/to/save/visualizations}

Citation

If our work is useful for your research, please consider citing:

@inproceedings{zhang2020fpt,
  author = {Dong Zhang and Hanwang Zhang and Jinhui Tang and Meng Wang and Xiansheng Hua and Qianru Sun},
  title = {Feature Pyramid Transformer},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2020}
}

Questions

If you have any questions, please contact dongzhang@njust.edu.cn.