Introduction

This repository is the official object detection and instance segmentation implementation of Contextual Transformer Networks for Visual Recognition.

CoT is a unified self-attention building block that acts as an alternative to standard convolutions in a ConvNet. As a result, it is feasible to replace convolutions with their CoT counterparts to strengthen vision backbones with contextualized self-attention.

<p align="center"> <img src="images/framework.jpg" width="800"/> </p>
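
To make the block concrete, below is a minimal PyTorch sketch of a CoT-style block following the paper's description: a 3x3 grouped convolution over the keys produces the static context, two consecutive 1x1 convolutions on the concatenated [static context, query] produce an attention map, and the attended values form the dynamic context that is fused with the static one. This is a simplified illustration rather than the official implementation; the class name `CoTBlock`, the channel-wise softmax aggregation, and the reduction `factor` are assumptions made here for brevity.

```python
import torch
import torch.nn as nn

class CoTBlock(nn.Module):
    """Simplified sketch of a Contextual Transformer (CoT) style block.

    Not the official implementation: the real block aggregates values over a
    k x k local grid with multi-head attention; here the aggregation is
    reduced to a channel-wise softmax gate to keep the sketch short.
    """

    def __init__(self, dim, kernel_size=3, factor=4):
        super().__init__()
        padding = kernel_size // 2
        # Static context K^1: k x k grouped convolution over the keys.
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=padding, groups=4, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Value embedding V: 1 x 1 convolution.
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        # Attention from [K^1, Q]: two consecutive 1 x 1 convolutions.
        self.attn_embed = nn.Sequential(
            nn.Conv2d(2 * dim, 2 * dim // factor, 1, bias=False),
            nn.BatchNorm2d(2 * dim // factor),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * dim // factor, dim, 1),
        )

    def forward(self, x):
        k1 = self.key_embed(x)                              # static context
        v = self.value_embed(x)                             # values
        attn = self.attn_embed(torch.cat([k1, x], dim=1))   # attention from [K^1, Q]
        k2 = attn.softmax(dim=1) * v                        # simplified dynamic context
        return k1 + k2                                      # fuse static and dynamic contexts
```

Since the block preserves the spatial size and channel count (e.g., `CoTBlock(64)` maps a `(2, 64, 56, 56)` tensor to a tensor of the same shape), it can be dropped into a ResNet stage in place of a 3x3 convolution, which mirrors how the CoT backbones used below are built.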

Usage

Requirements: this code is built on top of detectron2 and uses the COCO dataset (see the Train section below).

Clone the repository:

git clone https://github.com/JDAI-CV/CoTNet-ObjectDetection-InstanceSegmentation.git

Train

First, download the COCO dataset. Then copy this code into detectron2 and build detectron2. To train CoTNet-50 on a single node with 8 GPUs:

python3 tools/train_net.py --num-gpus 8 --config-file configs/ObjectDetection/Faster-RCNN/CoTNet-50/faster_rcnn_CoT_50_FPN_1x.yaml

The training configs for CoTNet (e.g., CoTNet-50) can be found in the configs folder.
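
If you prefer to launch training from Python instead of the command line, the following is a minimal sketch using detectron2's standard `DefaultTrainer` and `launch` utilities. It assumes the CoTNet code has already been copied into detectron2 and built as described above, and that the config file only uses keys detectron2 recognizes; the output directory is a placeholder chosen here.

```python
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer, launch

CONFIG = "configs/ObjectDetection/Faster-RCNN/CoTNet-50/faster_rcnn_CoT_50_FPN_1x.yaml"

def main():
    cfg = get_cfg()
    cfg.merge_from_file(CONFIG)
    cfg.OUTPUT_DIR = "./output/cotnet50_faster_rcnn"  # placeholder output directory
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()

if __name__ == "__main__":
    # Single node, 8 GPUs, mirroring the command above.
    launch(main, num_gpus_per_machine=8, num_machines=1, machine_rank=0, dist_url="auto")
```

When training with fewer GPUs, the usual detectron2 practice is to scale `cfg.SOLVER.IMS_PER_BATCH` and `cfg.SOLVER.BASE_LR` down proportionally.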

The pre-trained CoTNet models for Object Detection and Instance Segmentation can be downloaded here.
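
Once a checkpoint is downloaded, it can be evaluated with detectron2's standard evaluation utilities. The sketch below loads the training config, restores the weights, and runs COCO evaluation on the first dataset in `cfg.DATASETS.TEST`; the checkpoint path and output directory are placeholders, and the CoTNet code must already be built inside detectron2 as described above.

```python
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.data import build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.modeling import build_model

cfg = get_cfg()
cfg.merge_from_file("configs/ObjectDetection/Faster-RCNN/CoTNet-50/faster_rcnn_CoT_50_FPN_1x.yaml")
cfg.MODEL.WEIGHTS = "/path/to/downloaded/checkpoint.pth"  # placeholder path

# Build the model and load the pre-trained weights.
model = build_model(cfg)
DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)

# Run COCO-style evaluation on the test split defined in the config.
dataset_name = cfg.DATASETS.TEST[0]
evaluator = COCOEvaluator(dataset_name, output_dir="./eval_output")  # placeholder output dir
val_loader = build_detection_test_loader(cfg, dataset_name)
print(inference_on_dataset(model, val_loader, evaluator))
```

The same evaluation can also be run from the command line with `tools/train_net.py --eval-only` together with a `MODEL.WEIGHTS` override.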

Results on Object Detection task

Faster-RCNN

| Backbone | AP | AP50 | AP75 | APs | APm | APl | config file |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CoTNet-50 | 43.50 | 64.84 | 47.53 | 26.36 | 47.54 | 56.49 | log/config |
| CoTNeXt-50 | 44.06 | 65.76 | 47.65 | 27.08 | 47.70 | 57.21 | log/config |
| SE-CoTNetD-50 | 43.96 | 65.20 | 48.25 | 27.71 | 47.05 | 56.51 | log/config |
| CoTNet-101 | 45.35 | 66.80 | 49.18 | 28.65 | 49.47 | 58.82 | log/config |
| CoTNeXt-101 | 46.10 | 67.50 | 50.22 | 29.44 | 49.84 | 59.26 | log/config |
| SE-CoTNetD-101 | 45.66 | 66.86 | 50.11 | 29.83 | 49.25 | 59.17 | log/config |

Cascade-RCNN

| Backbone | AP | AP50 | AP75 | APs | APm | APl | config file |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CoTNet-50 | 46.11 | 64.68 | 49.75 | 28.71 | 49.76 | 60.28 | log/config |
| CoTNeXt-50 | 46.79 | 65.54 | 50.53 | 29.74 | 50.49 | 61.04 | log/config |
| SE-CoTNetD-50 | 46.77 | 64.91 | 50.46 | 28.90 | 50.28 | 60.92 | log/config |
| CoTNet-101 | 48.19 | 67.00 | 52.17 | 30.00 | 52.32 | 62.87 | log/config |
| CoTNeXt-101 | 49.02 | 67.67 | 53.03 | 31.44 | 52.95 | 63.17 | log/config |
| SE-CoTNetD-101 | 49.02 | 67.78 | 53.15 | 31.26 | 52.76 | 63.29 | log/config |

Results on Instance Segmentation task

Mask-RCNN

| Backbone | AP(bb) | AP50(bb) | AP75(bb) | AP(mk) | AP50(mk) | AP75(mk) | config file |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CoTNet-50 | 44.06 | 64.99 | 48.29 | 39.28 | 62.12 | 42.17 | log/config |
| CoTNeXt-50 | 44.47 | 65.74 | 48.71 | 39.62 | 62.70 | 42.35 | log/config |
| SE-CoTNetD-50 | 44.16 | 65.26 | 48.32 | 39.38 | 62.18 | 42.23 | log/config |
| CoTNet-101 | 46.17 | 67.17 | 50.63 | 40.86 | 64.18 | 43.64 | log/config |
| CoTNeXt-101 | 46.66 | 67.70 | 50.90 | 41.21 | 64.45 | 44.27 | log/config |
| SE-CoTNetD-101 | 46.67 | 67.85 | 51.30 | 41.53 | 64.92 | 44.69 | log/config |

Cascade-Mask-RCNN

| Backbone | AP(bb) | AP50(bb) | AP75(bb) | AP(mk) | AP50(mk) | AP75(mk) | config file |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CoTNet-50 | 46.94 | 65.36 | 50.69 | 40.25 | 62.37 | 43.38 | log/config |
| CoTNeXt-50 | 47.63 | 65.93 | 51.64 | 40.76 | 63.32 | 44.01 | log/config |
| SE-CoTNetD-50 | 47.44 | 65.93 | 51.27 | 40.73 | 63.22 | 44.09 | log/config |
| CoTNet-101 | 48.97 | 67.42 | 53.10 | 41.98 | 64.81 | 45.39 | log/config |
| CoTNeXt-101 | 49.35 | 67.88 | 53.53 | 42.20 | 65.00 | 45.69 | log/config |
| SE-CoTNetD-101 | 49.24 | 67.45 | 53.36 | 42.38 | 64.79 | 45.89 | log/config |

Citing Contextual Transformer Networks

@article{cotnet,
  title={Contextual Transformer Networks for Visual Recognition},
  author={Li, Yehao and Yao, Ting and Pan, Yingwei and Mei, Tao},
  journal={arXiv preprint arXiv:2107.12292},
  year={2021}
}

Acknowledgements

Thanks to the contributions of timm and the awesome PyTorch team.