
# Plain-DETR

By Yutong Lin*, Yuhui Yuan*, Zheng Zhang*, Chen Li, Nanning Zheng and Han Hu*

This repo is the official implementation of "DETR Does Not Need Multi-Scale or Locality Design".

## Introduction

We present an improved DETR detector that maintains a “plain” nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that re-introduce architectural inductive biases of multi-scale and locality into the decoder.

We show that two simple technologies are surprisingly effective within a plain design: 1) a box-to-pixel relative position bias (BoxRPB) term to guide each query to attend to the corresponding object region; 2) masked image modeling (MIM)-based backbone pre-training to help learn representation with fine-grained localization ability and to remedy dependencies on the multi-scale feature maps.
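To make the BoxRPB idea concrete, here is a minimal pure-Python sketch: for one query box it computes the relative distances from each feature-map pixel center to the box's four edges, then applies a toy inside/outside scoring rule in place of the small MLP that Plain-DETR actually learns on these offsets. The function names and the scoring rule are illustrative assumptions, not the repository's implementation.

```python
# Sketch of box-to-pixel relative position bias (BoxRPB), assuming
# normalized [0, 1] box coordinates (x1, y1, x2, y2).

def box_to_pixel_offsets(box, feat_w, feat_h):
    """Relative (left, top, right, bottom) offsets from each pixel
    center to the edges of `box`."""
    x1, y1, x2, y2 = box
    offsets = []
    for j in range(feat_h):
        for i in range(feat_w):
            cx = (i + 0.5) / feat_w   # pixel center, normalized
            cy = (j + 0.5) / feat_h
            offsets.append((cx - x1, cy - y1, x2 - cx, y2 - cy))
    return offsets

def inside_box_bias(offsets, scale=1.0):
    """Toy stand-in for the learned MLP: positive bias for pixels inside
    the box (all four offsets >= 0), negative outside, steering the
    query's cross-attention toward its object region."""
    return [scale if min(o) >= 0 else -scale for o in offsets]

box = (0.25, 0.25, 0.75, 0.75)        # a query box covering the image center
offsets = box_to_pixel_offsets(box, feat_w=4, feat_h=4)
bias = inside_box_bias(offsets)
print(sum(b > 0 for b in bias))       # the 4 central pixels of the 4x4 grid
```

In the actual model the 4-dimensional offsets are fed through an MLP to produce per-head bias terms added to the attention logits; this sketch only illustrates the geometry.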

## Main Results

| BoxRPB | MIM PT. | Reparam. | AP | Paper Position | CFG | CKPT |
|:------:|:-------:|:--------:|:--:|:--------------:|:---:|:----:|
|        |         |          | 37.2 | Tab2 Exp1 | cfg | ckpt |
| ✓      |         |          | 46.1 | Tab2 Exp2 | cfg | ckpt |
| ✓      | ✓       |          | 48.7 | Tab2 Exp5 | cfg | ckpt |
| ✓      | ✓       | ✓        | 50.9 | Tab2 Exp6 | cfg | ckpt |

## Installation

### Conda

```shell
# create conda environment
conda create -n plain_detr python=3.8 -y
conda activate plain_detr

# install pytorch (other versions may also work)
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt
```

### Docker

We have tested with the Docker image superbench/dev:cuda11.8. Other images may also work.

```shell
# run docker
sudo docker run -it -p 8022:22 -d --name=plain_detr --privileged --net=host --ipc=host --gpus=all -v /:/data superbench/dev:cuda11.8 bash
sudo docker exec -it plain_detr bash

# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt
```

## Usage

### Dataset preparation

Please download the COCO 2017 dataset and organize it as follows:

```
code_root/
└── data/
    └── coco/
        ├── train2017/
        ├── val2017/
        └── annotations/
            ├── instances_train2017.json
            └── instances_val2017.json
```
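Before launching training, it can be handy to verify this layout programmatically. The helper below is a hypothetical convenience script (not part of the repository) that checks the directories and annotation files above exist:

```python
# Hypothetical sanity check for the COCO layout expected by Plain-DETR.
from pathlib import Path

def check_coco_layout(code_root):
    """Return a list of missing paths; an empty list means the layout
    under code_root/data/coco matches the expected structure."""
    coco = Path(code_root) / "data" / "coco"
    required = [
        coco / "train2017",
        coco / "val2017",
        coco / "annotations" / "instances_train2017.json",
        coco / "annotations" / "instances_val2017.json",
    ]
    return [str(p) for p in required if not p.exists()]

if __name__ == "__main__":
    missing = check_coco_layout(".")
    if missing:
        print("missing:", *missing, sep="\n  ")
```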

### Pretrained models preparation

Please run the following script to download the supervised and masked-image-modeling pretrained models.

(We adopt Swin Transformer V2 as the default backbone. If you are interested in the pre-training, please refer to Swin Transformer V2 (paper, github) and SimMIM (paper, github) for more details.)

```shell
bash tools/prepare_pt_model.sh
```

## Training

### Training on a single node

```shell
GPUS_PER_NODE=<num gpus> ./tools/run_dist_launch.sh <num gpus> <path to config file>
```

### Training on multiple nodes

On each node, run the following script:

```shell
MASTER_ADDR=<master node IP address> GPUS_PER_NODE=<num gpus> NODE_RANK=<rank> ./tools/run_dist_launch.sh <num gpus> <path to config file>
```
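For clarity on how these variables interact, distributed launchers following the usual `torch.distributed` convention derive each process's global rank from the node rank and the local GPU index. The helper below is a hypothetical illustration of that convention, not code from `tools/run_dist_launch.sh`:

```python
# Illustrative only: the conventional mapping from (node rank, local GPU
# index) to a process's global rank in multi-node training.
def global_rank(node_rank, gpus_per_node, local_rank):
    """Global rank of the process running on GPU `local_rank`
    of node `node_rank`."""
    return node_rank * gpus_per_node + local_rank

# e.g. with 2 nodes x 8 GPUs, GPU 3 on node 1 is global rank 11
print(global_rank(1, 8, 3))  # prints 11
```

Node 0 (the one whose IP is passed as `MASTER_ADDR`) hosts ranks 0 through `GPUS_PER_NODE - 1` under this convention.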

## Evaluation

To evaluate a Plain-DETR model, please run the following script:

```shell
<path to config file> --eval --resume <path to plain-detr model>
```

You can also use `./tools/run_dist_launch.sh` to evaluate a model on multiple GPUs.

## Limitation & Discussion

### Known issues

## Citing Plain-DETR

If you find Plain-DETR useful in your research, please consider citing:

```bibtex
@inproceedings{lin2023detr,
  title={DETR Does Not Need Multi-Scale or Locality Design},
  author={Lin, Yutong and Yuan, Yuhui and Zhang, Zheng and Li, Chen and Zheng, Nanning and Hu, Han},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6545--6554},
  year={2023}
}
```