
Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art

To have your transformer-based object detector added to the tables below, please email us the values for each column together with a copy of the paper reporting the results.

Email: aref.mirirekavandi@gmail.com

Taxonomy

Taxonomy of small object detection using transformers, with popular object detection methods assigned to each category.

Datasets


Generic Applications (MS COCO) (Last Update: 15/06/2023)

Detection performance for small-scale objects on the MS COCO image dataset (eval). DC5: Dilated C5 stage, MS: Multi-scale network, IBR: Iterative bounding box refinement, TS: Two-stage detection, DCN: Deformable ConvNets, TTA: Test-time augmentation, BD: Pre-trained on the BigDetection dataset, IN: Pre-trained on ImageNet, OB: Pre-trained on Objects365. $*$ marks results on COCO test-dev.

| Model | Backbone | GFLOPS/FPS | #params | $\text{mAP}^{@[0.5,0.95]}$ | Epochs | URL |
|---|---|---|---|---|---|---|
| Faster RCNN-DC5 (NeurIPS2015) | ResNet50 | 320/16 | 166M | 21.4 | 37 | https://github.com/trzy/FasterRCNN |
| Faster RCNN-FPN (NeurIPS2015) | ResNet50 | 180/26 | 42M | 24.2 | 37 | https://github.com/trzy/FasterRCNN |
| Faster RCNN-FPN (NeurIPS2015) | ResNet101 | 246/20 | 60M | 25.2 | -- | https://github.com/trzy/FasterRCNN |
| RepPoints v2-DCN-MS (NeurIPS2020) | ResNeXt101 | --/-- | -- | 34.5* | 24 | https://github.com/Scalsol/RepPointsV2 |
| FCOS (ICCV2019) | ResNet50 | 177/17 | -- | 26.2 | 36 | https://github.com/tianzhi0549/FCOS |
| CBNet V2-DCN (TIP2022) | Res2Net101 | --/-- | 107M | 35.7* | 20 | https://github.com/VDIGPKU/CBNetV2 |
| CBNet V2-DCN (Cascade RCNN) (TIP2022) | Res2Net101 | --/-- | 146M | 37.4* | 32 | https://github.com/VDIGPKU/CBNetV2 |
| DETR (ECCV2020) | ResNet50 | 86/28 | 41M | 20.5 | 500 | https://github.com/facebookresearch/detr |
| DETR-DC5 (ECCV2020) | ResNet50 | 187/12 | 41M | 22.5 | 500 | https://github.com/facebookresearch/detr |
| DETR (ECCV2020) | ResNet101 | 152/20 | 60M | 21.9 | -- | https://github.com/facebookresearch/detr |
| DETR-DC5 (ECCV2020) | ResNet101 | 253/10 | 60M | 23.7 | -- | https://github.com/facebookresearch/detr |
| ViT-FRCNN (arXiv2020) | -- | --/-- | -- | 17.8 | -- | -- |
| RelationNet++ (NeurIPS2020) | ResNeXt101 | --/-- | -- | 32.8* | -- | https://github.com/microsoft/RelationNet2 |
| RelationNet++-MS (NeurIPS2020) | ResNeXt101 | --/-- | -- | 35.8* | -- | https://github.com/microsoft/RelationNet2 |
| Deformable DETR (ICLR2021) | ResNet50 | 173/19 | 40M | 26.4 | 50 | https://github.com/fundamentalvision/Deformable-DETR |
| Deformable DETR-IBR (ICLR2021) | ResNet50 | 173/19 | 40M | 26.8 | 50 | https://github.com/fundamentalvision/Deformable-DETR |
| Deformable DETR-TS (ICLR2021) | ResNet50 | 173/19 | 40M | 28.8 | 50 | https://github.com/fundamentalvision/Deformable-DETR |
| Deformable DETR-TS-IBR-DCN (ICLR2021) | ResNeXt101 | --/-- | -- | 34.4* | -- | https://github.com/fundamentalvision/Deformable-DETR |
| Dynamic DETR (ICCV2021) | ResNet50 | --/-- | -- | 28.6* | -- | -- |
| Dynamic DETR-DCN (ICCV2021) | ResNeXt101 | --/-- | -- | 30.3* | -- | -- |
| TSP-FCOS (ICCV2021) | ResNet101 | 255/12 | -- | 27.7 | 36 | https://github.com/Edward-Sun/TSP-Detection |
| TSP-RCNN (ICCV2021) | ResNet101 | 254/9 | -- | 29.9 | 96 | https://github.com/Edward-Sun/TSP-Detection |
| Mask R-CNN (ICCV2021) | Conformer-S/16 | 457/-- | 56.9M | 28.7 | 12 | https://github.com/pengzhiliang/Conformer |
| Conditional DETR-DC5 (ICCV2021) | ResNet101 | 262/-- | 63M | 27.2 | 108 | https://github.com/Atten4Vis/ConditionalDETR |
| SOF-DETR (2022JVCIR) | ResNet50 | --/-- | -- | 21.7 | -- | https://github.com/shikha-gist/SOF-DETR/ |
| DETR++ (arXiv2022) | ResNet50 | --/-- | -- | 22.1 | -- | -- |
| TOLO-MS (NCA2022) | -- | --/57 | -- | 24.1 | -- | -- |
| Anchor DETR-DC5 (AAAI2022) | ResNet101 | --/-- | -- | 25.8 | 50 | https://github.com/megvii-research/AnchorDETR |
| DESTR-DC5 (CVPR2022) | ResNet101 | 299/-- | 88M | 28.2 | 50 | -- |
| Conditional DETR v2-DC5 (arXiv2022) | ResNet101 | 228/-- | 65M | 26.3 | 50 | -- |
| Conditional DETR v2 (arXiv2022) | Hourglass48 | 521/-- | 90M | 32.1 | 50 | -- |
| FP-DETR-IN (ICLR2022) | -- | --/-- | 36M | 26.5 | 50 | https://github.com/encounter1997/FP-DETR |
| DAB-DETR-DC5 (arXiv2022) | ResNet101 | 296/-- | 63M | 28.1 | 50 | https://github.com/IDEA-Research/DAB-DETR |
| Ghostformer-MS (Sensors2022) | GhostNet | --/-- | -- | 29.2 | 100 | -- |
| CF-DETR-DCN-TTA (AAAI2022) | ResNeXt101 | --/-- | -- | 35.1* | -- | -- |
| CBNet V2-TTA (CVPR2022) | Swin Transformer-base | --/-- | -- | 41.7 | -- | https://github.com/amazon-science/bigdetection |
| CBNet V2-TTA-BD (CVPR2022) | Swin Transformer-base | --/-- | -- | 42.2 | -- | https://github.com/amazon-science/bigdetection |
| DETA (arXiv2022) | ResNet50 | --/13 | 48M | 34.3 | 24 | https://github.com/jozhang97/DETA |
| DINO (arXiv2022) | ResNet50 | 860/10 | 47M | 32.3 | 12 | https://github.com/IDEA-Research/DINO |
| CO-DINO Deformable DETR-MS-IN (arXiv2022) | Swin Transformer-large | --/-- | -- | 43.7 | 36 | https://github.com/Sense-X/Co-DETR |
| HYNETER (ICASSP2023) | Hyneter-Max | --/-- | 247M | 29.8* | -- | -- |
| DeoT (JRTIP2023) | ResNet101 | 217/14 | 58M | 31.4 | 34 | -- |
| ConformerDet-MS (TPAMI2023) | Conformer-B | --/-- | 147M | 35.3 | 36 | https://github.com/pengzhiliang/Conformer |
| YOLOS (NeurIPS2021) | DeiT-base | --/3.9 | 100M | 19.5 | 150 | https://github.com/hustvl/YOLOS |
| DETR(ViT) (arXiv2021) | Swin Transformer-base | --/9.7 | 100M | 18.3 | 50 | https://github.com/naver-ai/vidt |
| Deformable DETR(ViT) (arXiv2021) | Swin Transformer-base | --/4.8 | 100M | 34.5 | 50 | https://github.com/naver-ai/vidt |
| ViDT (arXiv2022) | Swin Transformer-base | --/9 | 100M | 30.6 | 50 | https://github.com/naver-ai/vidt/tree/main |
| DFFT (ECCV2022) | DOT-medium | 67/-- | -- | 25.5 | 36 | https://github.com/PeixianChen/DFFT |
| CenterNet++-MS (arXiv2022) | Swin Transformer-large | --/-- | -- | 38.7* | -- | https://github.com/Duankaiwen/PyCenterNet |
| DETA-OB (arXiv2022) | Swin Transformer-large | --/4.2 | -- | 46.1* | 24 | https://github.com/jozhang97/DETA |
| Group DETR v2-MS-IN-OB (arXiv2022) | ViT-Huge | --/-- | 629M | 48.4* | -- | -- |
| Best Results | NA | DETR/TOLO | FP-DETR | Group DETR v2 | DINO | NA |
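The $\text{mAP}^{@[0.5,0.95]}$ figures above follow the COCO protocol: AP is averaged over the ten IoU thresholds 0.50:0.05:0.95, and the small-object breakdown only counts objects whose area is below $32^2$ pixels. A minimal sketch of those two ingredients, box IoU and the small-object filter (the helper names are illustrative, not from any surveyed codebase):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def is_small(box, max_area=32 ** 2):
    """COCO 'small' objects: area strictly below 32x32 pixels."""
    return (box[2] - box[0]) * (box[3] - box[1]) < max_area

# The ten IoU thresholds averaged by mAP@[0.5,0.95]: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```

In practice these numbers are produced by `pycocotools`' `COCOeval`, which applies the same area range (`areaRng` for "small") and threshold grid.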

Small Object Detection in Aerial Images (DOTA) (Last Update: 15/06/2023)

Detection performance for objects on DOTA image dataset. MS: Multi-scale network, FT: Fine-tuned, FPN: Feature pyramid network, IN: Pre-trained on ImageNet.

| Model | Backbone | FPS | #params | mAP | Epochs | URL |
|---|---|---|---|---|---|---|
| Rotated Faster RCNN-MS (NeurIPS2015) | ResNet101 | -- | 64M | 67.71 | 50 | https://github.com/open-mmlab/mmrotate/tree/main/configs/rotated_faster_rcnn |
| SSD (ECCV2016) | -- | -- | -- | 56.1 | -- | https://github.com/pierluigiferrari/ssd_keras |
| RetinaNet-MS (ICCV2017) | ResNet101 | -- | 59M | 66.53 | 50 | https://github.com/DetectionTeamUCAS/RetinaNet_Tensorflow |
| ROI-Transformer-MS-IN (CVPR2019) | ResNet50 | -- | -- | 80.06 | 12 | https://github.com/open-mmlab/mmrotate/blob/main/configs/roi_trans/README.md |
| Yolov5 (2020) | -- | 95 | -- | 64.5 | -- | https://github.com/ultralytics/yolov5 |
| ReDet-MS-FPN (CVPR2021) | ResNet50 | -- | -- | 80.1 | -- | https://github.com/csuhan/ReDet |
| O2DETR-MS (arXiv2021) | ResNet101 | -- | 63M | 70.02 | 50 | -- |
| O2DETR-MS-FT (arXiv2021) | ResNet101 | -- | -- | 76.23 | 62 | -- |
| O2DETR-MS-FPN-FT (arXiv2021) | ResNet50 | -- | -- | 79.66 | -- | -- |
| SPH-Yolov5 (RS2022) | Swin Transformer-base | 51 | -- | 71.61 | 50 | -- |
| AO2-DETR-MS (TCSVT2022) | ResNet50 | -- | -- | 79.22 | -- | https://github.com/Ixiaohuihuihui/AO2-DETR |
| MDCT (RS2023) | -- | -- | -- | 75.7 | -- | -- |
| ReDet-MS-IN (arXiv2023) | ViTDet, ViT-B | -- | -- | 80.89 | 12 | https://github.com/csuhan/ReDet |
| Best Results | NA | Yolov5 | RetinaNet | ReDet-MS-IN | ReDet-MS-IN | NA |
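DOTA annotates objects with oriented bounding boxes, so the mAP values above are computed against rotated boxes rather than axis-aligned ones. A small sketch of the usual conversion from a (cx, cy, w, h, angle) parameterization to the four corner points that rotated-box IoU is built on; the helper name and the radians convention are assumptions, not DOTA's official tooling:

```python
import math

def obb_to_corners(cx, cy, w, h, angle):
    """Corner points of an oriented box, counter-clockwise, angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each local corner by `angle`, then translate to the box center.
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]
```

With `angle = 0` this reduces to an ordinary axis-aligned box; rotated IoU is then the area of intersection of the two corner polygons over the area of their union.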

Small Object Detection in Medical Images (DeepLesion) (Last Update: 15/06/2023)

Detection performance for DeepLesion CT image dataset.

| Model | Accuracy | $\text{mAP}^{0.5}$ |
|---|---|---|
| Faster RCNN (NeurIPS2015) | 83.3 | 83.3 |
| Yolov5 | 85.2 | 88.2 |
| DETR (ECCV2020) | 86.7 | 87.8 |
| Swin Transformer | 82.9 | 81.2 |
| MS Transformer (CIN2022) | 90.3 | 89.6 |
| Best Results | MS Transformer | MS Transformer |
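Unlike the averaged COCO metric, $\text{mAP}^{0.5}$ here is AP at the single IoU threshold 0.5, i.e. the area under the precision-recall curve of score-ranked detections. A toy all-point-interpolation sketch (the `average_precision` helper is illustrative, not taken from any surveyed paper's code):

```python
def average_precision(is_tp, num_gt):
    """AP from detections already sorted by descending confidence.

    is_tp[i] is True if detection i matched a ground-truth box (IoU >= 0.5);
    num_gt is the number of ground-truth objects.
    """
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_tp:
        tp += hit
        fp += not hit
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # All-point interpolation: make precision non-increasing from the right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Integrate precision over recall.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

For example, two confident detections that both match distinct ground truths out of two objects give AP = 1.0, while one hit and one miss give AP = 0.5.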

Small Object Detection in Active Milli-Meter Wave Images (AMWW) (Last Update: 15/06/2023)

Detection performance for AMWW image dataset.

| Model | Backbone | $\text{mAP}^{0.5}$ | $\text{mAP}^{@[0.5,0.95]}$ |
|---|---|---|---|
| Faster RCNN (NeurIPS2015) | ResNet50 | 70.7 | 26.83 |
| Cascade RCNN (CVPR2018) | ResNet50 | 74.7 | 27.8 |
| TridentNet (ICCV2019) | ResNet50 | 77.3 | 29.2 |
| Dynamic RCNN (ECCV2020) | ResNet50 | 76.3 | 27.6 |
| Yolov5 | ResNet50 | 76.67 | 28.48 |
| MATR (TCSVT2022) | ResNet50 | 82.16 | 33.42 |
| Best Results | NA | MATR | MATR |

Small Object Detection in Underwater Images (URPC2018) (Last Update: 15/06/2023)

Detection performance for URPC2018 dataset.

| Model | #params | $\text{mAP}^{@[0.5,0.95]}$ | $\text{mAP}^{0.5}$ |
|---|---|---|---|
| Faster RCNN (NeurIPS2015) | 33.6M | 16.4 | -- |
| Cascade RCNN (CVPR2018) | 68.9M | 16 | -- |
| Dynamic RCNN (ECCV2020) | 41.5M | 13.3 | -- |
| Yolov3 | 61.5M | 19.4 | -- |
| RoIMix (ICASSP2020) | -- | -- | 74.92 |
| HTDet (RS2023) | 7.7M | 22.8 | -- |
| Best Results | HTDet | HTDet | RoIMix |

Small Object Detection in Videos (ImageNet VID) (Last Update: 15/06/2023)

Detection performance for ImageNet VID dataset for small objects. PT: Pre-trained on MS COCO.

| Model | Backbone | $\text{mAP}^{@[0.5,0.95]}$ |
|---|---|---|
| Faster RCNN (NeurIPS2015)+SELSA | ResNet50 | 8.5 |
| Deformable-DETR-PT | ResNet50 | 10.5 |
| Deformable-DETR+TransVOD-PT | ResNet50 | 11 |
| DAB-DETR+FAQ-PT | ResNet50 | 12 |
| Deformable-DETR+FAQ-PT | ResNet50 | 13.2 |
| Best Results | NA | Deformable-DETR+FAQ |

Visual Results

Detection results on a sample image when zoomed in. First row, from left: input image, SSD, Faster RCNN, DETR. Second row, from left: ViDT, DETA-OB, DINO, CBNet V2.

Citations

If you find this page helpful, please cite the following survey papers:

@article{rekavandi2023transformers,
  title={Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art},
  author={Rekavandi Miri, Aref and Rashidi, Shima and Boussaid, Farid and Hoefs, Stephen and Akbas, Emre and Bennamoun, Mohammed},
  journal={arXiv preprint arXiv:2309.04902},
  year={2023}
}

@article{rekavandi2022guide,
  title={A Guide to Image and Video based Small Object Detection using Deep Learning: Case Study of Maritime Surveillance},
  author={Rekavandi Miri, Aref and Xu, Lian and Boussaid, Farid and Seghouane, Abd-Krim and Hoefs, Stephen and Bennamoun, Mohammed},
  journal={arXiv preprint arXiv:2207.12926},
  year={2022}
}