Localization Distillation for Object Detection
LD for horizontal bbox object detectors is available at https://github.com/HikariTJU/LD.
This repo is based on MMRotate.
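Analysis of LD on Zhihu (in Chinese): "Object Detection: Localization Distillation (LD, CVPR 2022)" and "Object Detection: Localization Distillation, the Sequel: Logit Distillation vs. Feature Distillation".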
This is the code for our paper:
@Inproceedings{zheng2022LD,
title={Localization Distillation for Dense Object Detection},
author={Zheng, Zhaohui and Ye, Rongguang and Wang, Ping and Ren, Dongwei and Zuo, Wangmeng and Hou, Qibin and Cheng, Ming-Ming},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={9407--9416},
year={2022}
}
@Article{zheng2023rotatedLD,
title={Localization Distillation for Object Detection},
author= {Zheng, Zhaohui and Ye, Rongguang and Hou, Qibin and Ren, Dongwei and Wang, Ping and Zuo, Wangmeng and Cheng, Ming-Ming},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2023},
volume={45},
number={8},
pages={10070-10083},
doi={10.1109/TPAMI.2023.3248583}}
[2021.3.30] LD is officially included in MMDetection V2. Many thanks to @jshilong, @Johnson-Wang and @ZwwWayne for helping migrate the code.
LD extends knowledge distillation to the localization task: it uses the learned bbox distributions to transfer localization dark knowledge from the teacher to the student.
LD consistently improves rotated detectors without adding any inference cost!
Introduction
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation rather than mimicking the classification logits, because logit mimicking is inefficient at distilling localization information. In this paper, we investigate whether logit mimicking always lags behind feature imitation. Towards this goal, we first present a novel localization distillation (LD) method that efficiently transfers localization knowledge from the teacher to the student. Second, we introduce the concept of a valuable localization region, which helps to selectively distill the classification and localization knowledge for a certain region. Combining these two new components, we show for the first time that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a critical reason why logit mimicking has underperformed for years. Thorough studies show the great potential of logit mimicking: it can significantly alleviate localization ambiguity, learn robust feature representations, and ease training difficulty in the early stage. We also provide the theoretical connection between the proposed LD and classification KD: they share an equivalent optimization effect. Our distillation scheme is simple and effective, and can be easily applied to both dense horizontal object detectors and rotated object detectors. Extensive experiments on the MS COCO, PASCAL VOC, and DOTA benchmarks demonstrate that our method achieves considerable AP improvement without any sacrifice in inference speed.

<img src="LD.png" height="220" align="middle"/>
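To make the idea of transferring localization knowledge through bbox distributions more concrete, here is a minimal sketch in plain Python. It assumes each box regression target is predicted as logits over a set of discrete bins, and that distillation minimizes the KL divergence between the temperature-softened teacher and student distributions. The function names, the temperature value, and the T^2 scaling are illustrative assumptions, not the code used in this repo.

```python
import math


def log_softmax(logits, temperature):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def ld_loss_one_target(student_logits, teacher_logits, temperature=10.0):
    """KL(teacher || student) over the bins of one regression target.

    temperature=10.0 is an illustrative default (assumption).
    """
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    # Multiplying by T^2 is a common KD convention (assumption here).
    return kl * temperature ** 2


def ld_loss_box(student_box, teacher_box, temperature=10.0):
    """Average the per-target loss over all regression targets of one box."""
    losses = [
        ld_loss_one_target(s, t, temperature)
        for s, t in zip(student_box, teacher_box)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical logits: 4 regression targets, 4 bins each.
    teacher = [[0.0, 4.0, 1.0, -2.0]] * 4
    student = [[1.0, 1.0, 1.0, 1.0]] * 4
    print(ld_loss_box(student, teacher))
```

The loss is zero when the student reproduces the teacher's distributions exactly and grows as they diverge, so the student is pushed toward the teacher's full bin distribution rather than only toward a single regressed value.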
Installation
Please refer to INSTALL.md for installation and dataset preparation. PyTorch 1.5.1 and cudatoolkit 10.1 are recommended.
Get Started
Please see GETTING_STARTED.md for the basic usage of MMDetection.
Data Preparation
Please refer to data_preparation.md to prepare the data.
Evaluation Tool
Move the file tests/val_set.txt to /yourpath/dataset/DOTAv1/.
Download DOTA_devkit (https://github.com/CAPTAIN-WHU/DOTA_devkit), the official evaluation tool for DOTA.
Replace dota_evaluation_task1.py in DOTA_devkit with our dota_evaluation_task1.py.
Open dota_evaluation_task1.py and modify detpath, annopath and imagesetfile to your own paths.
After running the test, run
python yourpath/DOTA_devkit-master/dota_evaluation_task1.py
AP, AP50, AP55, ... , AP95 will be printed in the terminal.
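As a reading aid for the printed metrics, the sketch below assumes that the overall AP is the COCO-style mean of the per-threshold values AP50, AP55, ..., AP95. That is an assumption about the script's output, not something confirmed here. The helper name and the numbers are made up for illustration and are not results from this repo.

```python
def mean_ap(ap_by_iou):
    """Average AP over IoU thresholds 50, 55, ..., 95 (in percent)."""
    thresholds = [50 + 5 * i for i in range(10)]
    missing = [t for t in thresholds if t not in ap_by_iou]
    if missing:
        raise ValueError(f"missing AP for IoU thresholds: {missing}")
    return sum(ap_by_iou[t] for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    # Made-up per-threshold values, for illustration only.
    example = {
        50: 60.0, 55: 56.0, 60: 52.0, 65: 46.0, 70: 40.0,
        75: 32.0, 80: 24.0, 85: 15.0, 90: 8.0, 95: 2.0,
    }
    print(f"AP = {mean_ap(example):.1f}")
```

Under this assumption, a low AP90 pulls the overall AP well below AP50, which matches the pattern of the numbers in the result tables.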
Convert model
If your trained model file is very large, convert it with publish_model.py:
python tools/model_converters/publish_model.py your_model.pth your_new_model.pth
Train a Teacher Model
You must first train a teacher model, and it must be of the general distribution kind, i.e. it must predict bbox distributions. The learning rate should be adapted to the number of GPUs.
./tools/dist_train.sh configs/gwd/rotated_retinanet_distribution_hbb_gwd_r34_fpn_2x_dota_oc.py 1
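One common way to adapt the learning rate to the number of GPUs is the linear scaling rule, which keeps the ratio of learning rate to total batch size constant. The sketch below assumes that rule applies here. The function name and the reference values are hypothetical, so check the config file for the learning rate and batch size it was actually tuned for.

```python
def scale_lr(base_lr, base_total_batch, num_gpus, samples_per_gpu):
    """Linear scaling rule: lr grows in proportion to the total batch size."""
    total_batch = num_gpus * samples_per_gpu
    return base_lr * total_batch / base_total_batch


if __name__ == "__main__":
    # Hypothetical reference: lr 0.005 tuned for a total batch of 2
    # (2 GPUs x 1 sample per GPU); now training on 1 GPU x 1 sample.
    print(scale_lr(base_lr=0.005, base_total_batch=2,
                   num_gpus=1, samples_per_gpu=1))
```

The result would then replace the learning rate in the optimizer settings of the config before launching the training command.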
Evaluation Results
DOTA-1.0 val
Rotated-RetinaNet, LD + KD
Teacher | Student | Training schedule | AP | AP50 | AP70 | AP90 | download |
---|---|---|---|---|---|---|---|
-- | R-18 | 1x | 33.7 | 58.0 | 42.3 | 4.7 | |
R-34 | R-18 | 1x | 39.1 | 63.8 | 48.8 | 8.8 | model |
GWD, LD + KD
Teacher | Student | Training schedule | AP | AP50 | AP70 | AP90 | download |
---|---|---|---|---|---|---|---|
-- | R-18 | 1x | 37.1 | 63.1 | 46.7 | 6.2 | |
R-34 | R-18 | 1x | 40.2 | 66.4 | 50.3 | 8.5 | model |
Note:
- The teacher detector adopts a 2x training schedule (24 epochs); the student detector adopts 1x (12 epochs). We use the DOTA-v1.0 train set for training and the val set for evaluation.
- We use 2 GPUs with a mini-batch size of 1 per GPU. We found that, even with the batch size fixed, single-GPU training produced higher AP than two-GPU training.
- On DOTA, we found that LD and classification KD are equally important and can improve a baseline such as R-RetinaNet by more than 3.5 AP. Combining LD and KD gives the highest AP.
Acknowledgments
Thanks to yangxue0827 for his help with data preparation and his excellent work on rotated object detection.