# imTED: Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection
<a href="https://openaccess.thecvf.com/content/ICCV2023/html/Liu_Integrally_Migrating_Pre-trained_Transformer_Encoder-decoders_for_Visual_Object_Detection_ICCV_2023_paper.html"><img src="https://img.shields.io/badge/ICCV2023-Paper-blue"></a>
<div align=center><img src="figs/Framework2.png"></div>

Code of our ICCV 2023 paper "Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection". A blog post in Chinese is available here.
The code is based on mmdetection. Please refer to `get_started.md` and `MMDET_README.md` to set up the environment and prepare the data.
## Config Files, Performance, and Trained Weights
We provide 9 configuration files in the configs directory.
| Config File | Backbone | Epochs | Box AP | Mask AP | Download |
|---|---|---|---|---|---|
| imted_faster_rcnn_vit_small_3x_coco | ViT-S | 36 | 48.2 | - | model |
| imted_faster_rcnn_vit_base_3x_coco | ViT-B | 36 | 52.9 | - | model |
| imted_faster_rcnn_vit_large_3x_coco | ViT-L | 36 | 55.4 | - | model |
| imted_mask_rcnn_vit_small_3x_coco | ViT-S | 36 | 48.7 | 42.7 | model |
| imted_mask_rcnn_vit_base_3x_coco | ViT-B | 36 | 53.3 | 46.4 | model |
| imted_mask_rcnn_vit_large_3x_coco | ViT-L | 36 | 55.5 | 48.1 | model |
| imted_faster_rcnn_vit_base_2x_base_training_coco | ViT-B | 24 | 50.6 | - | model |
| imted_faster_rcnn_vit_base_2x_finetuning_10shot_coco | ViT-B | 108 | 23.0 | - | model |
| imted_faster_rcnn_vit_base_2x_finetuning_30shot_coco | ViT-B | 108 | 30.4 | - | model |
## MAE Pre-training
The pre-trained models are trained with the official MAE code. For ViT-S, we use a 4-layer decoder with dimension 256 for 800 epochs of pre-training. For ViT-B and ViT-L, we use an 8-layer decoder with dimension 512 for 1600 epochs of pre-training; their pre-trained weights can be downloaded from the official MAE weights.
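Since the official MAE repository only ships ViT-B/L/H presets (all with an 8-layer, 512-dim decoder), the ViT-S setting above needs a custom model definition. Below is a minimal sketch, assuming the constructor arguments of `MaskedAutoencoderViT` in the official `models_mae.py`; the `mae_vit_small_patch16` helper and the choice of 8 decoder heads are our own assumptions, not part of this repo.

```python
from functools import partial

import torch.nn as nn

from models_mae import MaskedAutoencoderViT  # from the official MAE repository


def mae_vit_small_patch16(**kwargs):
    """Hypothetical ViT-S preset with the lighter 4-layer, 256-dim decoder.

    The official repo only defines ViT-B/L/H variants, so a helper like this
    must be added by hand (e.g. alongside the presets in models_mae.py).
    """
    return MaskedAutoencoderViT(
        patch_size=16, embed_dim=384, depth=12, num_heads=6,  # ViT-S encoder
        decoder_embed_dim=256, decoder_depth=4, decoder_num_heads=8,
        mlp_ratio=4, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
```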
## Last Step of Preparation
For all experiments, remember to modify the path of the pre-trained weights in the configuration file, e.g. `configs/imted/imted_faster_rcnn_vit_small_3x_coco.py`.
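For reference, the edit usually amounts to pointing one checkpoint field at the downloaded weights. A minimal sketch, assuming an mmdetection-style `pretrained` key on the backbone (the exact key and path in your copy of the config may differ):

```python
# Excerpt sketch of configs/imted/imted_faster_rcnn_vit_small_3x_coco.py.
# Field names are illustrative; search your config for the placeholder
# checkpoint path and replace it with the downloaded MAE weights.
model = dict(
    backbone=dict(
        pretrained='path/to/mae_pretrain_vit_small.pth',  # <- your MAE checkpoint
    ),
)
```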
For few-shot experiments, please refer to FsDet for data preparation. Remember to modify the JSON annotation paths in the configuration files, e.g. `configs/imted/few_shot/imted_faster_rcnn_vit_base_2x_base_training_coco.py`. The JSON files used for few-shot training and evaluation can also be downloaded from here.
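The annotation paths sit in the config's `data` dict. A minimal sketch, assuming mmdetection's usual `ann_file` keys; the file names below are placeholders for the splits prepared via FsDet:

```python
# Excerpt sketch of configs/imted/few_shot/imted_faster_rcnn_vit_base_2x_base_training_coco.py.
# Replace the placeholder names with your own few-shot JSON splits.
data = dict(
    train=dict(ann_file='path/to/fewshot/train_split.json'),
    val=dict(ann_file='path/to/fewshot/val_split.json'),
    test=dict(ann_file='path/to/fewshot/val_split.json'),
)
```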
## Evaluating with 1 GPU

```bash
tools/dist_test.sh "path/to/config/file.py" "path/to/trained/weights.pth" 1 --eval bbox
```
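Note: for the Mask R-CNN configs, mask AP can be reported as well by passing mmdetection's standard `--eval bbox segm` instead of `--eval bbox`.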
## Training with 8 GPUs

```bash
tools/dist_train.sh "path/to/config/file.py" 8
```
## Few-shot Training with 8 GPUs
### Base Training

```bash
tools/dist_train.sh configs/imted/few_shot/imted_faster_rcnn_vit_base_2x_base_training_coco.py 8
```
### Fine-tuning
Replace the checkpoint path with your own checkpoint from base training, or just use our provided checkpoint here.
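In mmdetection configs, the initial checkpoint is set through the top-level `load_from` field; a minimal sketch (the path is a placeholder):

```python
# In configs/imted/few_shot/imted_faster_rcnn_vit_base_2x_finetuning_30shot_coco.py:
# point load_from at your base-training checkpoint (or the provided one).
load_from = 'path/to/base_training_checkpoint.pth'
```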
```bash
tools/dist_train.sh configs/imted/few_shot/imted_faster_rcnn_vit_base_2x_finetuning_30shot_coco.py 8
```
## Acknowledgement

This project is based on MAE, mmdetection, and timm. Thanks for their wonderful work.
## Works Based on imTED
- Spatial Transform Decoupling for Oriented Object Detection
- AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation
- Proposal Distribution Calibration for Few-Shot Object Detection
## Citation

If you find imTED useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
```bibtex
@inproceedings{liu2023integrally,
  title={Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection},
  author={Liu, Feng and Zhang, Xiaosong and Peng, Zhiliang and Guo, Zonghao and Wan, Fang and Ji, Xiangyang and Ye, Qixiang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6825--6834},
  year={2023}
}
```