<div align="center">

Integrally Pre-Trained Transformer Pyramid Networks

Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration

[CVPR2023/TPAMI2024]

(A Simple Hierarchical Vision Transformer Meets Masked Image Modeling)

<p align="center"> <img src="assets/framework.png" alt="iTPN" width="90%"> </p> <p align="center"> Figure 1: The comparison between a conventional pre-training (left) and the proposed integral pre-training framework (right). We use a feature pyramid as the unified neck module and apply masked feature modeling for pre-training the feature pyramid. The green and red blocks indicate that the network weights are pre-trained and un-trained (i.e., randomly initialized for fine-tuning), respectively. </p> </div>

Updates

11/Jul./2024

Fast-iTPN is accepted by TPAMI2024.

08/Jan./2024

The Fast-iTPN preprint is available on arXiv. Fast-iTPN is a more powerful version of iTPN.

26/Dec./2023

| model | Para. (M) | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K | checkpoint | checkpoint (21K) |
|---|---|---|---|---|---|---|---|---|
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 224/16 | N | 85.1% | baidu/google | |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 384/16 | N | 86.2% | | |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 512/16 | N | 86.5% | | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 224/16 | N | 86.4% | baidu/google | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 384/16 | N | 86.95% | | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 512/16 | N | 87.8% | | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 224/16 | N | 87.4% | baidu/google | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | N | 88.5% | | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | Y | 88.75% | | baidu/google |
| Fast-iTPN-L | 312 | IN.1K | CLIP-L | 640/16 | N | 89.5% | baidu/google | |

All the pre-trained Fast-iTPN models are available now (password: itpn)! To our knowledge, the tiny/small/base-scale models report the best performance on ImageNet-1K at their respective scales. Use them for your own tasks! See Details.
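A downloaded checkpoint can be loaded into a backbone for fine-tuning. The sketch below is illustrative, not the repo's own loading code: the model is a stand-in for a Fast-iTPN backbone, the file name is hypothetical, and the `"model"` key nesting is an assumption (released checkpoints may use a different layout).

```python
import torch
import torch.nn as nn

# Stand-in for a Fast-iTPN backbone; the real model classes live in this repo.
model = nn.Sequential(nn.Linear(768, 768), nn.Linear(768, 1000))

# Hypothetical checkpoint path; released files use their own names and may
# nest weights under a key such as 'model' or 'module'.
ckpt_path = "fast_itpn_tiny_ckpt.pt"
torch.save({"model": model.state_dict()}, ckpt_path)  # simulate a downloaded file

checkpoint = torch.load(ckpt_path, map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)
# strict=False tolerates keys that differ between pre-training and
# fine-tuning (e.g., a randomly re-initialized classification head).
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(len(missing), len(unexpected))
```

`strict=False` is the usual choice here because pre-trained checkpoints omit the task head, which is re-initialized for fine-tuning.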

30/May/2023

| model | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K |
|---|---|---|---|---|---|
| EVA-02-B | IN.21K | EVA-CLIP-g | 196/14 | N | 87.0% |
| EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | N | 88.3% |
| EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | Y | 88.6% |
| Fast-iTPN-B | IN.1K | CLIP-L | 224/16 | N | 87.4% |
| Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | N | 88.5% |
| Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | Y | 88.7% |

The Fast-iTPN models above are pre-trained only on ImageNet-1K; these models will be available soon.

29/May/2023

The iTPN-L-CLIP/16 intermediate fine-tuned models are available (password: itpn): one fine-tuned on ImageNet-21K, and one further fine-tuned on ImageNet-1K. Evaluating the latter on ImageNet-1K yields 89.2% top-1 accuracy.

28/Feb./2023

iTPN is accepted by CVPR2023!

08/Feb./2023

The iTPN-L-CLIP/16 model reaches 89.2% fine-tuning performance on ImageNet-1K.

Configuration: intermediate fine-tuning on ImageNet-21K with 384×384 input.

21/Jan./2023

Our HiViT is accepted by ICLR2023!

HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer

08/Dec./2022

Get checkpoints (password: abcd):

| | iTPN-B-pixel | iTPN-B-CLIP | iTPN-L-pixel | iTPN-L-CLIP/16 |
|---|---|---|---|---|
| baidu drive | download | download | download | download |
| google drive | download | download | download | download |

25/Nov./2022

The preprint is available on arXiv.

Requirements

Dataset

Get Started

Prepare the environment:

conda create --name itpn python=3.8 -y
conda activate itpn

git clone git@github.com:sunsmarterjie/iTPN.git
cd iTPN

pip install torch==1.7.1+cu102 torchvision==0.8.2+cu102 -f https://download.pytorch.org/whl/torch_stable.html
pip install timm==0.3.2 tensorboard einops

iTPN supports pre-training with either raw pixels or CLIP features as supervision. For the latter, please first download the CLIP models (we use the CLIP-B/16 and CLIP-L/14 models in the paper).
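With CLIP as supervision, pre-training regresses teacher features on masked tokens. The following is a minimal sketch of that idea, not the repo's training code: the teacher and student are tiny randomly initialized stand-ins for the frozen CLIP encoder and the iTPN backbone, and the per-token normalization and smooth-L1 loss are common choices that may differ from the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins: in the real pipeline the teacher is a frozen CLIP visual
# encoder and the student is the iTPN backbone plus feature-pyramid neck.
dim, num_tokens, batch = 64, 196, 2
teacher = nn.Linear(dim, dim).eval()
student = nn.Linear(dim, dim)
for p in teacher.parameters():
    p.requires_grad_(False)          # the teacher stays frozen

tokens = torch.randn(batch, num_tokens, dim)   # patch embeddings
mask = torch.rand(batch, num_tokens) < 0.75    # mask ~75% of tokens

with torch.no_grad():
    target = teacher(tokens)                   # teacher features as targets

pred = student(tokens)
# Masked feature modeling: regress teacher features on masked tokens only,
# after per-token layer normalization of both sides.
loss = F.smooth_l1_loss(
    F.layer_norm(pred[mask], (dim,)),
    F.layer_norm(target[mask], (dim,)),
)
loss.backward()
print(float(loss))
```

Only the student receives gradients; the frozen teacher simply defines the regression targets, which is what makes CLIP a "supervision" signal rather than a jointly trained module.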

Main Results

<p align="center"> <img src="assets/ft_in1k.jpg" alt="iTPN" width="40%"> </p> <p align="center"> Table 1: Top-1 classification accuracy (%) by fine-tuning the pre-trained models on ImageNet-1K. We compare models of different levels and supervisions (e.g., with and without CLIP) separately. </p> <p align="center"> <img src="assets/ft_coco_ade.jpg" alt="iTPN" width="70%"> </p> <p align="center"> Table 2: Visual recognition results (%) on COCO and ADE20K. Mask R-CNN (abbr. MR, 1x/3x) and Cascade Mask R-CNN (abbr. CMR, 1x) are used on COCO, and UPerHead with 512x512 input is used on ADE20K. For the base-level models, each cell of COCO results contains object detection (box) and instance segmentation (mask) APs. For the large-level models, the accuracy of 1x Mask R-CNN surpasses all existing methods. </p>

License

iTPN is released under the license found in the LICENSE file of this repository.

Citation

@article{tian2024fast,
  title={Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration},
  author={Tian, Yunjie and Xie, Lingxi and Qiu, Jihao and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}
@inproceedings{tian2023integrally,
  title={Integrally pre-trained transformer pyramid networks},
  author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18610--18620},
  year={2023}
}
@inproceedings{zhang2023hivit,
  title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
  author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
  booktitle={International Conference on Learning Representations},
  year={2023}
}