Stitched ViTs are Flexible Vision Backbones
This is the official PyTorch implementation for Stitched ViTs are Flexible Vision Backbones.
By Zizheng Pan, Jing Liu, Haoyu He, Jianfei Cai, and Bohan Zhuang.
We adapt the framework of stitchable neural networks (SN-Net) to downstream dense prediction tasks. Compared to SN-Netv1, the new framework consistently improves performance at low FLOPs while maintaining competitive performance at high FLOPs across different datasets, thus achieving a better Pareto frontier (highlighted by the connected lines in the figures below).
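To make the core idea concrete, below is a minimal sketch of how stitching routes a forward pass through two ViTs of different widths. It is illustrative only, not the official implementation; `StitchingLayer`, `stitched_forward`, and all dimensions are assumptions for exposition. See classification for the actual code.

```python
# Minimal sketch of the stitching idea: a learned linear layer maps activations
# from the small ViT's embedding space into the large ViT's, so one forward
# pass can start in a small model and finish in a large one.
import torch
import torch.nn as nn

class StitchingLayer(nn.Module):
    """Projects hidden states from the small ViT's width to the large ViT's width."""
    def __init__(self, small_dim: int, large_dim: int):
        super().__init__()
        self.proj = nn.Linear(small_dim, large_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def stitched_forward(x, small_blocks, large_blocks, stitch, cut_small, cut_large):
    # Run the first `cut_small` blocks of the small ViT ...
    for blk in small_blocks[:cut_small]:
        x = blk(x)
    # ... project into the large ViT's embedding space ...
    x = stitch(x)
    # ... and finish with the remaining blocks of the large ViT.
    for blk in large_blocks[cut_large:]:
        x = blk(x)
    return x

# Toy usage: "small" blocks of width 192, "large" blocks of width 384.
small_blocks = [nn.Linear(192, 192) for _ in range(4)]
large_blocks = [nn.Linear(384, 384) for _ in range(8)]
stitch = StitchingLayer(192, 384)
tokens = torch.randn(1, 16, 192)  # (batch, tokens, dim)
out = stitched_forward(tokens, small_blocks, large_blocks, stitch, cut_small=2, cut_large=4)
```

Varying the cut positions yields many stitches with different FLOPs/accuracy trade-offs from one shared set of weights.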
📰 News
- 23/01/2024. Released code for depth estimation on NYUv2; see the subfolder depth_estimation.
- 18/01/2024. The Hugging Face online demo for image classification is live! Check it out here.
- 18/01/2024. Released code for semantic segmentation on ADE20K and COCO-Stuff-10K; see the subfolder segmentation.
- 13/01/2024. Released code for ImageNet-1K classification 🔥. The classification code is an easy way to start understanding how SN-Netv2 works and how it differs from V1.
💪 Getting Started
For image classification on ImageNet-1K, please refer to classification.
For semantic segmentation on ADE20K and COCO-Stuff-10K, please refer to segmentation.
For depth estimation on NYUv2, please refer to depth_estimation.
🪄 Gradio Demo for Segmentation
First, install Gradio:
pip install gradio
Next, install the required packages listed in segmentation, then run the Gradio demo:
cd segmentation
python demo/video_demo_gradio.py --config [path/to/config] --checkpoint [path/to/checkpoint]
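For reference, a Gradio demo of this kind boils down to wrapping an inference function with `gr.Interface`. The sketch below is a hypothetical stand-in, not the contents of `demo/video_demo_gradio.py`; the `segment` function is a placeholder for the real stitched-backbone inference pipeline.

```python
import gradio as gr
import numpy as np

def segment(image: np.ndarray) -> np.ndarray:
    # Placeholder: run the stitched backbone + segmentation head here
    # and return the predicted mask overlaid on the input image.
    return image

# Image in, image out: Gradio handles upload, display, and the web UI.
demo = gr.Interface(fn=segment, inputs=gr.Image(), outputs=gr.Image())
demo.launch()
```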
✨ Results
Understanding the figures:
- Each point represents a stitch in SN-Net, which can be selected instantly at runtime without additional training cost (see the sketch after this list).
- SN-Netv2 produces 10x more stitches than SN-Netv1. For easier comparison, we highlight the Pareto frontier of SN-Netv2.
- The yellow star represents adopting an individual ViT as the backbone for downstream task adaptation.
- All models are trained with the same number of training iterations/epochs.
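Since every stitch shares one set of weights, switching between them at runtime amounts to selecting a different forward path. The toy model below illustrates this mechanism; `ToySNNet` and `reset_stitch_id` are illustrative names, not the repo's actual API (see classification for that).

```python
import torch
import torch.nn as nn

class ToySNNet(nn.Module):
    """Toy stand-in for SN-Netv2: each stitch id selects a different depth,
    i.e. a different FLOPs/accuracy trade-off, from one shared set of weights."""
    def __init__(self, dim: int = 16, max_depth: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(max_depth))
        self.stitch_configs = [2, 4, 6]  # number of blocks used by each stitch
        self.stitch_id = 0

    def reset_stitch_id(self, stitch_id: int):
        # Illustrative selector: no retraining, just a different forward path.
        self.stitch_id = stitch_id

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks[: self.stitch_configs[self.stitch_id]]:
            x = blk(x)
        return x

model = ToySNNet()
x = torch.randn(1, 16)
for sid in range(3):  # sweep the speed/accuracy trade-off at runtime
    model.reset_stitch_id(sid)
    with torch.no_grad():
        y = model(x)
```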
Image Classification on ImageNet-1K
Semantic Segmentation on ADE20K and COCO-Stuff-10K
*(Figure: ADE20K (left) and COCO-Stuff-10K (right).)*
Depth Estimation on NYUv2
<figure> <center> <figcaption>Stitching DeiT3-S and DeiT3-L based on DPT.</figcaption></center> <img src=".github/depth_estimation.png"> </figure>

Object Detection and Instance Segmentation on COCO-2017
<figure> <center> <figcaption>Stitching DeiT3-S and DeiT3-L based on Mask R-CNN/ViTDet.</figcaption></center> <img src=".github/coco_res.jpg"> </figure>

Training Efficiency Comparison
🚧 TODO List
- [x] Classification code
- [x] Segmentation code
- [x] Depth estimation code
- [ ] Detection code
- [x] Gradio demo
✍ Citation
If you use SN-Netv2 in your research, please consider citing the following BibTeX entry and giving us a star 🌟.
@article{pan2023snnetv2,
  title = {Stitched ViTs are Flexible Vision Backbones},
  author = {Pan, Zizheng and Liu, Jing and He, Haoyu and Cai, Jianfei and Zhuang, Bohan},
  journal = {arXiv},
  year = {2023}
}
If you find the code useful, please also consider citing the following BibTeX entry:
@inproceedings{pan2023snnetv1,
  title = {Stitchable Neural Networks},
  author = {Pan, Zizheng and Cai, Jianfei and Zhuang, Bohan},
  booktitle = {CVPR},
  year = {2023}
}
License
This repository is released under the Apache 2.0 license as found in the LICENSE file.