<div align="center"> <h1> ViTMatte🐒</h1> <h3> Boosting Image Matting with Pretrained Plain Vision Transformers</h3>Jingfeng Yao<sup>1</sup>, Xinggang Wang<sup>1 📧</sup>, Shusheng Yang<sup>1</sup>, Baoyuan Wang<sup>2</sup>
<sup>1</sup> School of EIC, HUST, <sup>2</sup> Xiaobing.AI
(<sup>📧</sup>) corresponding author.
</div>
News
- May 24th, 2024: ViTMatte has been brought to The Foundry's Nuke. Here is a bilibili tutorial. Thanks a lot!
- Oct 19th, 2023: ViTMatte has been accepted by Information Fusion (IF=18.6)!
- Sep 21st, 2023: ViTMatte is now available in 🤗HuggingFace Transformers! Many thanks to Niels!
- June 12th, 2023: We released a Google Colab demo. Try ViTMatte online!
- June 9th, 2023: Many thanks to Lucas for creating ViT and tweeting about our ViTMatte paper!
- June 8th, 2023: Matte Anything is released! If you like ViTMatte, you may also like Matte Anything.
- May 27th, 2023: We released the pretrained weights of ViTMatte!
- May 25th, 2023: We released the code of ViTMatte. The pretrained models will be coming soon!
- May 24th, 2023: We released our paper on arXiv.
Introduction
<div align="center"><h4>Plain Vision Transformers can also do image matting with the simple ViTMatte framework!</h4></div>
Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method (i) uses a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks, and (ii) introduces a detail capture module, which consists of simple lightweight convolutions, to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViT on image matting with concise adaptation. It inherits many superior properties from ViT, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmarks for image matting. Our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.
Get Started
Demo
You can matte the demo image with its corresponding trimap by running:
python run_one_image.py \
--model vitmatte-s \
--checkpoint-dir path/to/checkpoint
The demo images will be saved in ./demo.
You can also run the same script on your own image and trimap.
You can also try ViTMatte in the online demo. It is a simple demo that shows what ViTMatte can do.
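When preparing your own inputs, a trimap is often derived from a rough binary foreground mask. The sketch below shows one simple way to do that. It is not part of the ViTMatte code base: the helper name `make_trimap`, the `radius` parameter, and the 0/128/255 encoding (background/unknown/foreground) are assumptions for illustration, so check the value convention that run_one_image.py expects before relying on it. Pure Python is used to keep the example self-contained; a real pipeline would more likely use an image library's erosion and dilation.

```python
def make_trimap(mask, radius=1):
    """Build a trimap from a binary mask given as a list of rows of 0/1.

    A pixel is marked foreground (255) or background (0) only when every
    pixel in the square window of the given radius (clipped at the image
    border) agrees with it. All remaining pixels form the unknown band (128).
    The 0/128/255 encoding is an assumed convention, not taken from the repo.
    """
    height, width = len(mask), len(mask[0])
    trimap = [[128] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            window = [mask[j][i] for j in rows for i in cols]
            if all(window):
                trimap[y][x] = 255  # safely inside the foreground
            elif not any(window):
                trimap[y][x] = 0    # safely inside the background
    return trimap


# Toy 7x7 mask with a 3x3 foreground square in the middle.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
trimap = make_trimap(mask, radius=1)
for row in trimap:
    print(row)
```

A larger `radius` widens the unknown band, which gives the matting model more room to recover fine detail such as hair, at the cost of a harder problem.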
Results
Quantitative Results on Composition-1k
Model | SAD | MSE | Grad | Conn | checkpoints |
---|---|---|---|---|---|
ViTMatte-S | 21.46 | 3.3 | 7.24 | 16.21 | GoogleDrive |
ViTMatte-B | 20.33 | 3.0 | 6.74 | 14.78 | GoogleDrive |
Quantitative Results on Distinctions-646
Model | SAD | MSE | Grad | Conn | checkpoints |
---|---|---|---|---|---|
ViTMatte-S | 21.22 | 2.1 | 8.78 | 17.55 | GoogleDrive |
ViTMatte-B | 17.05 | 1.5 | 7.03 | 12.95 | GoogleDrive |
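For readers unfamiliar with the metrics in the tables above, here is a minimal sketch of how SAD and MSE are commonly computed in the matting literature. It is not the evaluation script used for these results. It assumes that both metrics are measured only over the unknown region of the trimap, that SAD is reported in units of 10^3, and that MSE is reported multiplied by 10^3. The function name `sad_and_mse` and the `unknown_value` parameter are hypothetical. Grad and Conn are more involved and are not covered here.

```python
def sad_and_mse(pred, gt, trimap, unknown_value=128):
    """Return (SAD, MSE) restricted to the unknown region of the trimap.

    pred, gt: alpha mattes as lists of rows with values in [0, 1].
    trimap:   same shape; pixels equal to `unknown_value` are evaluated.
    The scaling factors below are assumed reporting conventions.
    """
    abs_sum = 0.0
    sq_sum = 0.0
    count = 0
    for pred_row, gt_row, tri_row in zip(pred, gt, trimap):
        for p, g, t in zip(pred_row, gt_row, tri_row):
            if t == unknown_value:
                diff = p - g
                abs_sum += abs(diff)
                sq_sum += diff * diff
                count += 1
    if count == 0:
        raise ValueError("trimap has no unknown pixels")
    sad = abs_sum / 1000.0          # assumed: SAD in units of 1e3
    mse = sq_sum / count * 1000.0   # assumed: MSE scaled by 1e3
    return sad, mse


# Tiny 2x2 example: only the right-hand column is in the unknown region.
pred = [[0.0, 0.5], [1.0, 0.5]]
gt = [[0.0, 1.0], [1.0, 0.0]]
trimap = [[0, 128], [255, 128]]
sad, mse = sad_and_mse(pred, gt, trimap)
print(sad, mse)
```

Lower is better for both metrics. Restricting the computation to the unknown region keeps the known foreground and background pixels from diluting the error.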
Citation
@article{yao2024vitmatte,
title={ViTMatte: Boosting image matting with pre-trained plain vision transformers},
author={Yao, Jingfeng and Wang, Xinggang and Yang, Shusheng and Wang, Baoyuan},
journal={Information Fusion},
volume={103},
pages={102091},
year={2024},
publisher={Elsevier}
}