# X-Decoder: Generalized Decoding for Pixel, Image, and Language
[Project Page] [Paper] [HuggingFace All-in-One Demo] [HuggingFace Instruct Demo] [Video]
by Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee^, Jianfeng Gao^ in CVPR 2023.
## :hot_pepper: Getting Started
<!-- :point_right: *[New]* **One-Line Getting Started:**
```sh
sh asset/train.sh # training
sh asset/eval.sh  # evaluation
```
-->

:point_right: [New] Latest Checkpoints and Numbers:
| Backbone | Checkpoint | COCO PQ | COCO mAP | COCO mIoU | ADE PQ | ADE mAP | ADE mIoU | Ref-COCO mIoU | COCO-Karpathy ir@1 | COCO-Karpathy tr@1 | COCO-Karpathy CIDEr |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Focal-T | last | 50.8 | 39.5 | 62.4 | 9.6 | 23.9 | 63.2 | 30.0 | 48.3 | 83.3 | |
| Focal-T | best_open_seg | 48.8 | 37.0 | 60.2 | 10.1 | 29.1 | 61.6 | 30.2 | 48.36 | | |
| Focal-L | last | 56.2 | 46.4 | 65.5 | 11.5 | 23.6 | 67.7 | 34.9 | 54.4 | | |
| Focal-L | best_open_seg | 51.5 | 41.3 | 64.1 | 11.7 | 29.4 | 61.5 | 30.7 | 50.1 | | |
Note that the numbers in Table 1 of the main paper are reported after task-specific finetuning.
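If you just want to sanity-check a downloaded checkpoint before wiring up the full pipeline, a generic PyTorch sketch works (the file name below is an assumption; substitute the path of the weight file you actually downloaded):

```python
import torch

# Peek inside a downloaded X-Decoder checkpoint (file name is illustrative).
state = torch.load("xdecoder_focalt_last.pt", map_location="cpu")

# Some checkpoints wrap the weights, e.g. under a "model" key; unwrap if so.
weights = state.get("model", state) if isinstance(state, dict) else state

print(f"{len(weights)} tensors")
for name, tensor in list(weights.items())[:5]:
    print(name, tuple(tensor.shape))
```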
:point_right: [New] Installation, Training, Evaluation, Dataset, and Demo Guide
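For orientation before reading the guides, here is a minimal inference sketch. The `xdecoder` import and the `build_model` / `load_checkpoint` / `panoptic_inference` helpers are hypothetical stand-ins, not the repo's actual API; see INSTALL.md and the demo scripts for the real entry points:

```python
import torch
from PIL import Image

# Hypothetical entry points -- stand-ins for the builders documented in
# INSTALL.md and the demo scripts; names and signatures are assumptions.
from xdecoder import build_model, load_checkpoint, panoptic_inference

model = build_model(backbone="focal_t")            # hypothetical
load_checkpoint(model, "xdecoder_focalt_last.pt")  # hypothetical
model.eval()

image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    panoptic_seg, segments_info = panoptic_inference(model, image)  # hypothetical
print(segments_info)
```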
## :fire: News
- [2023.07.19] :roller_coaster: We are excited to release the X-Decoder training code (INSTALL.md, DATASET.md, TRAIN.md, EVALUATION.md)!
- [2023.07.10] We release Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity. Code and checkpoints are available!
- [2023.04.14] We are releasing SEEM, a new universal interactive interface for image segmentation! You can use it for any segmentation tasks, way beyond what X-Decoder can do!
- [2023.03.20] In the spirit of X-Decoder, we developed OpenSeeD ([Paper][Code]) to enable open-vocabulary segmentation and detection with a single model. Check it out!
- [2023.03.14] We release X-GPT, a conversational version of our X-Decoder built with GPT-3 and LangChain!
- [2023.03.01] The Segmentation in the Wild Challenge has been launched and is ready for submissions!
- [2023.02.28] We released the SGinW benchmark for our challenge. Welcome to build your own models on the benchmark!
- [2023.02.27] Our X-Decoder has been accepted by CVPR 2023!
- [2023.02.07] We combine <ins>X-Decoder</ins> (strong image understanding), <ins>GPT-3</ins> (strong language understanding), and <ins>Stable Diffusion</ins> (strong image generation) to make an instructional image editing demo. Check it out!
- [2022.12.21] We release inference code of X-Decoder.
- [2022.12.21] We release Focal-T pretrained checkpoint.
- [2022.12.21] We release open-vocabulary segmentation benchmark.
## :paintbrush: DEMO
:blueberries: [X-GPT] :strawberry: [Instruct X-Decoder]
## :notes: Introduction
X-Decoder is a generalized decoding model that can generate pixel-level segmentation and token-level texts seamlessly!
It achieves:
- State-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets;
- Finetuned performance better than or competitive with generalist and specialist models on segmentation and VL tasks;
- Friendly for efficient finetuning and flexible for novel task composition.
It supports:
- One suite of parameters pretrained for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, and Image-Text Retrieval;
- One model architecture finetuned for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, Image-Text Retrieval and Visual Question Answering (with an extra cls head);
- Zero-shot task composition for Region Retrieval, Referring Captioning, and Image Editing (see the sketch after this list).
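To make zero-shot task composition concrete, below is a minimal sketch of how region retrieval could be composed from two pretrained abilities: class-agnostic mask proposals plus image-text matching on the mask embeddings. The `propose_masks` and `encode_text` methods are hypothetical stand-ins, not the repo's API:

```python
import torch
import torch.nn.functional as F

def region_retrieval(model, image, query: str):
    """Rank mask proposals against a text query (hypothetical API sketch)."""
    # (1) Class-agnostic mask proposals with per-mask embeddings -- hypothetical.
    masks, mask_embeddings = model.propose_masks(image)
    # (2) Embed the query with the text encoder -- hypothetical.
    text_embedding = model.encode_text(query)
    # Compose the two: cosine similarity between region and text embeddings.
    scores = F.cosine_similarity(mask_embeddings, text_embedding.unsqueeze(0), dim=-1)
    best = scores.argmax().item()
    return masks[best], scores[best].item()
```

The same composition pattern would give referring captioning: select the best-matching region for a phrase, then feed that region to the captioning head.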
## Acknowledgement
- We appreciate the constructive discussion with Haotian Zhang
- We build our work on top of Mask2Former
- We build our demos on HuggingFace :hugs: with sponsored GPUs
- We appreciate the discussion with Xiaoyu Xiang during rebuttal
## Citation
```bibtex
@article{zou2022xdecoder,
  author    = {Zou*, Xueyan and Dou*, Zi-Yi and Yang*, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee*, Yong Jae and Gao*, Jianfeng},
  title     = {Generalized Decoding for Pixel, Image and Language},
  publisher = {arXiv},
  year      = {2022},
}
```