
X-Decoder: Generalized Decoding for Pixel, Image, and Language

[Project Page] [Paper] [HuggingFace All-in-One Demo] [HuggingFace Instruct Demo] [Video]

by Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee^, Jianfeng Gao^ in CVPR 2023.

:hot_pepper: Getting Started

<!--
:point_right: *[New]* **One-Line Getting Started:**
```sh
sh asset/train.sh  # training
sh asset/eval.sh   # evaluation
```
-->

:point_right: [New] Latest Checkpoints and Numbers:

| Backbone | Checkpoint | COCO PQ | COCO mAP | COCO mIoU | ADE PQ | ADE mAP | ADE mIoU | Ref-COCO mIoU | Karpathy ir@1 | Karpathy tr@1 | Karpathy CIDEr |
|----------|---------------|---------|----------|-----------|--------|---------|----------|---------------|---------------|---------------|----------------|
| Focal-T  | last          | 50.8    | 39.5     | 62.4      | 9.6    | -       | 23.9     | 63.2          | 30.0          | 48.3          | 83.3           |
| Focal-T  | best_open_seg | 48.8    | 37.0     | 60.2      | 10.1   | -       | 29.1     | 61.6          | 30.2          | 48.3          | -              |
| Focal-L  | last          | 56.2    | 46.4     | 65.5      | 11.5   | -       | 23.6     | 67.7          | 34.9          | 54.4          | -              |
| Focal-L  | best_open_seg | 51.5    | 41.3     | 64.1      | 11.7   | -       | 29.4     | 61.5          | 30.7          | 50.1          | -              |

Note that the numbers in Table 1 of the main paper are reported after task-specific finetuning.
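The released Focal-T checkpoints are hosted in the `xdecoder/X-Decoder` HuggingFace repo. As a convenience, here is a minimal sketch that builds the direct download URLs from the resolve-URL pattern used by this README's links; only the three Focal-T file names that appear in this README are included (Focal-L file names are not listed here, so they are omitted):

```python
# Sketch: direct download URLs for the released Focal-T checkpoints.
# Repo id and file names are taken from the links in this README.
BASE = "https://huggingface.co/xdecoder/X-Decoder/resolve/main"

FOCALT_CHECKPOINTS = {
    "last": "xdecoder_focalt_last.pt",
    "last_novg": "xdecoder_focalt_last_novg.pt",
    "best_open_seg": "xdecoder_focalt_best_openseg.pt",
}

# Map each checkpoint name to its direct URL.
urls = {name: f"{BASE}/{fname}" for name, fname in FOCALT_CHECKPOINTS.items()}
for name, url in urls.items():
    print(f"{name}: {url}")
```

Each URL can then be fetched with `wget` or loaded directly via `torch.hub.load_state_dict_from_url`.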

:point_right: [New] Installation, Training, Evaluation, Dataset, and Demo Guide

:fire: News

<p align="center"> <img src="inference_demo/images/teaser_new.png" width="90%" height="90%"> </p>

:paintbrush: DEMO

:blueberries: [X-GPT]   :strawberry: [Instruct X-Decoder]


:notes: Introduction


X-Decoder is a generalized decoding model that can generate pixel-level segmentation and token-level texts seamlessly!

It achieves:

- State-of-the-art results on open-vocabulary segmentation and referring segmentation;
- Better or comparable finetuned performance to generalist and specialist models on segmentation and vision-language tasks;
- Efficient finetuning and flexible composition of novel tasks.

It supports:

- Semantic, instance, and panoptic segmentation;
- Referring segmentation;
- Image captioning and image-text retrieval;
- Zero-shot composition of these tasks (e.g., referring captioning).
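As a rough illustration of the generalized-decoding idea (one decoder whose latent queries yield pixel-level masks and whose text queries yield token-level vocabulary logits), here is a toy NumPy sketch; all shapes and names are made up for illustration and this is not the actual X-Decoder code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only)
D, HW, M, T, V = 16, 64, 4, 5, 100  # embed dim, pixels, mask queries, text queries, vocab size

def generalized_decode(pixel_feats, latent_queries, text_queries, w_vocab):
    """One decoder, two output types:
    latent queries -> per-query segmentation masks (dot product with pixel features),
    text queries   -> per-step vocabulary logits (projection to the vocabulary)."""
    masks = 1.0 / (1.0 + np.exp(-(latent_queries @ pixel_feats.T)))  # (M, HW), values in [0, 1]
    token_logits = text_queries @ w_vocab                            # (T, V)
    return masks, token_logits

pixel_feats = rng.standard_normal((HW, D))
latent_queries = rng.standard_normal((M, D))
text_queries = rng.standard_normal((T, D))
w_vocab = rng.standard_normal((D, V))

masks, token_logits = generalized_decode(pixel_feats, latent_queries, text_queries, w_vocab)
print(masks.shape, token_logits.shape)  # (4, 64) (5, 100)
```

Because both query types share the same decoder and embedding space, segmentation and text generation can be trained and composed with a single set of weights, which is the property the bullet points above describe.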

<!--
## Getting Started

### Installation
```sh
pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu113
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
pip install git+https://github.com/cocodataset/panopticapi.git
python -m pip install -r requirements.txt
sh install_cococapeval.sh
export DATASET=/pth/to/dataset
```

Here is the new link to download [coco_caption.zip](https://drive.google.com/file/d/1FHEQNkW7zHvSd-R8CQPC1gIuigC9w8Ff/view?usp=sharing).

To prepare the dataset: [DATASET.md](./DATASET.md)

## Open Vocabulary Segmentation
```sh
mpirun -n 8 python eval.py evaluate --conf_files configs/xdecoder/svlp_focalt_lang.yaml --overrides WEIGHT /pth/to/ckpt
```
Note: due to zero-padding, filling a single GPU with multiple images may decrease performance.

## Inference Demo
```sh
# For segmentation tasks
python demo/demo_semseg.py evaluate --conf_files configs/xdecoder/svlp_focalt_lang.yaml --overrides WEIGHT /pth/to/xdecoder_focalt_best_openseg.pt

# For VL tasks
python demo/demo_captioning.py evaluate --conf_files configs/xdecoder/svlp_focalt_lang.yaml --overrides WEIGHT /pth/to/xdecoder_focalt_last_novg.pt
```

## Model Zoo
|           |              | ADE  |      |      | ADE-full | SUN  | SCAN |      | SCAN40 | Cityscape |      |      | BDD  |      |
|-----------|--------------|------|------|------|----------|------|------|------|--------|-----------|------|------|------|------|
| model     | ckpt         | PQ   | AP   | mIoU | mIoU     | mIoU | PQ   | mIoU | mIoU   | PQ        | mAP  | mIoU | PQ   | mIoU |
| X-Decoder | [BestSeg Tiny](https://huggingface.co/xdecoder/X-Decoder/resolve/main/xdecoder_focalt_best_openseg.pt) | 19.1 | 10.1 | 25.1 | 6.2 | 35.7 | 30.3 | 38.4 | 22.4 | 37.7 | 18.5 | 50.2 | 16.9 | 47.6 |
| X-Decoder | [Last Tiny](https://projects4jw.blob.core.windows.net/x-decoder/release/xdecoder_focalt_last.pt) | | | | | | | | | | | | | |
| X-Decoder | [NoVG Tiny](https://projects4jw.blob.core.windows.net/x-decoder/release/xdecoder_focalt_last_novg.pt) | | | | | | | | | | | | | |
-->

<!--
* X-Decoder [NoVG Tiny](https://huggingface.co/xdecoder/X-Decoder/resolve/main/xdecoder_focalt_last_novg.pt)
* X-Decoder [Last Tiny](https://huggingface.co/xdecoder/X-Decoder/resolve/main/xdecoder_focalt_last.pt)

## Additional Results
* Finetuned ADE 150 (32 epochs)

| Model                           | Task    | Log | PQ   | mAP  | mIoU |
|---------------------------------|---------|-----|------|------|------|
| X-Decoder (davit-d5,Deformable) | PanoSeg | [log](https://projects4jw.blob.core.windows.net/x-decoder/release/ade20k_finetune_davitd5_deform_32epoch_log.txt) | 52.4 | 38.7 | 59.1 |
-->

Acknowledgement

Citation

@article{zou2022xdecoder,
  author      = {Zou*, Xueyan and Dou*, Zi-Yi and Yang*, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Behl, Harkirat and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee*, Yong Jae and Gao*, Jianfeng},
  title       = {Generalized Decoding for Pixel, Image and Language},
  publisher   = {arXiv},
  year        = {2022},
}