
<p align="center"> <h1 align="center"><a href="https://mlpc-ucsd.github.io/TokenCompose/">🧩 TokenCompose</a>: Text-to-Image Diffusion with Token-level Supervision</h1> <p align="center"> <a href="https://zwcolin.github.io/"><strong>Zirui Wang</strong></a><sup>1, 3</sup> · <a href="https://jamessand.github.io/"><strong>Zhizhou Sha</strong></a><sup>2, 3</sup> · <a href="https://github.com/zh-ding"><strong>Zheng Ding</strong></a><sup>3</sup> · <a href="https://github.com/modric197"><strong>Yilin Wang</strong></a><sup>2, 3</sup> · <a href="https://pages.ucsd.edu/~ztu/"><strong>Zhuowen Tu</strong></a><sup>3</sup> </p> <p align="center"> <sup>1</sup><strong>Princeton University</strong> · <sup>2</sup><strong>Tsinghua University</strong> · <sup>3</sup><strong>University of California, San Diego</strong> </p> <p align="center" style="font-size: 70%;"> <strong><i style="color:red;">CVPR 2024</i></strong> </p> <p align="center" style="font-size: 70%;"> <i>Project done while Zirui Wang, Zhizhou Sha and Yilin Wang interned at UC San Diego.</i> </p> </p> <h3 align="center"> <a href="https://mlpc-ucsd.github.io/TokenCompose/"><strong>Project Page</strong></a> | <a href="https://arxiv.org/abs/2312.03626"><strong>arXiv</strong></a> | <a href="https://x.com/zwcolin/status/1732578746949837205?s=46&t=_jLYQtkGRBhT0cOPjbEiiQ"><strong>X (Twitter)</strong></a> </h3>

Updates

If you use our method and/or model in your research project, we are happy to cross-reference your work here in the updates. :)

[04/04/2024] 🔥 Our training methodology is incorporated into CoMat, which shows enhanced text-to-image attribute assignment.
[02/26/2024] 🔥 TokenCompose is accepted to CVPR 2024!
[02/20/2024] 🔥 TokenCompose is used as a base model in the RealCompo paper for enhanced compositionality.

https://github.com/mlpc-ucsd/TokenCompose/assets/59942464/93feea16-4eac-49c3-b286-ee390a325b17

<p align="center"> A <span style="color: lightblue">Stable Diffusion</span> model finetuned with <strong>token-level consistency terms</strong> for enhanced <strong>multi-category instance composition</strong> and <strong>photorealism</strong>. </p> <br> <div align="center"> <img src="teaser.jpg" alt="Logo" width="100%"> </div> <table> <tr> <th rowspan="3" align="center">Method</th> <th colspan="9" align="center">Multi-category Instance Composition</th> <th colspan="2" align="center">Photorealism</th> <th colspan="1" align="center">Efficiency</th> </tr> <tr> <!-- <th align="center">&nbsp;</th> --> <th rowspan="2" align="center">Object Accuracy</th> <th colspan="4" align="center">COCO</th> <th colspan="4" align="center">ADE20K</th> <th rowspan="2" align="center">FID (COCO)</th> <th rowspan="2" align="center">FID (Flickr30K)</th> <th rowspan="2" align="center">Latency</th> </tr> <tr> <!-- <th align="center">&nbsp;</th> --> <th align="center">MG2</th> <th align="center">MG3</th> <th align="center">MG4</th> <th align="center">MG5</th> <th align="center">MG2</th> <th align="center">MG3</th> <th align="center">MG4</th> <th align="center">MG5</th> </tr> <tr> <td align="center"><a href="https://huggingface.co/CompVis/stable-diffusion-v1-4">SD 1.4</a></td> <td align="center">29.86</td> <td align="center">90.72<sub>1.33</sub></td> <td align="center">50.74<sub>0.89</sub></td> <td align="center">11.68<sub>0.45</sub></td> <td align="center">0.88<sub>0.21</sub></td> <td align="center">89.81<sub>0.40</sub></td> <td align="center">53.96<sub>1.14</sub></td> <td align="center">16.52<sub>1.13</sub></td> <td align="center">1.89<sub>0.34</sub></td> <td align="center"><u>20.88</u></td> <td align="center"><u>71.46</u></td> <td align="center"><b>7.54</b><sub>0.17</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/energy-based-model/Compositional-Visual-Generation-with-Composable-Diffusion-Models-PyTorch">Composable</a></td> <td align="center">27.83</td> <td align="center">63.33<sub>0.59</sub></td> <td align="center">21.87<sub>1.01</sub></td> <td align="center">3.25<sub>0.45</sub></td> <td align="center">0.23<sub>0.18</sub></td> <td align="center">69.61<sub>0.99</sub></td> <td align="center">29.96<sub>0.84</sub></td> <td align="center">6.89<sub>0.38</sub></td> <td align="center">0.73<sub>0.22</sub></td> <td align="center">-</td> <td align="center">75.57</td> <td align="center">13.81<sub>0.15</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/silent-chen/layout-guidance">Layout</a></td> <td align="center">43.59</td> <td align="center">93.22<sub>0.69</sub></td> <td align="center">60.15<sub>1.58</sub></td> <td align="center">19.49<sub>0.88</sub></td> <td align="center">2.27<sub>0.44</sub></td> <td align="center"><u>96.05</u><sub>0.34</sub></td> <td align="center"><u>67.83</u><sub>0.90</sub></td> <td align="center">21.93<sub>1.34</sub></td> <td align="center">2.35<sub>0.41</sub></td> <td align="center">-</td> <td align="center">74.00</td> <td align="center">18.89<sub>0.20</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/weixi-feng/Structured-Diffusion-Guidance">Structured</a></td> <td align="center">29.64</td> <td align="center">90.40<sub>1.06</sub></td> <td align="center">48.64<sub>1.32</sub></td> <td align="center">10.71<sub>0.92</sub></td> <td align="center">0.68<sub>0.25</sub></td> <td align="center">89.25<sub>0.72</sub></td> <td align="center">53.05<sub>1.20</sub></td> <td align="center">15.76<sub>0.86</sub></td> <td align="center">1.74<sub>0.49</sub></td> 
<td align="center">21.13</td> <td align="center">71.68</td> <td align="center"><u>7.74</u><sub>0.17</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/yuval-alaluf/Attend-and-Excite">Attn-Exct</a></td> <td align="center"><u>45.13</u></td> <td align="center"><u>93.64</u><sub>0.76</sub></td> <td align="center"><u>65.10</u><sub>1.24</sub></td> <td align="center"><u>28.01</u><sub>0.90</sub></td> <td align="center"><b>6.01</b><sub>0.61</sub></td> <td align="center">91.74<sub>0.49</sub></td> <td align="center">62.51<sub>0.94</sub></td> <td align="center"><u>26.12</u><sub>0.78</sub></td> <td align="center"><u>5.89</u><sub>0.40</sub></td> <td align="center">-</td> <td align="center">71.68</td> <td align="center">25.43<sub>4.89</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/mlpc-ucsd/TokenCompose"><strong>TokenCompose (Ours)</strong></a></td> <td align="center"><b>52.15</b></td> <td align="center"><b>98.08</b><sub>0.40</sub></td> <td align="center"><b>76.16</b><sub>1.04</sub></td> <td align="center"><b>28.81</b><sub>0.95</sub></td> <td align="center"><u>3.28</u><sub>0.48</sub></td> <td align="center"><b>97.75</b><sub>0.34</sub></td> <td align="center"><b>76.93</b><sub>1.09</sub></td> <td align="center"><b>33.92</b><sub>1.47</sub></td> <td align="center"><b>6.21</b><sub>0.62</sub></td> <td align="center"><b>20.19</b></td> <td align="center"><b>71.13</b></td> <td align="center"><b>7.56</b><sub>0.14</sub></td> </tr> </table>

🆕 Models

| Stable Diffusion Version | Checkpoint 1 | Checkpoint 2 |
| --- | --- | --- |
| v1.4 | TokenCompose_SD14_A | TokenCompose_SD14_B |
| v2.1 | TokenCompose_SD21_A | TokenCompose_SD21_B |

Our finetuned models do not contain any extra modules and can be used directly in a standard diffusion model library (e.g., Hugging Face's Diffusers) by replacing the pretrained U-Net with our finetuned U-Net in a plug-and-play manner. We provide a demo Jupyter notebook that uses our model checkpoint to generate images.
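
For example, the U-Net swap can be done explicitly as in the minimal sketch below (assuming the checkpoint follows the standard Diffusers pipeline layout with a unet subfolder):

import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

# Load only the finetuned U-Net (assumes the standard Diffusers layout with a "unet" subfolder).
unet = UNet2DConditionModel.from_pretrained(
    "mlpc-lab/TokenCompose_SD14_A", subfolder="unet", torch_dtype=torch.float32
)

# Plug it into an off-the-shelf Stable Diffusion 1.4 pipeline; the text encoder,
# VAE, and scheduler stay unchanged.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float32
).to("cuda")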

You can also use the following code to download our checkpoints and generate images:

import torch
from diffusers import StableDiffusionPipeline

model_id = "mlpc-lab/TokenCompose_SD14_A"
device = "cuda"

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
pipe = pipe.to(device)

prompt = "A cat and a wine glass"
image = pipe(prompt).images[0]
image.save("cat_and_wine_glass.png")
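
For reproducible outputs, you can optionally pass a seeded generator to the pipeline call:

# Optional: fix the random seed so repeated runs produce the same image.
generator = torch.Generator(device=device).manual_seed(42)
image = pipe(prompt, generator=generator).images[0]
image.save("cat_and_wine_glass_seed42.png")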

📊 MultiGen

See MultiGen for details.

<table> <tr> <th rowspan="2" align="center">Method</th> <th colspan="4" align="center">COCO</th> <th colspan="4" align="center">ADE20K</th> </tr> <tr> <!-- <th align="center">&nbsp;</th> --> <th align="center">MG2</th> <th align="center">MG3</th> <th align="center">MG4</th> <th align="center">MG5</th> <th align="center">MG2</th> <th align="center">MG3</th> <th align="center">MG4</th> <th align="center">MG5</th> </tr> <tr> <td align="center"><a href="https://huggingface.co/CompVis/stable-diffusion-v1-4">SD 1.4</a></td> <td align="center">90.72<sub>1.33</sub></td> <td align="center">50.74<sub>0.89</sub></td> <td align="center">11.68<sub>0.45</sub></td> <td align="center">0.88<sub>0.21</sub></td> <td align="center">89.81<sub>0.40</sub></td> <td align="center">53.96<sub>1.14</sub></td> <td align="center">16.52<sub>1.13</sub></td> <td align="center">1.89<sub>0.34</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/energy-based-model/Compositional-Visual-Generation-with-Composable-Diffusion-Models-PyTorch">Composable</a></td> <td align="center">63.33<sub>0.59</sub></td> <td align="center">21.87<sub>1.01</sub></td> <td align="center">3.25<sub>0.45</sub></td> <td align="center">0.23<sub>0.18</sub></td> <td align="center">69.61<sub>0.99</sub></td> <td align="center">29.96<sub>0.84</sub></td> <td align="center">6.89<sub>0.38</sub></td> <td align="center">0.73<sub>0.22</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/silent-chen/layout-guidance">Layout</a></td> <td align="center">93.22<sub>0.69</sub></td> <td align="center">60.15<sub>1.58</sub></td> <td align="center">19.49<sub>0.88</sub></td> <td align="center">2.27<sub>0.44</sub></td> <td align="center"><u>96.05</u><sub>0.34</sub></td> <td align="center"><u>67.83</u><sub>0.90</sub></td> <td align="center">21.93<sub>1.34</sub></td> <td align="center">2.35<sub>0.41</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/weixi-feng/Structured-Diffusion-Guidance">Structured</a></td> <td align="center">90.40<sub>1.06</sub></td> <td align="center">48.64<sub>1.32</sub></td> <td align="center">10.71<sub>0.92</sub></td> <td align="center">0.68<sub>0.25</sub></td> <td align="center">89.25<sub>0.72</sub></td> <td align="center">53.05<sub>1.20</sub></td> <td align="center">15.76<sub>0.86</sub></td> <td align="center">1.74<sub>0.49</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/yuval-alaluf/Attend-and-Excite">Attn-Exct</a></td> <td align="center"><u>93.64</u><sub>0.76</sub></td> <td align="center"><u>65.10</u><sub>1.24</sub></td> <td align="center"><u>28.01</u><sub>0.90</sub></td> <td align="center"><b>6.01</b><sub>0.61</sub></td> <td align="center">91.74<sub>0.49</sub></td> <td align="center">62.51<sub>0.94</sub></td> <td align="center"><u>26.12</u><sub>0.78</sub></td> <td align="center"><u>5.89</u><sub>0.40</sub></td> </tr> <tr> <td align="center"><a href="https://github.com/mlpc-ucsd/TokenCompose">Ours</a></td> <td align="center"><b>98.08</b><sub>0.40</sub></td> <td align="center"><b>76.16</b><sub>1.04</sub></td> <td align="center"><b>28.81</b><sub>0.95</sub></td> <td align="center"><u>3.28</u><sub>0.48</sub></td> <td align="center"><b>97.75</b><sub>0.34</sub></td> <td align="center"><b>76.93</b><sub>1.09</sub></td> <td align="center"><b>33.92</b><sub>1.47</sub></td> <td align="center"><b>6.21</b><sub>0.62</sub></td> </tr> </table>

💻 Environment Setup

If you want to use our codebase to train your own diffusion models with token-level objectives, follow the instructions below:

conda create -n TokenCompose python=3.8.5
conda activate TokenCompose
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt

We have verified the environment setup with these specific package versions, but we expect it to work with newer versions as well.
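
As a quick sanity check (a minimal sketch; adjust as needed), you can confirm that PyTorch sees your GPU before training:

import torch

# Confirm the expected PyTorch version and that CUDA is visible before training.
print(torch.__version__)            # 1.13.1 with the pinned environment above
print(torch.cuda.is_available())    # should print True on a CUDA 11.7 machine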

🛠️ Dataset Setup

If you want to use your own data, please refer to preprocess_data for details.

If you want to use our training data as examples or for research purposes, follow the instructions below:

1. Setup the COCO Image Data

cd train/data
# download COCO train2017
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
rm train2017.zip
bash coco_data_setup.sh

After this step, you should have the following structure under the train/data directory:

train/data/
    coco_gsam_img/
        train/
            000000000142.jpg
            000000000370.jpg
            ...

2. Setup Token-wise Grounded Segmentation Maps

Download the COCO segmentation data from Google Drive and put it under the train/data directory.

After this step, you should have the following structure under the train/data directory:

train/data/
    coco_gsam_img/
        train/
            000000000142.jpg
            000000000370.jpg
            ...
    coco_gsam_seg.tar

Then, run the following commands to extract the segmentation data:

cd train/data
tar -xvf coco_gsam_seg.tar
rm coco_gsam_seg.tar

After the setup, you should have the following structure under the train/data directory:

train/data/
    coco_gsam_img/
        train/
            000000000142.jpg
            000000000370.jpg
            ...
    coco_gsam_seg/
        000000000142/
            mask_000000000142_bananas.png
            mask_000000000142_bread.png
            ...
        000000000370/
            mask_000000000370_bananas.png
            mask_000000000370_bread.png
            ...
        ...
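
For reference, the sketch below shows how the per-token masks pair with images. It assumes only the directory layout above (the training code under train/ handles this for you):

from pathlib import Path
from PIL import Image

data_root = Path("train/data")

# Each image in coco_gsam_img/train has a folder of per-token masks in coco_gsam_seg,
# named mask_<image_id>_<token>.png.
for img_path in sorted((data_root / "coco_gsam_img" / "train").glob("*.jpg")):
    image_id = img_path.stem
    mask_dir = data_root / "coco_gsam_seg" / image_id
    if not mask_dir.is_dir():
        continue
    image = Image.open(img_path).convert("RGB")
    for mask_path in sorted(mask_dir.glob(f"mask_{image_id}_*.png")):
        token = mask_path.stem.split("_", 2)[-1]  # e.g., "bananas"
        mask = Image.open(mask_path).convert("L")
        # (image, token, mask) is one token-level supervision example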

📈 Training

We use wandb to log training curves and visualizations. Log in to wandb before running the scripts.

wandb login

Then, to start TokenCompose training, run the following commands:

cd train
bash train.sh

The results will be saved under the train/results directory.
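
The finetuned U-Net saved there can then be plugged back into a Stable Diffusion pipeline in the same plug-and-play way as above. The sketch below uses <run_dir> as a placeholder for whatever checkpoint directory train.sh writes, and the exact output layout may differ:

import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

# <run_dir> is a placeholder; point this at the U-Net checkpoint under train/results.
unet = UNet2DConditionModel.from_pretrained("train/results/<run_dir>/unet")

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float32
).to("cuda")
pipe("A cat and a wine glass").images[0].save("sample.png")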

🏷️ License

This repository is released under the Apache 2.0 license.

🙏 Acknowledgement

Our code is built upon diffusers, prompt-to-prompt, VISOR, Grounded-Segment-Anything, and CLIP. We thank the authors of these projects for open-sourcing their code and for their great contributions to the community.

📝 Citation

If you find our work useful, please consider citing:

@InProceedings{Wang2024TokenCompose,
    author    = {Wang, Zirui and Sha, Zhizhou and Ding, Zheng and Wang, Yilin and Tu, Zhuowen},
    title     = {TokenCompose: Text-to-Image Diffusion with Token-level Supervision},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {8553-8564}
}