Cocktail🍸: Mixing Multi-Modality Controls for Text-Conditional Image Generation

James Bond is drinking Cocktail🍸.

https://github.com/mhh0318/Cocktail/assets/42776955/e2a93a6d-3e36-4e54-8462-b359fa8946fa

Our approach requires only [one generalized model], unlike previous approaches, which needed multiple models to mix multiple modalities.

Unlike existing schemes, ours does not require modifying the modal prior of the base model, as in Fig. (a), which significantly reduces cost. Nor does it need a separate model for each modality, as in Fig. (b). Instead, Cocktail🍸 fuses the information from multiple modalities, as shown in Fig. (c).

Abstract

We propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models.

Pipeline

The parameters indicated by the yellow sections are sourced from the pre-trained model and stay constant, while only those in the blue sections are updated during training, with the gradient back-propagated along the blue arrows. The light grey dashed sections signify additional operations that occur solely during the inference process, specifically, the process of storing attention maps derived from the gControlNet for the sampling stage.

Results

[Examples] Cocktail for Multi-modality

[Examples] Cocktail for free-modality

[Comparisons] single-modality

[Comparisons] multi-modality

Here, the "cross" symbol ❌ and the checkmark symbol ✅ denote the unmatched and matched modalities, respectively. It is important to note that our model accurately captures all modalities.

TODO

Release Gradio Demo
Release sampling codes
Release inference codes
Release pre-trained models

Setup

Installation Requirements

You can create an anaconda environment called cocktail with the required dependencies by running:

git clone https://github.com/mhh0318/cocktail.git
cd cocktail
conda env create -f environment.yaml

Download Pretrained Weights

Download the pretrained models from here, and save them to the root directory of the repository.

Gradio Demo

The Gradio demo can be launched with:

python gradio_demo.py [--share]

Annotations

We use HED, SAN, and OpenPose to extract the sketch map, segmentation map, and human pose map from the image.

Extract sketch map:

python annotator/hed.py {/path/to/image.png} {/path/to/sketch.png}

Extract segmentation map:

python annotator/SAN/run.py {/path/to/image.png} {/path/to/seg.png}

Extract human pose map:

python annotator/openpose/run.py {/path/to/image.png} {/path/to/openpose.png}
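The three extraction commands above can be driven from a single script. The following is a minimal sketch rather than part of the repository: the helper names `build_annotation_commands` and `run_annotations` are hypothetical, and the output file names (`sketch.png`, `seg.png`, `openpose.png`) are assumptions chosen for illustration. Only the annotator script paths come from the commands above, and the sketch assumes it is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path

# Script paths are taken from the commands above. The output names
# (sketch.png, seg.png, openpose.png) are assumptions made for this sketch.
ANNOTATORS = {
    "sketch": "annotator/hed.py",
    "seg": "annotator/SAN/run.py",
    "openpose": "annotator/openpose/run.py",
}


def build_annotation_commands(image_path, out_dir):
    """Return one argument list per annotator for a single input image."""
    out_dir = Path(out_dir)
    return [
        # sys.executable stands in for `python` so the active environment is used.
        [sys.executable, script, str(image_path), (out_dir / f"{name}.png").as_posix()]
        for name, script in ANNOTATORS.items()
    ]


def run_annotations(image_path, out_dir):
    """Run the three annotators in turn, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_annotation_commands(image_path, out_dir):
        subprocess.run(cmd, check=True)


# Example call (hypothetical paths), run from the repository root:
# run_annotations("/path/to/image.png", "/path/to/maps")
```

Keeping command construction separate from execution makes it possible to inspect or log the commands before any annotator is run.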

Quick Inference

For simultaneous vision-language generation, run:

python ./inference {args}

Here, args is an integer, either 0 or 1, selecting one of the two provided example conditions.

If the environment is set up correctly, this command should run and write its results to ./samples/results/{args}_sample_{batch}.png.
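To check which result images a run has produced, the output path pattern above can be expressed in a few lines of Python. This is a sketch under stated assumptions: the helper names `result_path` and `find_results` are hypothetical, and `batch` is treated as an opaque index because this README does not say how batches are numbered.

```python
from pathlib import Path

RESULTS_DIR = Path("samples/results")  # from the path pattern above
VALID_CONDITIONS = (0, 1)              # the two provided example conditions


def _check_condition(condition):
    """Reject anything other than the two documented example conditions."""
    if condition not in VALID_CONDITIONS:
        raise ValueError(f"condition must be 0 or 1, got {condition!r}")


def result_path(condition, batch):
    """Build the expected output path for one condition and batch index.

    How batch indices are numbered is an assumption left to the caller.
    """
    _check_condition(condition)
    return RESULTS_DIR / f"{condition}_sample_{batch}.png"


def find_results(condition, root="."):
    """List result images already written for a condition, sorted by name."""
    _check_condition(condition)
    return sorted((Path(root) / RESULTS_DIR).glob(f"{condition}_sample_*.png"))


# Example call, run from the repository root after inference:
# for path in find_results(0):
#     print(path)
```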

Comments

Our codebase for the diffusion models builds heavily on ControlNet and Stable Diffusion.

Thanks to the authors for open-sourcing their work!

Citation

If you use this code for your research, please cite our paper.

@article{hu2023cocktail,
  title = {Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation},
  author = {Hu, Minghui and Zheng, Jianbin and Liu, Daqing and Zheng, Chuanxia and Wang, Chaoyue and Tao, Dacheng and Cham, Tat-Jen},
  journal = {arXiv},
  year = {2023},
}