<div align="center"> <h1><a href="https://arxiv.org/abs/2307.08182">Zero-Shot Image Harmonization with <br /> Generative Model Prior</a></h1>

Jianqi Chen, Yilan Zhang, Zhengxia Zou, Keyan Chen, and Zhenwei Shi

</div>

DiffHarmon's preface

<div align="center"> <a href="https://www.youtube.com/watch?v=mfBTIVp6JBU&t=4s"><img src="assets/thumbnail.png" alt="Watch the video"></a> </div>

Give us a :star: if this repo helps you

This is the official repository of Diff-Harmonization. If you have any questions, please feel free to contact us: you can create an issue or send an email to windvchen@gmail.com. Ideas and discussion are also welcome.

BTW: You may also be interested in our other work, 😊INR-Harmonization. Based on Implicit Neural Representation, it is the first dense pixel-to-pixel method applicable to high-resolution (~6K) images without any hand-crafted filter design.

Updates

[03/10/2024] Released version 2 of our paper (access it from here; the previous version can still be accessed from here), together with the code! 🧐🧐 The main updates in this new version are:

[09/05/2023] The code is now publicly accessible. 👋👋 We are working 🏃🏃 on further improvements to the method (see Appendix D of the paper) to provide a better user experience, so stay tuned for more updates.

[07/18/2023] Repository init.

TODO

Possible future work (see the Limitation section of paper v2):

Abstract

DiffHarmon's framework

We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images in existing methods. These methods, while showing promising results, involve significant training expenses and often struggle with generalization to unseen images. To this end, we introduce a fully modularized framework inspired by human behavior. Leveraging the reasoning capabilities of recent foundation models in language and vision, our approach comprises three main stages. Initially, we employ a pretrained vision-language model (VLM) to generate descriptions for the composite image. Subsequently, these descriptions guide the foreground harmonization direction of a text-to-image generative model (T2I). We refine text embeddings for enhanced representation of imaging conditions and employ self-attention and edge maps for structure preservation. Following each harmonization iteration, an evaluator determines whether to conclude or modify the harmonization direction. The resulting framework, mirroring human behavior, achieves harmonious results without the need for extensive training. We present compelling visual results across diverse scenes and objects, along with a user study validating the effectiveness of our approach.
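The iterate-then-evaluate control flow described above can be sketched roughly as follows. This is only an illustration of the loop structure, not the repository's implementation: `run_harmonization_loop`, `harmonize_step`, and `evaluate` are hypothetical names, and feeding each result into the next iteration is an assumption made here for simplicity.

```python
from typing import Any, Callable, List, Tuple


def run_harmonization_loop(
    composite: Any,
    prompt: str,
    harmonize_step: Callable[[Any, str], Any],
    evaluate: Callable[[Any, str], Tuple[bool, str]],
    max_iterations: int = 10,
) -> List[Any]:
    """Iterate until the evaluator says stop or the iteration budget runs out.

    Both callables are hypothetical placeholders:
      harmonize_step(image, prompt) -> image   (stands in for the T2I stage)
      evaluate(image, prompt) -> (done, next_prompt)   (stands in for the evaluator)
    """
    results = []
    current = composite
    for _ in range(max_iterations):
        # One harmonization pass guided by the current text description.
        current = harmonize_step(current, prompt)
        results.append(current)
        # The evaluator either concludes or returns a (possibly modified) prompt.
        done, prompt = evaluate(current, prompt)
        if done:
            break
    return results


if __name__ == "__main__":
    # Toy stand-ins: the "image" is just an integer that grows by one per step.
    def toy_step(image, prompt):
        return image + 1

    def toy_evaluate(image, prompt):
        return image >= 3, prompt

    print(run_harmonization_loop(0, "girl winter", toy_step, toy_evaluate))
```

Every intermediate result is kept so that the best one can still be picked by hand afterwards.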

Requirements

  1. Hardware Requirements

    • GPU: 1x high-end NVIDIA GPU with at least 20GB memory
  2. Software Requirements

    • Python: 3.9 or above
    • CUDA: 11.3
    • cuDNN: 8.4.1

    To install other requirements, please check requirements.txt, or directly run the following command:

    pip install -r requirements.txt
    
  3. Data preparation

    • Demo data is provided in demo, so you can directly run the commands below to see the results.
    • If you want to test your own data, please follow the format of the demo data. Specifically, you need to prepare a composite image, a mask image, and a caption.
    • To generate captions automatically, please run gemini_mini_vision.py. Remember to set variables such as api_key, images_root, and masks_root in advance.
  4. Pre-trained Models

    • We adopt Stable Diffusion 2.0 as our diffusion model. You can load the pretrained weights by setting pretrained_diffusion_path="stabilityai/stable-diffusion-2-base" in main.py.
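Before installing anything, it may help to check the machine against the requirements listed above. The sketch below is not part of this repository: the function names are hypothetical, "20GB" is read as 20 GiB (an assumption), and the GPU query relies on the `nvidia-smi` tool being on the `PATH`.

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 9)              # from the software requirements above
MIN_GPU_MEMORY_MIB = 20 * 1024   # "at least 20GB", read here as 20 GiB (assumption)


def python_ok(version=None):
    """Return True if the interpreter meets the minimum Python version."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def parse_gpu_memory_mib(smi_output):
    """Parse one integer (MiB) per line; non-numeric lines are skipped."""
    values = []
    for line in smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def query_gpu_memory_mib():
    """Return total memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    completed = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if completed.returncode != 0:
        return []
    return parse_gpu_memory_mib(completed.stdout)


def gpu_ok(memory_mib):
    """Return True if at least one GPU reaches the memory threshold."""
    return any(value >= MIN_GPU_MEMORY_MIB for value in memory_mib)


if __name__ == "__main__":
    memory = query_gpu_memory_mib()
    print("Python version OK:", python_ok())
    print("GPU memory (MiB):", memory if memory else "no GPU detected")
    print("GPU memory OK:", gpu_ok(memory))
```

The threshold is approximate: a card sold as 20GB may report slightly less than 20480 MiB, so treat a near miss as a prompt to check manually rather than a hard failure.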

Harmonizing

The code supports harmonizing either a single image or multiple images. When the harmonization loop finishes, you can manually select the best of the harmonized results, or directly use the automatically selected result named final_output.

(Note: Since Diff-Harmonization is a zero-shot method, the results are not always good. If the results are poor, we recommend trying different initial environmental text.)

Harmonize a single image

python main.py --harmonize_iterations 10 --save_dir "./output" --is_single_image --image_path "./demo/girl_comp.jpg" --mask_path "./demo/girl_mask.jpg" --foreground_prompt "girl autumn" --background_prompt "girl winter" --pretrained_diffusion_path "stabilityai/stable-diffusion-2-base" --use_edge_map
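Following the note above about trying different initial environmental text, one possible way to compare several prompt pairs is to launch the single-image command once per pair. This is a minimal sketch rather than a script shipped with the repository: it assumes the environmental text corresponds to `--foreground_prompt` and `--background_prompt`, the helper names are hypothetical, and only the first prompt pair comes from the demo command.

```python
import shlex
import subprocess
import sys


def build_single_image_command(image_path, mask_path, foreground_prompt,
                               background_prompt, save_dir, iterations=10):
    """Assemble the single-image command from this README as an argument list."""
    return [
        sys.executable, "main.py",  # reuse the current Python interpreter
        "--harmonize_iterations", str(iterations),
        "--save_dir", save_dir,
        "--is_single_image",
        "--image_path", image_path,
        "--mask_path", mask_path,
        "--foreground_prompt", foreground_prompt,
        "--background_prompt", background_prompt,
        "--pretrained_diffusion_path", "stabilityai/stable-diffusion-2-base",
        "--use_edge_map",
    ]


# The first pair comes from the demo command; the second is a hypothetical example.
PROMPT_PAIRS = [
    ("girl autumn", "girl winter"),
    ("girl warm sunlight", "girl snowy overcast"),
]


def sweep(image_path, mask_path, prompt_pairs, dry_run=True):
    """Build one command per prompt pair; print them, or run them if dry_run is False."""
    commands = []
    for index, (foreground, background) in enumerate(prompt_pairs):
        command = build_single_image_command(
            image_path, mask_path, foreground, background,
            save_dir=f"./output/try_{index}",  # separate folder per attempt
        )
        commands.append(command)
        if dry_run:
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    sweep("./demo/girl_comp.jpg", "./demo/girl_mask.jpg", PROMPT_PAIRS)
```

By default the sketch only prints the commands. Pass `dry_run=False` to execute them from the repository root, then compare the results saved under each `try_*` folder.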

Harmonize a bunch of images

python main.py --harmonize_iterations 10 --save_dir "./output" --images_root "./demo/composite" --mask_path "./demo/mask" --caption_txt "./demo/caption.txt" --pretrained_diffusion_path "stabilityai/stable-diffusion-2-base" --use_edge_map
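Before a run over many images, it can be worth checking that every composite image has a matching mask. The sketch below is not part of the repository and rests on an assumed naming convention: files are matched by filename stem, with an optional `_comp` or `_mask` ending stripped (as in `girl_comp.jpg` and `girl_mask.jpg`). Adjust `pairing_key` if your data is named differently.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pairing_key(path):
    """Map a file to the key used for matching (assumed naming convention)."""
    stem = path.stem
    for marker in ("_comp", "_mask"):
        if stem.endswith(marker):
            return stem[: -len(marker)]
    return stem


def list_keys(root):
    """Collect the pairing keys of all image files directly inside root."""
    return {
        pairing_key(path)
        for path in Path(root).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    }


def find_unpaired(images_root, masks_root):
    """Return (images without a mask, masks without an image), both sorted."""
    images = list_keys(images_root)
    masks = list_keys(masks_root)
    return sorted(images - masks), sorted(masks - images)


if __name__ == "__main__":
    images_root, masks_root = Path("./demo/composite"), Path("./demo/mask")
    if images_root.is_dir() and masks_root.is_dir():
        missing_masks, extra_masks = find_unpaired(images_root, masks_root)
        print("Images without a mask:", missing_masks)
        print("Masks without an image:", extra_masks)
    else:
        print("Demo folders not found; run this from the repository root.")
```

The caption file is deliberately not checked here, since its format is not described in this README.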

Results

<div align=center><img src="assets/visualizations.png" alt="Visual comparisons3"></div> <div align=center><img src="assets/visualizations2.png" alt="Visual comparisons3"></div> <div align=center><img src="assets/visualizations3.png" alt="Visual comparisons3"></div> <div align=center><img src="assets/visualizations4.png" alt="Visual comparisons3" width=70% height=70%></div>

Citation & Acknowledgments

If you find this paper useful in your research, please consider citing:

@article{chen2023zero,
  title={Zero-Shot Image Harmonization with Generative Model Prior},
  author={Chen, Jianqi and Zou, Zhengxia and Zhang, Yilan and Chen, Keyan and Shi, Zhenwei},
  journal={arXiv preprint arXiv:2307.08182},
  year={2023}
}

We also thank the authors of Prompt-to-Prompt for their open-source code, on which some of our code is based.

License

This project is licensed under the Apache-2.0 license. See LICENSE for details.