Awesome Generative Image Composition
A curated list of resources including papers, datasets, and relevant links pertaining to generative image composition (object insertion). Generative image composition aims to generate plausible composite images from a background image (with an optional bounding box) and one or a few foreground images of a specific object. For more complete resources on general image composition, please refer to Awesome-Image-Composition.
<p align='center'> <img src='./figures/task.jpg' width=90% /> </p>
Contributing
Contributions are welcome. If you wish to contribute, feel free to send a pull request. If you have suggestions for new sections to be included, please raise an issue and discuss before sending a pull request.
Survey
A brief review on generative image composition is included in the following survey on image composition:
Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, Liqing Zhang: "Making Images Real Again: A Comprehensive Survey on Deep Image Composition." arXiv preprint arXiv:2106.14490 (2021). [arXiv] [slides]
Online Demo
Try this online demo for generative image composition and have fun!
Evaluation Metrics
Test Set
- COCOEE (within-domain, single-ref): 500 background images from the MSCOCO validation set. Each background image has a bounding box and a foreground image from the MSCOCO training set.
- TF-ICON test benchmark (cross-domain, single-ref): 332 samples. Each sample consists of a background image, a foreground image, a user mask, and a text prompt.
- FOSCom (within-domain, single-ref): 640 background images from the Internet. Each background image has a manually annotated bounding box and a foreground image from the MSCOCO training set.
- DreamEditBench (within-domain, multi-ref): 220 background images and 30 unique foreground objects from 15 categories.
- MureCom (within-domain, multi-ref): 640 background images and 96 unique foreground objects from 32 categories.
Leaderboard
The training set is open. The test set is the COCOEE benchmark. Partial results are copied from ControlCom. Honestly speaking, the following evaluation metrics are not very reliable. For a more comprehensive and interpretable evaluation, you can refer to this summary of evaluation metrics.
<table class="tg">
  <tr>
    <th class="tg-0pky" rowspan="2" align="center">Method</th>
    <th class="tg-0pky" colspan="3" align="center">Foreground</th>
    <th class="tg-0pky" colspan="2" align="center">Background</th>
    <th class="tg-0pky" colspan="2" align="center">Overall</th>
  </tr>
  <tr>
    <th class="tg-0pky" align="center">CLIP↑</th>
    <th class="tg-0pky" align="center">DINO↑</th>
    <th class="tg-0pky" align="center">FID↓</th>
    <th class="tg-0pky" align="center">LSSIM↑</th>
    <th class="tg-0pky" align="center">LPIPS↓</th>
    <th class="tg-0pky" align="center">FID↓</th>
    <th class="tg-0pky" align="center">QS↑</th>
  </tr>
  <tr>
    <th class="tg-0pky" align="center">Inpaint&Paste</th>
    <th class="tg-0pky" align="center">-</th>
    <th class="tg-0pky" align="center">-</th>
    <th class="tg-0pky" align="center">8.0</th>
    <th class="tg-0pky" align="center">-</th>
    <th class="tg-0pky" align="center">-</th>
    <th class="tg-0pky" align="center">3.64</th>
    <th class="tg-0pky" align="center">72.07</th>
  </tr>
  <tr>
    <th class="tg-0pky" align="center"><a href="https://arxiv.org/pdf/2211.13227.pdf">PBE</a></th>
    <th class="tg-0pky" align="center">84.84</th>
    <th class="tg-0pky" align="center">52.52</th>
    <th class="tg-0pky" align="center">6.24</th>
    <th class="tg-0pky" align="center">0.823</th>
    <th class="tg-0pky" align="center">0.116</th>
    <th class="tg-0pky" align="center">3.18</th>
    <th class="tg-0pky" align="center">77.80</th>
  </tr>
  <tr>
    <th class="tg-0pky" align="center"><a href="https://arxiv.org/pdf/2212.00932.pdf">ObjectStitch</a></th>
    <th class="tg-0pky" align="center">85.97</th>
    <th class="tg-0pky" align="center">61.12</th>
    <th class="tg-0pky" align="center">6.86</th>
    <th class="tg-0pky" align="center">0.825</th>
    <th class="tg-0pky" align="center">0.116</th>
    <th class="tg-0pky" align="center">3.35</th>
    <th class="tg-0pky" align="center">76.86</th>
  </tr>
  <tr>
    <th class="tg-0pky" align="center"><a href="https://arxiv.org/pdf/2307.09481.pdf">AnyDoor</a></th>
    <th class="tg-0pky" align="center">89.7</th>
    <th class="tg-0pky" align="center">70.16</th>
    <th class="tg-0pky" align="center">10.5</th>
    <th class="tg-0pky" align="center">0.870</th>
    <th class="tg-0pky" align="center">0.109</th>
    <th class="tg-0pky" align="center">3.60</th>
    <th class="tg-0pky" align="center">76.18</th>
  </tr>
  <tr>
    <th class="tg-0pky" align="center"><a href="https://arxiv.org/pdf/2308.10040.pdf">ControlCom</a></th>
    <th class="tg-0pky" align="center">88.31</th>
    <th class="tg-0pky" align="center">63.67</th>
    <th class="tg-0pky" align="center">6.28</th>
    <th class="tg-0pky" align="center">0.826</th>
    <th class="tg-0pky" align="center">0.114</th>
    <th class="tg-0pky" align="center">3.19</th>
    <th class="tg-0pky" align="center">77.84</th>
  </tr>
</table>
Evaluating Your Results
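As a rough illustration of what the foreground CLIP and DINO columns of the leaderboard measure: scores of this kind are commonly computed as the cosine similarity between a feature embedding of the reference foreground and one of the generated foreground. The sketch below assumes that definition and a scaling by 100 to match the magnitude of the leaderboard numbers. Neither assumption is confirmed by this repository, and the function name `cosine_score` is hypothetical. The embeddings are plain lists of floats here; in practice they would come from the CLIP or DINO checkpoints.

```python
import math
from typing import Sequence


def cosine_score(ref_embedding: Sequence[float],
                 gen_embedding: Sequence[float],
                 scale: float = 100.0) -> float:
    """Scaled cosine similarity between two embeddings of equal length.

    Assumption: the x100 scaling only mirrors the magnitude of the
    leaderboard numbers; the evaluation script may do this differently.
    """
    if len(ref_embedding) != len(gen_embedding):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(ref_embedding, gen_embedding))
    norm_ref = math.sqrt(sum(a * a for a in ref_embedding))
    norm_gen = math.sqrt(sum(b * b for b in gen_embedding))
    if norm_ref == 0.0 or norm_gen == 0.0:
        raise ValueError("embeddings must be non-zero")
    return scale * dot / (norm_ref * norm_gen)


if __name__ == "__main__":
    # Toy vectors standing in for real CLIP/DINO features.
    print(cosine_score([0.2, 0.9, 0.1], [0.25, 0.8, 0.05]))
```

A higher value means the generated foreground is closer to the reference in the chosen feature space, which is why these columns are marked with an upward arrow.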
- Install Dependencies:
  - Begin by installing the dependencies listed in requirements.txt.
  - Additionally, install Segment Anything.
- Clone Repository and Download Pretrained Models:
  - Clone this repository and ensure you have a checkpoints folder.
  - Download the following pretrained models into the checkpoints folder:
    - openai/clip-vit-base-patch32: Used for CLIP score and FID score calculations.
    - ViT-H SAM model: Utilized to estimate foreground masks for reference images and generated composites.
    - facebook/dino-vits16: Employed in DINO score computation.
    - coco2017_gmm_k20: Utilized to compute the overall quality score.

    The resulting folder structure should resemble the following:

        checkpoints/
        ├── clip-vit-base-patch32
        ├── coco2017_gmm_k20
        ├── dino-vits16
        └── sam_vit_h_4b8939.pth

- Prepare COCOEE Benchmark and Your Results:
  - Prepare the COCOEE benchmark alongside your generated composite results. Ensure that your composite images have filenames corresponding to the background images of the COCOEE dataset, as illustrated below:

        results/
        ......
        ├── 000002228519_GT.png
        ├── 000002231413_GT.png
        ├── 900100065455_GT.png
        └── 900100376112_GT.png

  - Modify the paths accordingly in the run.sh file. If you have downloaded the cache file mentioned earlier, please ignore cocodir.
  - Execute the following command:

        sh run.sh
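Before launching the evaluation, it can help to confirm that the checkpoints folder and the results folder look as described in the steps above. The following is a minimal sketch of such a pre-flight check, not part of this repository. The helper names `missing_checkpoints` and `missing_results` are hypothetical, and the `*_GT.png` pattern is an assumption taken from the example filenames listed above.

```python
from pathlib import Path
from typing import List, Union

# Entries the checkpoints folder is expected to contain, per the steps above.
EXPECTED_CHECKPOINTS = (
    "clip-vit-base-patch32",
    "coco2017_gmm_k20",
    "dino-vits16",
    "sam_vit_h_4b8939.pth",
)


def missing_checkpoints(checkpoints_dir: Union[str, Path]) -> List[str]:
    """Return the expected checkpoint entries that are absent."""
    root = Path(checkpoints_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).exists()]


def missing_results(background_dir: Union[str, Path],
                    results_dir: Union[str, Path],
                    pattern: str = "*_GT.png") -> List[str]:
    """Return background filenames that have no same-named composite.

    Assumption: backgrounds and composites share the exact filename,
    matching the `*_GT.png` examples shown in the results listing.
    """
    expected = {p.name for p in Path(background_dir).glob(pattern)}
    produced = {p.name for p in Path(results_dir).glob(pattern)}
    return sorted(expected - produced)


if __name__ == "__main__":
    # Paths are placeholders; adjust them to your local layout.
    print("Missing checkpoints:", missing_checkpoints("checkpoints"))
    print("Missing composites:", missing_results("path/to/backgrounds", "results"))
```

If either list is non-empty, fix the folder contents first; the paths in run.sh still need to be edited separately.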
Papers
(Object+Text)-to-Object
- Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C.K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou: "DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models." arXiv preprint arXiv:2312.03771 (2023). [arXiv]
- Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang: "Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance." arXiv preprint arXiv:2403.19534 (2024). [arXiv] [code]
Object-to-Object
- Zitian Zhang, Frederic Fortier-Chouinard, Mathieu Garon, Anand Bhattad, Jean-Francois Lalonde: "ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion." arXiv preprint arXiv:2410.08168 (2024). [arXiv]
- "Thinking Outside the BBox: Unconstrained Generative Object Compositing." arXiv preprint arXiv:2409.04559 (2024). [arXiv]
- Weijing Tao, Xiaofeng Yang, Biwen Lei, Miaomiao Cui, Xuansong Xie, Guosheng Lin: "MotionCom: Automatic and Motion-Aware Image Composition with LLM and Video Diffusion Prior." arXiv preprint arXiv:2409.10090 (2024). [arXiv] [code]
- Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Daniel Aliaga: "IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation." CVPR (2024) [arXiv]
- Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, Hengshuang Zhao: "AnyDoor: Zero-shot Object-level Image Customization." CVPR (2024) [arXiv] [code] [demo]
- Vishnu Sarukkai, Linden Li, Arden Ma, Christopher Re, Kayvon Fatahalian: "Collage Diffusion." WACV (2024) [pdf] [code]
- Ziyang Yuan, Mingdeng Cao, Xintao Wang, Zhongang Qi, Chun Yuan, Ying Shan: "CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models." ACM MM (2024) [arXiv] [code] [demo]
- Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, Li Niu: "ControlCom: Controllable Image Composition using Diffusion Model." arXiv preprint arXiv:2308.10040 (2023) [arXiv] [code] [demo]
- Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, Yusuke Iwasawa: "Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model." arXiv preprint arXiv:2306.07596 (2023) [arXiv] [code]
- Roy Hachnochi, Mingrui Zhao, Nadav Orzech, Rinon Gal, Ali Mahdavi-Amiri, Daniel Cohen-Or, Amit Haim Bermano: "Cross-domain Compositing with Pretrained Diffusion Models." arXiv preprint arXiv:2302.10167 (2023) [arXiv] [code]
- Shilin Lu, Yanzhu Liu, Adams Wai-Kin Kong: "TF-ICON: Diffusion-based Training-free Cross-domain Image Composition." ICCV (2023) [pdf] [code]
- Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen: "Paint by Example: Exemplar-based Image Editing with Diffusion Models." CVPR (2023) [arXiv] [code] [demo]
- Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, Daniel Aliaga: "ObjectStitch: Generative Object Compositing." CVPR (2023) [arXiv] [code]
- Sumith Kulal, Tim Brooks, Alex Aiken, Jiajun Wu, Jimei Yang, Jingwan Lu, Alexei A. Efros, Krishna Kumar Singh: "Putting People in Their Place: Affordance-Aware Human Insertion into Scenes." CVPR (2023) [paper] [code]
Token-to-Object
- Lingxiao Lu, Bo Zhang, Li Niu: "DreamCom: Finetuning Text-guided Inpainting Model for Image Composition." arXiv preprint arXiv:2309.15508 (2023) [arXiv] [code]
- Tianle Li, Max Ku, Cong Wei, Wenhu Chen: "DreamEdit: Subject-driven Image Editing." TMLR (2023) [arXiv] [code]
Related Topics
Foreground: 3D; Background: image
- Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht: "Scene-Conditional 3D Object Stylization and Composition." arXiv preprint arXiv:2312.12419 (2023) [arXiv] [code]
Foreground: 3D; Background: 3D
- Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, Federico Tombari: "InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes." arXiv preprint arXiv:2401.05335 (2024) [arXiv]
- Rahul Goel, Dhawal Sirikonda, Saurabh Saini, PJ Narayanan: "Interactive Segmentation of Radiance Fields." CVPR (2023) [arXiv] [code]
- Rahul Goel, Dhawal Sirikonda, Rajvi Shah, PJ Narayanan: "FusedRF: Fusing Multiple Radiance Fields." CVPR Workshop (2023) [arXiv]
- Verica Lazova, Vladimir Guzov, Kyle Olszewski, Sergey Tulyakov, Gerard Pons-Moll: "Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation." WACV (2023) [arXiv]
- Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng: "Compressible-composable NeRF via Rank-residual Decomposition." NeurIPS (2022) [arXiv] [code]
- Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, Zhaopeng Cui: "Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering." ICCV (2021) [arXiv] [code]
Foreground: video; Background: image
- Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang: "ActAnywhere: Subject-Aware Video Background Generation." arXiv preprint arXiv:2401.10822 (2024) [arXiv]
Foreground: video; Background: video
- Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song: "Training-Free Semantic Video Composition via Pre-trained Diffusion Model." arXiv preprint arXiv:2401.09195 (2024) [arXiv]
- Donghoon Lee, Tomas Pfister, Ming-Hsuan Yang: "Inserting Videos into Videos." CVPR (2019) [pdf]