<!-- !# <p align=center> Multimodal Image Synthesis and Editing: <br> A Survey and Taxonomy</p> --> <img src='title.png' align="center"> <br>

arXiv Survey Maintenance PRs Welcome GitHub license

<!-- [![made-with-Markdown](https://img.shields.io/badge/Made%20with-Markdown-1f425f.svg)](http://commonmark.org) --> <!-- [![Documentation Status](https://readthedocs.org/projects/ansicolortags/badge/?version=latest)](http://ansicolortags.readthedocs.io/?badge=latest) --> <img src='teaser.gif' align="center">

This project is associated with our survey paper, which comprehensively contextualizes advances in Multimodal Image Synthesis and Editing (MISE) and visual AIGC by formulating taxonomies according to data modalities and model architectures.

<img src='logo.png' align="center" width=20> Multimodal Image Synthesis and Editing: The Generative AI Era [Paper] [Project] <br> Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, Lingjie Liu, Adam Kortylewski, <br> Christian Theobalt, Eric Xing <br> IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

<!---[DeepAI](https://deepai.org/publication/multimodal-image-synthesis-and-editing-a-survey).**--> <br>

PRs Welcome You are welcome to contribute papers via pull request. <br> The process to submit a pull request:

**Title**<br>
*Author*<br>
Conference
[[Paper](Paper link)]
[[Code](Code link)]
[[Project](Project link)]
<br>

Related Surveys & Projects

Adversarial Text-to-Image Synthesis: A Review<br> Stanislav Frolov, Tobias Hinz, Federico Raue, Jörn Hees, Andreas Dengel<br> Neural Networks 2021 [Paper]

GAN Inversion: A Survey<br> Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, Ming-Hsuan Yang<br> TPAMI 2022 [Paper] [Project]

Deep Image Synthesis from Intuitive User Input: A Review and Perspectives<br> Yuan Xue, Yuan-Chen Guo, Han Zhang, Tao Xu, Song-Hai Zhang, Xiaolei Huang<br> Computational Visual Media 2022 [Paper]

Awesome-Text-to-Image

<br>

Table of Contents (Work in Progress)

Methods:


Modalities & Datasets:

Neural-Rendering-Methods

ATT3D: Amortized Text-to-3D Object Synthesis<br> Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, James Lucas<br> arxiv 2023 [Paper]

TADA! Text to Animatable Digital Avatars<br> Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxiang Tang, Yangyi Huang, Justus Thies, Michael J. Black<br> arxiv 2023 [Paper]

MATLABER: Material-Aware Text-to-3D via LAtent BRDF auto-EncodeR<br> Xudong Xu, Zhaoyang Lyu, Xingang Pan, Bo Dai<br> arxiv 2023 [Paper]

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis<br> Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin<br> arxiv 2023 [Paper]

AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose<br> Huichao Zhang, Bowen Chen, Hao Yang, Liao Qu, Xu Wang, Li Chen, Chao Long, Feida Zhu, Kang Du, Min Zheng<br> arxiv 2023 [Paper] [Project]

Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions<br> Ayaan Haque, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, Angjoo Kanazawa<br> ICCV 2023 [Paper] [Project] [Code]

FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields<br> Sungwon Hwang, Junha Hyung, Daejin Kim, Min-Jung Kim, Jaegul Choo<br> ICCV 2023 [Paper]

Local 3D Editing via 3D Distillation of CLIP Knowledge<br> Junha Hyung, Sungwon Hwang, Daejin Kim, Hyunji Lee, Jaegul Choo<br> CVPR 2023 [Paper]

RePaint-NeRF: NeRF Editting via Semantic Masks and Diffusion Models<br> Xingchen Zhou, Ying He, F. Richard Yu, Jianqiang Li, You Li<br> IJCAI 2023 [Paper]

DreamTime: An Improved Optimization Strategy for Text-to-3D Content Creation<br> Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, Lei Zhang<br> arxiv 2023 [Paper] [Project]

AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars<br> Mohit Mendiratta, Xingang Pan, Mohamed Elgharib, Kartik Teotia, Mallikarjun B R, Ayush Tewari, Vladislav Golyanik, Adam Kortylewski, Christian Theobalt<br> arxiv 2023 [Paper] [Project]

Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields<br> Ori Gordon, Omri Avrahami, Dani Lischinski<br> arxiv 2023 [Paper] [Project]

OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields<br> Youtan Yin, Zhoujie Fu, Fan Yang, Guosheng Lin<br> arxiv 2023 [Paper] [Project] [Code]

HiFA: High-fidelity Text-to-3D with Advanced Diffusion Guidance<br> Junzhe Zhu, Peiye Zhuang<br> arxiv 2023 [Paper] [Project]

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation<br> Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu<br> arxiv 2023 [Paper] [Project]

Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields<br> Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, Jing Liao<br> arxiv 2023 [Paper] [Project]

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models<br> Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong<br> arxiv 2023 [Paper] [Project]

DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model<br> Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun<br> arxiv 2023 [Paper] [Project] [Code]

CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout<br> Yiqi Lin, Haotian Bai, Sijia Li, Haonan Lu, Xiaodong Lin, Hui Xiong, Lin Wang<br> arxiv 2023 [Paper]

Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes<br> Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes, Daniel Cohen-Or<br> arxiv 2023 [Paper] [Project] [Code]

Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation<br> Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, Seungryong Kim<br> arxiv 2023 [Paper] [Project] [Code]

Text-To-4D Dynamic Scene Generation<br> Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman<br> arxiv 2023 [Paper] [Project]

Magic3D: High-Resolution Text-to-3D Content Creation<br> Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, Tsung-Yi Lin<br> CVPR 2023 [Paper] [Project]

DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model<br> Gwanghyun Kim, Se Young Chun<br> CVPR 2023 [Paper] [Code] [Project]

Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models<br> Gang Li, Heliang Zheng, Chaoyue Wang, Chang Li, Changwen Zheng, Dacheng Tao<br> arxiv 2022 [Paper] [Project]

DreamFusion: Text-to-3D using 2D Diffusion<br> Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall<br> arxiv 2022 [Paper] [Project]

Zero-Shot Text-Guided Object Generation with Dream Fields<br> Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, Ben Poole<br> CVPR 2022 [Paper] [Code] [Project]

IDE-3D: Interactive Disentangled Editing for High-Resolution 3D-aware Portrait Synthesis<br> Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, Yebin Liu<br> SIGGRAPH Asia 2022 [Paper] [Code] [Project]

Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields<br> Yuedong Chen, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai<br> arxiv 2022 [Paper] [Code] [Project]

CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields<br> Can Wang, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao<br> CVPR 2022 [Paper] [Code] [Project]

CG-NeRF: Conditional Generative Neural Radiance Fields<br> Kyungmin Jo, Gyumin Shim, Sanghun Jung, Soyoung Yang, Jaegul Choo<br> arxiv 2021 [Paper]

AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis<br> Yudong Guo, Keyu Chen, Sen Liang, Yong-Jin Liu, Hujun Bao, Juyong Zhang<br> ICCV 2021 [Paper] [Code] [Project] [Video]

<br>

Diffusion-based-Methods

BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing<br> Dongxu Li, Junnan Li, Steven C.H. Hoi<br> arxiv 2023 [Paper] [Project] [Code]

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions<br> Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka<br> arxiv 2023 [Paper] [Project] [Code]

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation<br> Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman<br> CVPR 2023 [Paper] [Project] [Code]

Multi-Concept Customization of Text-to-Image Diffusion<br> Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, Jun-Yan Zhu<br> CVPR 2023 [Paper] [Project] [Code]

Collaborative Diffusion for Multi-Modal Face Generation and Editing<br> Ziqi Huang, Kelvin C.K. Chan, Yuming Jiang, Ziwei Liu<br> CVPR 2023 [Paper] [Project] [Code]

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation<br> Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel<br> CVPR 2023 [Paper] [Project] [Code]

SINE: SINgle Image Editing with Text-to-Image Diffusion Models<br> Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris Metaxas, Jian Ren<br> CVPR 2023 [Paper] [Project] [Code]

NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models<br> Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or<br> CVPR 2023 [Paper] [Project] [Code]

Paint by Example: Exemplar-Based Image Editing With Diffusion Models<br> Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen<br> CVPR 2023 [Paper] [Demo] [Code]

SpaText: Spatio-Textual Representation for Controllable Image Generation<br> Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, Xi Yin<br> CVPR 2023 [Paper] [Project]

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models<br> Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis<br> CVPR 2023 [Paper] [Project]

InstructPix2Pix: Learning to Follow Image Editing Instructions<br> Tim Brooks, Aleksander Holynski, Alexei A. Efros<br> CVPR 2023 [Paper] [Project] [Code]

Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models<br> Nithin Gopalakrishnan Nair, Chaminda Bandara, Vishal M Patel<br> CVPR 2023 [Paper] [Project] [Code]

DiffEdit: Diffusion-based semantic image editing with mask guidance<br> Guillaume Couairon, Jakob Verbeek, Holger Schwenk, Matthieu Cord<br> ICLR 2023 [Paper]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers<br> Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu<br> arxiv 2022 [Paper] [Project]

Prompt-to-Prompt Image Editing with Cross-Attention Control<br> Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or<br> arxiv 2022 [Paper] [Project] [Code]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion<br> Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or<br> arxiv 2022 [Paper] [Project] [Code]

Text2Human: Text-Driven Controllable Human Image Generation<br> Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, Ziwei Liu<br> SIGGRAPH 2022 [Paper] [Project] [Code]

[DALL-E 2] Hierarchical Text-Conditional Image Generation with CLIP Latents<br> Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen<br> arxiv 2022 [Paper] [Code]

High-Resolution Image Synthesis with Latent Diffusion Models<br> Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer<br> CVPR 2022 [Paper] [Code]

v objective diffusion<br> Katherine Crowson<br> [Code]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models<br> Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen<br> arxiv 2021 [Paper] [Code]

Vector Quantized Diffusion Model for Text-to-Image Synthesis<br> Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo<br> arxiv 2021 [Paper] [Code]

DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation<br> Gwanghyun Kim, Jong Chul Ye<br> arxiv 2021 [Paper]

Blended Diffusion for Text-driven Editing of Natural Images<br> Omri Avrahami, Dani Lischinski, Ohad Fried<br> CVPR 2022 [Paper] [Project] [Code]

<br>

Autoregressive-Methods

MaskGIT: Masked Generative Image Transformer<br> Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman<br> arxiv 2022 [Paper]

<!-- [[Project](https://wenxin.baidu.com/wenxin/ernie-vilg)] -->

ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation<br> Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang<br> arxiv 2021 [Paper] [Project]

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion<br> Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan<br> arxiv 2021 [Paper] [Code] [Video]

L-Verse: Bidirectional Generation Between Image and Text<br> Taehoon Kim, Gwangmo Song, Sihaeng Lee, Sangyun Kim, Yewon Seo, Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae<br> arxiv 2021 [Paper] [Code]

<!-- [[Video](https://youtu.be/C9CTnZJ9ZE0)] -->

M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis<br> Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie Tang, Jingren Zhou, Hongxia Yang<br> NeurIPS 2021 [Paper]

<!-- [[Project](https://compvis.github.io/imagebart/)] -->

ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis<br> Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer<br> NeurIPS 2021 [Paper] [Code] [Project]

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation<br> Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu<br> ACM MM 2021 [Paper] [Code]

Unifying Multimodal Transformer for Bi-directional Image and Text Generation<br> Yupan Huang, Hongwei Xue, Bei Liu, Yutong Lu<br> ACM MM 2021 [Paper] [Code]

Taming Transformers for High-Resolution Image Synthesis<br> Patrick Esser, Robin Rombach, Björn Ommer<br> CVPR 2021 [Paper] [Code] [Project]

RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP<br> Alex Shonenkov and Michael Konstantinov<br> arxiv 2022 [Code]

Generate Images from Texts in Russian (ruDALL-E)<br> [Code] [Project]

Zero-Shot Text-to-Image Generation<br> Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever<br> arxiv 2021 [Paper] [Code] [Project]

Compositional Transformers for Scene Generation<br> Drew A. Hudson, C. Lawrence Zitnick<br> NeurIPS 2021 [Paper] [Code]

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers<br> Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi<br> EMNLP 2020 [Paper] [Code]

One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning<br> Suzhen Wang, Lincheng Li, Yu Ding, Xin Yu<br> AAAI 2022 [Paper]

<br>

Image-Quantizer

[TE-VQGAN] Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation<br> Woncheol Shin, Gyubok Lee, Jiyoung Lee, Joonseok Lee, Edward Choi<br> arxiv 2021 [Paper] [Code]

[ViT-VQGAN] Vector-quantized Image Modeling with Improved VQGAN<br> Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu<br> arxiv 2021 [Paper]

<!-- [[Code](https://github.com/CompVis/taming-transformers)] -->

[PeCo] PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers<br> Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu<br> arxiv 2021 [Paper]

<!-- [[Code](https://github.com/CompVis/taming-transformers)] -->

[VQ-GAN] Taming Transformers for High-Resolution Image Synthesis<br> Patrick Esser, Robin Rombach, Björn Ommer<br> CVPR 2021 [Paper] [Code]

[Gumbel-VQ] vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations<br> Alexei Baevski, Steffen Schneider, Michael Auli<br> ICLR 2020 [Paper] [Code]

[EM VQ-VAE] Theory and Experiments on Vector Quantized Autoencoders<br> Aurko Roy, Ashish Vaswani, Arvind Neelakantan, Niki Parmar<br> arxiv 2018 [Paper] [Code]

[VQ-VAE] Neural Discrete Representation Learning<br> Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu<br> NIPS 2017 [Paper] [Code]

[VQ-VAE2 or EMA-VQ] Generating Diverse High-Fidelity Images with VQ-VAE-2<br> Ali Razavi, Aaron van den Oord, Oriol Vinyals<br> NeurIPS 2019 [Paper] [Code]

[Discrete VAE] Discrete Variational Autoencoders<br> Jason Tyler Rolfe<br> ICLR 2017 [Paper] [Code]

[DVAE++] DVAE++: Discrete Variational Autoencoders with Overlapping Transformations<br> Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, Evgeny Andriyash<br> ICML 2018 [Paper] [Code]

[DVAE#] DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors<br> Arash Vahdat, Evgeny Andriyash, William G. Macready<br> NeurIPS 2018 [Paper] [Code]

<br>

GAN-based-Methods

GauGAN2<br> NVIDIA<br> [Project] [Video]

Multimodal Conditional Image Synthesis with Product-of-Experts GANs<br> Xun Huang, Arun Mallya, Ting-Chun Wang, Ming-Yu Liu<br> arxiv 2021 [Paper]

RiFeGAN2: Rich Feature Generation for Text-to-Image Synthesis from Constrained Prior Knowledge<br> Jun Cheng, Fuxiang Wu, Yanling Tian, Lei Wang, Dapeng Tao<br> TCSVT 2021 [Paper]

TRGAN: Text to Image Generation Through Optimizing Initial Image<br> Liang Zhao, Xinwei Li, Pingda Huang, Zhikui Chen, Yanqi Dai, Tianyu Li<br> ICONIP 2021 [Paper]

<!-- **Image Synthesis From Layout With Locality-Aware Mask Adaption [Layout2Image]**<br> *Zejian Li, Jingyu Wu, Immanuel Koh, Yongchuan Tang, Lingyun Sun*<br> GCPR 2021 [[Paper](https://arxiv.org/pdf/2103.13722.pdf)] [[Code](https://github.com/stanifrolov/AttrLostGAN)] **AttrLostGAN: Attribute Controlled Image Synthesis from Reconfigurable Layout and Style [Layout2Image]**<br> *Stanislav Frolov, Avneesh Sharma, Jörn Hees, Tushar Karayil, Federico Raue, Andreas Dengel*<br> ICCV 2021 [[Paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Li_Image_Synthesis_From_Layout_With_Locality-Aware_Mask_Adaption_ICCV_2021_paper.pdf)] -->

Audio-Driven Emotional Video Portraits [Audio2Image]<br> Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu<br> CVPR 2021 [Paper] [Code] [Project]

SketchyCOCO: Image Generation from Freehand Scene Sketches<br> Chengying Gao, Qi Liu, Qi Xu, Limin Wang, Jianzhuang Liu, Changqing Zou<br> CVPR 2020 [Paper] [Code] [Project]

Direct Speech-to-Image Translation [Audio2Image]<br> Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao<br> JSTSP 2020 [Paper] [Code] [Project]

MirrorGAN: Learning Text-to-image Generation by Redescription [Text2Image]<br> Tingting Qiao, Jing Zhang, Duanqing Xu, Dacheng Tao<br> CVPR 2019 [Paper] [Code]

AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks [Text2Image]<br> Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He<br> CVPR 2018 [Paper] [Code]

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space<br> Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, Jason Yosinski<br> CVPR 2017 [Paper] [Code]

StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks [Text2Image]<br> Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas<br> TPAMI 2018 [Paper] [Code]

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks [Text2Image]<br> Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas<br> ICCV 2017 [Paper] [Code]

<br>

GAN-Inversion-Methods

Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold<br> Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, Christian Theobalt<br> SIGGRAPH 2023 [Paper] [Code]

HairCLIP: Design Your Hair by Text and Reference Image<br> Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang, Nenghai Yu<br> arxiv 2021 [Paper] [Code]

FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization<br> Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, Qiang Liu<br> arxiv 2021 [Paper] [Code]

StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation<br> Umut Kocasari, Alara Dirik, Mert Tiftikci, Pinar Yanardag<br> WACV 2022 [Paper] [Code] [Project]

Cycle-Consistent Inverse GAN for Text-to-Image Synthesis<br> Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao<br> ACM MM 2021 [Paper]

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery<br> Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski<br> ICCV 2021 [Paper] [Code] [Video]

Talk-to-Edit: Fine-Grained Facial Editing via Dialog<br> Yuming Jiang, Ziqi Huang, Xingang Pan, Chen Change Loy, Ziwei Liu<br> ICCV 2021 [Paper] [Code] [Project]

TediGAN: Text-Guided Diverse Face Image Generation and Manipulation<br> Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu<br> CVPR 2021 [Paper] [Code] [Video]

Paint by Word<br> David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, Antonio Torralba<br> arxiv 2021 [Paper]

<br>

Other-Methods

Language-Driven Image Style Transfer<br> Tsu-Jui Fu, Xin Eric Wang, William Yang Wang<br> arxiv 2021 [Paper]

CLIPstyler: Image Style Transfer with a Single Text Condition<br> Gihyun Kwon, Jong Chul Ye<br> arxiv 2021 [Paper] [Code]

Wakey-Wakey: Animate Text by Mimicking Characters in a GIF<br> Liwenhan Xie, Zhaoyu Zhou, Kerun Yu, Yun Wang, Huamin Qu, Siming Chen<br> UIST 2023 [Paper] [Code] [Project]

<br> <br>

Text-Encoding

FLAVA: A Foundational Language And Vision Alignment Model<br> Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela<br> arxiv 2021 [Paper]

<!-- [[Code](https://github.com/paper11667/CLIPstyler)] -->

Learning Transferable Visual Models From Natural Language Supervision (CLIP)<br> Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever<br> arxiv 2021 [Paper] [Code]

<br>

Audio-Encoding

Wav2CLIP: Learning Robust Audio Representations From CLIP (Wav2CLIP)<br> Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello<br> ICASSP 2022 [Paper] [Code]

Datasets

Multimodal CelebA-HQ (https://github.com/IIGROUP/MM-CelebA-HQ-Dataset)

DeepFashion MultiModal (https://github.com/yumingj/DeepFashion-MultiModal)

Citation

If you find this project useful for your research, please cite our paper.

@article{zhan2023mise,
  title={Multimodal Image Synthesis and Editing: The Generative AI Era},
  author={Zhan, Fangneng and Yu, Yingchen and Wu, Rongliang and Zhang, Jiahui and Lu, Shijian and Liu, Lingjie and Kortylewski, Adam and Theobalt, Christian and Xing, Eric},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2023},
  publisher={IEEE}
}