
<p align="center"> <h1 align="center">A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models</h1> <p align="center"> <!-- arXiv, 2024 --> <!-- <br /> --> <a href="https://github.com/xinchengshuai"><strong>Xincheng Shuai</strong></a> · <a href="https://henghuiding.github.io/"><strong>Henghui Ding</strong></a> · <a href="http://xingjunma.com/"><strong>Xingjun Ma</strong></a> · <a href="https://rongchengtu1.github.io/"><strong>Rongcheng Tu</strong></a> · <a href="https://scholar.google.com/citations?user=f3_FP8AAAAAJ&hl=en"><strong>Yu-Gang Jiang</strong></a> · <a href="https://scholar.google.com/citations?user=RwlJNLcAAAAJ"><strong>Dacheng Tao</strong></a> · </p> <p align="center"> <a href='https://arxiv.org/abs/2406.14555'> <img src='https://img.shields.io/badge/Paper-PDF-green?style=flat&logo=arXiv&' alt='arXiv PDF'> </a> <!-- <a href='' style='padding-left: 0.5rem;'> <img src='https://img.shields.io/badge/Project-Page-blue?style=flat&logo=Google%20chrome&logoColor=blue' alt='S-Lab Project Page'> </a> --> </p> <br />

This repo records and tracks recent multimodal-guided image editing methods built on T2I diffusion models, as a supplement to our survey.
If you find any work missing or have any suggestions, feel free to open a pull request. We will add the missing papers to this repo ASAP.

<!-- We categorize the reviewed papers by their editing scenario, and illustrate their inversion and editing algorithms. -->

🔥 News

[1] We have uploaded our evaluation dataset!!

🔥 Highlights

[1] Two concurrent works (Huang et al., Cao et al.) are related to our survey. Huang et al. review the application of diffusion models to image editing, while Cao et al. focus on controllable image generation. Compared to the review by Huang et al. and other earlier literature, we investigate image editing in a more general context: our discussion extends beyond low-level semantics and encompasses the customization tasks that align with our topic. We integrate existing general editing methods into a unified framework and, through qualitative and quantitative analyses, provide users with a design space.

[2] In this repo, we organize the reviewed methods by editing task and present their inversion and editing algorithms along with their guidance sets (a schematic reading of these combinations is sketched after these highlights). Note that many of these studies employ multiple editing algorithms simultaneously; for simplicity, we currently indicate only the primary technique each method uses.

[3] We hope our work will assist researchers in exploring novel combinations within our framework, thereby enhancing performance in challenging scenarios.
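
Concretely, every entry in the tables below composes one inversion algorithm with one editing algorithm. A minimal schematic of this composition, in our own illustrative shorthand (the symbols $x_{src}$, $x_{tgt}$, $g_{src}$, $g_{tgt}$, and $z$ are not notation fixed by the paper):

$$z = F_{inv}(x_{src},\ g_{src}), \qquad x_{tgt} = F_{edit}(z,\ g_{tgt})$$

$$F_{inv} \in \lbrace F_{inv}^{T},\ F_{inv}^{F} \rbrace, \qquad F_{edit} \in \lbrace F_{edit}^{Norm},\ F_{edit}^{Attn},\ F_{edit}^{Blend},\ F_{edit}^{Score},\ F_{edit}^{Optim} \rbrace$$

Here $x_{src}$ is the source image, $g_{src}$ and $g_{tgt}$ are the source and target guidance (text, instruction, mask, image, or user interface), and $z$ is the inverted latent or learned embedding.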

Editing Tasks Discussed in Our Survey

<p align="center"> <img src="./images/editing_task.jpg" alt="image" style="width:800px;"> </p>

Unified Framework

<p align="center"> <img src="./images/unified_framework.jpg" alt="image" style="width:800px;"> </p>

Notation

The combinations in the tables below use the following shorthand (brief glosses; see the survey for the formal definitions):

Inversion Algorithm:

- $F_{inv}^T$: tuning-based inversion, where the model weights and/or text embeddings are fine-tuned on the source image.
- $F_{inv}^F$: forward-based inversion, where the source image is inverted along the diffusion forward trajectory (e.g., DDIM/DDPM inversion).

Editing Algorithm:

- $F_{edit}^{Norm}$: normal denoising generation under the target guidance.
- $F_{edit}^{Attn}$: attention-based editing (injecting or modulating cross-/self-attention features).
- $F_{edit}^{Blend}$: blending-based editing (mixing source and target latents or features, often with a mask).
- $F_{edit}^{Score}$: score-based editing (adding guidance terms to the predicted score/noise).
- $F_{edit}^{Optim}$: optimization-based editing (optimizing latents or embeddings against an editing objective, e.g., score distillation).
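
As a worked example of reading a table entry (our interpretation, assuming the glosses above): a method tabulated as $F_{inv}^F+F_{edit}^{Attn}$, e.g., Prompt-to-Prompt applied to real images, first inverts the source image along the forward trajectory and then denoises with attention control under the target prompt:

$$z_T = F_{inv}^{F}(x_{src},\ c_{src}), \qquad x_{tgt} = F_{edit}^{Attn}(z_T,\ c_{tgt})$$

where $z_T$ is the recovered noise latent and $c_{src}$, $c_{tgt}$ are the source and target prompts (symbols illustrative).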

Table of contents

- Content-Aware Editing
  - Object and Attribute Manipulation
  - Attribute Manipulation
  - Spatial Transformation
  - Inpainting
  - Style Change
  - Image Translation
- Content-Free Editing
  - Subject-Driven Customization
  - Attribute-Driven Customization
- Experiment and Data

Object and Attribute Manipulation

1. Training-Free Approaches

| Publication | Paper Title | Guidance Set | Combination | Code/Project |
| :--- | :--- | :--- | :--- | :--- |
| TOG 2023 | UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| CVPR 2024 | Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation | instruction | $F_{inv}^T+F_{edit}^{Attn}$ | Code |
| CVPR 2023 | Imagic: Text-Based Real Image Editing with Diffusion Models | text | $F_{inv}^T+F_{edit}^{Blend}$ | Code |
| arXiv 2023 | Forgedit: Text Guided Image Editing via Learning and Forgetting | text | $F_{inv}^T+F_{edit}^{Blend}$ | Code |
| CVPR 2024 | Doubly Abductive Counterfactual Inference for Text-based Image Editing | text | $F_{inv}^T+F_{edit}^{Blend}$ | Code |
| CVPR 2024 | ZONE: Zero-Shot Instruction-Guided Local Editing | instruction | $F_{inv}^T+F_{edit}^{Blend}$ | Code |
| CVPR 2023 | SINE: SINgle Image Editing with Text-to-Image Diffusion Models | text | $F_{inv}^T+F_{edit}^{Score}$ | Code |
| CVPR 2023 | EDICT: Exact Diffusion Inversion via Coupled Transformations | text | $F_{inv}^F+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | Exact Diffusion Inversion via Bi-directional Integration Approximation | text | $F_{inv}^F+F_{edit}^{Norm}$ | Code |
| CVPR 2023 | Null-text Inversion for Editing Real Images using Guided Diffusion Models | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| arXiv 2023 | Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| arXiv 2023 | Fixed-point Inversion for Text-to-Image Diffusion Models | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| NeurIPS 2023 | Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| ICLR 2023 | Prompt-to-Prompt Image Editing with Cross-Attention Control | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| CVPR 2023 | Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| arXiv 2023 | StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| WACV 2024 | ProxEdit: Improving Tuning-Free Real Image Editing with Proximal Guidance | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| ICLR 2024 | PnP Inversion: Boosting Diffusion-based Editing with 3 Lines of Code | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| CVPR 2024 | An Edit Friendly DDPM Noise Space: Inversion and Manipulations | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| CVPR 2024 | Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| ICCV 2023 | Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models | text | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| ICLR 2023 | DiffEdit: Diffusion-based semantic image editing with mask guidance | text | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| arXiv 2023 | PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing | text | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| CVPR 2023 | Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models | text | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| ICLR 2024 | Object-aware Inversion and Reassembly for Image Editing | text | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| arXiv 2022 | The Stable Artist: Steering Semantics in Diffusion Latent Space | text | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| SIGGRAPH 2023 | Zero-shot Image-to-Image Translation | text | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| NeurIPS 2023 | SEGA: Instructing Diffusion using Semantic Dimensions | text | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| ICCV 2023 | Effective Real Image Editing with Accelerated Iterative Diffusion Inversion | text | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| arXiv 2023 | LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance | text | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| ICLR 2024 | Noise Map Guidance: Inversion with Spatial Context for Real Image Editing | text | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| CVPR 2024 | LEDITS++: Limitless Image Editing using Text-to-Image Models | text | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| ICLR 2024 | MagicRemover: Tuning-free Text-guided Image Inpainting with Diffusion Models | text | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| arXiv 2023 | Region-Aware Diffusion for Zero-shot Text-driven Image Editing | text | $F_{inv}^F+F_{edit}^{Optim}$ | Code |
| ICCV 2023 | Delta Denoising Score | text | $F_{inv}^F+F_{edit}^{Optim}$ | Code |
| CVPR 2024 | Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing | text | $F_{inv}^F+F_{edit}^{Optim}$ | Code |
| arXiv 2024 | Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing | text + mask | $F_{inv}^F+F_{edit}^{Optim}$ | Code |
| NeurIPS 2024 | Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models | text | $F_{inv}^F+F_{edit}^{Optim}$ | Code |
| CVPR 2023 | Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models | text + image | $F_{inv}^T+F_{inv}^F+F_{edit}^{Attn}$ | Code |
| NeurIPS 2023 | Photoswap: Personalized Subject Swapping in Images | text + image | $F_{inv}^T+F_{inv}^F+F_{edit}^{Attn}$ | Code |
| TMLR 2023 | DreamEdit: Subject-driven Image Editing | text + image | $F_{inv}^T+F_{inv}^F+F_{edit}^{Blend}$ | Code |

2. Training-Based Approaches

| Publication | Paper Title | Guidance Set | Code/Project |
| :--- | :--- | :--- | :--- |
| CVPR 2023 | InstructPix2Pix: Learning to Follow Image Editing Instructions | instruction | Code |
| NeurIPS 2023 | MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing | instruction | Code |
| arXiv 2023 | HIVE: Harnessing Human Feedback for Instructional Visual Editing | instruction | Code |
| arXiv 2023 | Emu Edit: Precise Image Editing via Recognition and Generation Tasks | instruction | Code |
| ICLR 2024 | Guiding Instruction-Based Image Editing via Multimodal Large Language Models | instruction | Code |
| CVPR 2024 | SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models | instruction | Code |
| CVPR 2024 | Referring Image Editing: Object-level Image Editing via Referring Expressions | instruction | Code |
| arXiv 2024 | EditWorld: Simulating World Dynamics for Instruction-Following Image Editing | instruction | Code |
<br>

Attribute Manipulation:

1. Training-Free Approaches

| Publication | Paper Title | Guidance Set | Combination | Code/Project |
| :--- | :--- | :--- | :--- | :--- |
| PRCV 2023 | KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| ICCV 2023 | Localizing Object-level Shape Variations with Text-to-Image Diffusion Models | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| ICCV 2023 | MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| AAAI 2023 | Tuning-Free Inversion-Enhanced Control for Consistent Image Editing | text | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| SIGGRAPH 2024 | Cross-Image Attention for Zero-Shot Appearance Transfer | image | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
<!-- ### 2. Training-Based Approaches <br> -->

Spatial Transformation:

1. Training-Free Approaches

| Publication | Paper Title | Guidance Set | Combination | Code/Project |
| :--- | :--- | :--- | :--- | :--- |
| arXiv 2024 | DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing | user interface | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| NeurIPS 2023 | Diffusion Self-Guidance for Controllable Image Generation | text + image + user interface | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| ICLR 2024 | DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models | image + user interface | $F_{inv}^F+F_{edit}^{Score}$ | Code |
| ICLR 2024 | DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing | mask + user interface | $F_{inv}^T+F_{inv}^F+F_{edit}^{Optim}$ | Code |
| ICLR 2024 | DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing | image + user interface | $F_{inv}^T+F_{inv}^F+F_{edit}^{Score}$ | Code |
<br> <!-- ### 2. Training-Based Approaches -->

Inpainting:

1. Training-Free Approaches

| Publication | Paper Title | Guidance Set | Combination | Code/Project |
| :--- | :--- | :--- | :--- | :--- |
| arXiv 2023 | HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models | text + mask | $F_{inv}^T+F_{edit}^{Attn}$ | Code |
| ICCV 2023 | TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition | text + image | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| TOG 2023 | Blended Latent Diffusion | text + mask | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| arXiv 2023 | High-Resolution Image Editing via Multi-Stage Blended Diffusion | text + mask | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| arXiv 2023 | Differential Diffusion: Giving Each Pixel Its Strength | text + mask | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| CVPR 2024 | Tuning-Free Image Customization with Image and Text Guidance | text + image + mask | $F_{inv}^F+F_{edit}^{Blend}$ | Code |
| TMLR 2023 | DreamEdit: Subject-driven Image Editing | text + image + mask | $F_{inv}^T+F_{inv}^F+F_{edit}^{Blend}$ | Code |

2. Training-Based Approaches

| Publication | Paper Title | Guidance Set | Code/Project |
| :--- | :--- | :--- | :--- |
| CVPR 2024 | Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting | text + mask | Code |
| CVPR 2023 | SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model | text + mask | Code |
| arXiv 2023 | A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting | text + mask | Code |
| CVPR 2023 | Paint by Example: Exemplar-based Image Editing with Diffusion Models | image + mask | Code |
| CVPR 2023 | ObjectStitch: Object Compositing with Diffusion Model | image + mask | Code |
| CVPR 2023 | Reference-based Image Composition with Sketch via Structure-aware Diffusion Model | image + mask | Code |
| ICASSP 2024 | Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model | text + image + mask | Code |
| CVPR 2024 | AnyDoor: Zero-shot Object-level Image Customization | image + mask | Code |
<br>

Style Change:

1. Training-Free Approaches

| Publication | Paper Title | Guidance Set | Combination | Code/Project |
| :--- | :--- | :--- | :--- | :--- |
| CVPR 2023 | Inversion-Based Style Transfer with Diffusion Models | text + image | $F_{inv}^T+F_{inv}^F+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | Z∗: Zero-shot Style Transfer via Attention Rearrangement | image | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
| CVPR 2024 | Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer | image | $F_{inv}^F+F_{edit}^{Attn}$ | Code |
<!-- ### 2. Training-Based Approaches <br> -->

Image Translation:

1. Training-Free Approaches

| Publication | Paper Title | Guidance Set | Combination | Code/Project |
| :--- | :--- | :--- | :--- | :--- |
| CVPR 2024 | FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition | text | $F_{inv}^F+F_{edit}^{Score}$ | Code |

2. Training-Based Approaches

| Publication | Paper Title | Guidance Set | Code/Project |
| :--- | :--- | :--- | :--- |
| ICCV 2023 | Adding Conditional Control to Text-to-Image Diffusion Models | text | Code |
| NeurIPS 2023 | Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation | text | Code |
| NeurIPS 2023 | Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models | text | Code |
| NeurIPS 2023 | CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation | text | Code |
| AAAI 2024 | T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models | text | Code |
| CVPR 2024 | SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing | text | Code |
| arXiv 2024 | One-Step Image Translation with Text-to-Image Models | text | Code |
<br>

Subject-Driven Customization:

1. Training-Free Approaches

| Publication | Paper Title | Guidance Set | Combination | Code/Project |
| :--- | :--- | :--- | :--- | :--- |
| ICLR 2023 | An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2022 | DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Positive-Negative Prompt-Tuning | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | P+: Extended Textual Conditioning in Text-to-Image Generation | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| TOG 2023 | A Neural Space-Time Representation for Text-to-Image Personalization | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| CVPR 2023 | DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| CVPR 2023 | Multi-Concept Customization of Text-to-Image Diffusion | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| ICML 2023 | Cones: Concept Neurons in Diffusion Models for Customized Generation | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| ICCV 2023 | SVDiff: Compact Parameter Space for Diffusion Fine-Tuning | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
|  | Low-Rank Adaptation for Fast Text-to-Image Diffusion Fine-Tuning | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | A Closer Look at Parameter-Efficient Tuning in Diffusion Models | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| SIGGRAPH 2023 | Break-A-Scene: Extracting Multiple Concepts from a Single Image | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | CLiC: Concept Learning in Context | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| AAAI 2024 | Decoupled Textual Embeddings for Customized Image Generation | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| ICLR 2024 | A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| CVPR 2024 | FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation | text | $F_{inv}^T+F_{edit}^{Attn}$ | Code |
| CVPR 2024 | DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization | text | $F_{inv}^T+F_{edit}^{Attn}$ | Code |
| arXiv 2024 | Direct Consistency Optimization for Compositional Text-to-Image Personalization | text | $F_{inv}^T+F_{edit}^{Score}$ | Code |
| arXiv 2024 | Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization | text | $F_{inv}^F+F_{edit}^{Optim}$ | Code |
<!-- [Cones2]() | [📖 ] | [Inversion+Editing] | [🌐 Code]() --> <!-- [CatVersion]() | [📖 ] | [Inversion+Editing] | [🌐 Code]() -->

2. Training-Based Approaches

| Publication | Paper Title | Guidance Set | Code/Project |
| :--- | :--- | :--- | :--- |
| arXiv 2023 | Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models | text | Code |
| arXiv 2023 | FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention | text | Code |
| arXiv 2023 | PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding | text | Code |
| arXiv 2023 | PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models | text | Code |
| ICCV 2023 | ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation | text | Code |
| NeurIPS 2023 | BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing | text | Code |
| SIGGRAPH 2023 | Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models | text | Code |
| arXiv 2023 | IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models | text | Code |
| NeurIPS 2023 | Subject-driven Text-to-Image Generation via Apprenticeship Learning | text | Code |
| arXiv 2023 | Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation | text | Code |
| arXiv 2023 | Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning | text | Code |
| arXiv 2024 | Instruct-Imagen: Image Generation with Multi-modal Instruction | instruction | Code |
| arXiv 2024 | InstantID: Zero-shot Identity-Preserving Generation in Seconds | text | Code |
| ICLR 2024 | Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models | text | Code |
| CVPR 2024 | InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning | text | Code |
| ICLR 2024 | Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach | text | Code |
<br>

Attribute-Driven Customization:

1. Training-Free Approaches

| Publication | Paper Title | Guidance Set | Combination | Code/Project |
| :--- | :--- | :--- | :--- | :--- |
| arXiv 2023 | ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| TOG 2023 | Concept Decomposition for Visual Exploration and Inspiration | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | ReVersion: Diffusion-Based Relation Inversion from Images | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| arXiv 2023 | Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |
| NeurIPS 2023 | StyleDrop: Text-to-Image Generation in Any Style | text | $F_{inv}^T+F_{edit}^{Norm}$ | Code |

2. Training-Based Approaches

| Publication | Paper Title | Guidance Set | Code/Project |
| :--- | :--- | :--- | :--- |
| arXiv 2023 | ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation | text | Code |
| arXiv 2023 | DreamCreature: Crafting Photorealistic Virtual Creatures from Imagination | text | Code |
| ICLR 2024 | Language-Informed Visual Concept Learning | text | Code |
| arXiv 2024 | pOps: Photo-Inspired Diffusion Operators | text | Code |

Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{ImgEditing,
      title={A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models},
      author={Shuai, Xincheng and Ding, Henghui and Ma, Xingjun and Tu, Rongcheng and Jiang, Yu-Gang and Tao, Dacheng},
      journal={arXiv preprint arXiv:2406.14555},
      year={2024}
}

Contact

henghui.ding[AT]gmail.com