Awesome Instruction Editing

A Survey of Instruction-Guided Image and Media Editing in LLM Era

Awesome arXiv GitHub stars Hits <img src="https://img.shields.io/badge/Contributions-Welcome-278ea5" alt="Contrib"/>

A collection of academic articles, published methodologies, and datasets on the subject of instruction-guided image and media editing.

A sortable version is available here: https://awesome-instruction-editing.github.io/

🔖 News!!!

📌 We are actively tracking the latest research and welcome contributions to our repository and survey paper. If your studies are relevant, please feel free to create an issue or a pull request.

📰 2024-11-15: Our paper Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM Era has been revised to version 1 with new methods and discussions.

๐Ÿ” Citation

If you find this work helpful in your research, please consider citing the paper and giving it a ⭐.

Please read and cite our paper: arXiv

Nguyen, T.T., Ren, Z., Pham, T., Huynh, T.T., Nguyen, P.L., Yin, H., and Nguyen, Q.V.H., 2024. Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM Era. arXiv preprint arXiv:2411.09955.

@article{nguyen2024instruction,
  title={Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era},
  author={Thanh Tam Nguyen and Zhao Ren and Trinh Pham and Thanh Trung Huynh and Phi Le Nguyen and Hongzhi Yin and Quoc Viet Hung Nguyen},
  journal={arXiv preprint arXiv:2411.09955},
  year={2024}
}

Existing Surveys

| Paper Title | Venue | Year | Focus |
|---|---|---|---|
| A Survey of Multimodal Composite Editing and Retrieval | arXiv | 2024 | Media Retrieval |
| INFOBENCH: Evaluating Instruction Following Ability in Large Language Models | arXiv | 2024 | Text Editing |
| Multimodal Image Synthesis and Editing: The Generative AI Era | TPAMI | 2023 | X-to-Image Generation |
| LLM-driven Instruction Following: Progresses and Concerns | EMNLP | 2023 | Text Editing |

Pipeline

pipeline


Approaches

| Title | Year | Venue | Category | Code |
|---|---|---|---|---|
| Guiding Instruction-based Image Editing via Multimodal Large Language Models | 2024 | ICLR | LLM-guided, Diffusion, Concise instruction loss, Supervised fine-tuning | Code |
| Hive: Harnessing human feedback for instructional visual editing | 2024 | CVPR | RLHF, Diffusion, Data augmentation | Code |
| InstructBrush: Learning Attention-based Instruction Optimization for Image Editing | 2024 | arXiv | Diffusion, Attention-based | Code |
| FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing | 2024 | arXiv | Controllable diffusion | Code |
| Pix2Pix-OnTheFly: Leveraging LLMs for Instruction-Guided Image Editing | 2024 | arXiv | On-the-fly, tuning-free, training-free | Code |
| EffiVED: Efficient Video Editing via Text-instruction Diffusion Models | 2024 | arXiv | Video editing, decoupled classifier-free | Code |
| Grounded-Instruct-Pix2Pix: Improving Instruction Based Image Editing with Automatic Target Grounding | 2024 | ICASSP | Diffusion, mask generation, image editing | Code |
| TexFit: Text-Driven Fashion Image Editing with Diffusion Models | 2024 | AAAI | Fashion editing, region location, diffusion | Code |
| InstructGIE: Towards Generalizable Image Editing | 2024 | arXiv | Diffusion, context matching | Code |
| An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control | 2024 | arXiv | Freestyle, Diffusion, Group attention | Code |
| Text-Driven Image Editing via Learnable Regions | 2024 | CVPR | Region generation, diffusion, mask-free | Code |
| ChartReformer: Natural Language-Driven Chart Image Editing | 2024 | ICDAR | Chart editing | Code |
| GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models | 2024 | arXiv | Hybrid, direction transfer | Code |
| StyleBooth: Image Style Editing with Multimodal Instruction | 2024 | arXiv | Style editing, diffusion | Code |
| ZONE: Zero-Shot Instruction-Guided Local Editing | 2024 | CVPR | Local editing, localisation | Code |
| Inversion-Free Image Editing with Natural Language | 2024 | CVPR | Consistent models, unified attention | Code |
| Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation | 2024 | CVPR | Diffusion, multi-instruction | Code |
| MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers | 2024 | arXiv | MoE, LLM-powered | Code |
| InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists | 2024 | ICLR | Diffusion, LLM-based, classifier-free | Code |
| Iterative Multi-Granular Image Editing Using Diffusion Models | 2024 | WACV | Diffusion, Iterative editing | |
| Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing | 2024 | NeurIPS | Diffusion, dynamic prompt | Code |
| Object-Aware Inversion and Reassembly for Image Editing | 2024 | ICLR | Diffusion, multi-object | Code |
| Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models | 2024 | arXiv | Video editing, zero-shot | Code |
| Video-P2P: Video Editing with Cross-attention Control | 2024 | CVPR | Decoupled-guidance attention control, video editing | Code |
| NeRF-Insert: 3D Local Editing with Multimodal Control Signals | 2024 | arXiv | 3D Editing | |
| BlenderAlchemy: Editing 3D Graphics with Vision-Language Models | 2024 | arXiv | 3D Editing | Code |
| AudioScenic: Audio-Driven Video Scene Editing | 2024 | arXiv | Audio-based instruction | |
| LocInv: Localization-aware Inversion for Text-Guided Image Editing | 2024 | CVPR-AI4CC | Localization-aware inversion | Code |
| SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models | 2024 | arXiv | Audio-driven | Code |
| Exploring Text-Guided Single Image Editing for Remote Sensing Images | 2024 | arXiv | Remote sensing images | Code |
| GaussianVTON: 3D Human Virtual Try-ON via Multi-Stage Gaussian Splatting Editing with Image Prompting | 2024 | arXiv | Fashion editing | Code |
| TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing | 2024 | arXiv | Chain of thought | |
| Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection | 2024 | arXiv | Diffusion, Self-attention Injection | Code |
| Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning | 2024 | arXiv | Music editing, diffusion | Code |
| Text Guided Image Editing with Automatic Concept Locating and Forgetting | 2024 | arXiv | Diffusion, concept forgetting | |
| InstructPix2Pix: Learning To Follow Image Editing Instructions | 2023 | CVPR | Core paper, Diffusion | Code |
| Visual Instruction Inversion: Image Editing via Image Prompting | 2023 | NeurIPS | Diffusion, visual instruction | Code |
| Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions | 2023 | ICCV | 3D scene editing | Code |
| Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion | 2023 | arXiv | 3D editing, Dynamic scaling | Code |
| InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models | 2023 | arXiv | Music editing, diffusion | Code |
| EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models | 2023 | arXiv | Authorized editing, diffusion | Code |
| Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis | 2023 | arXiv | Video editing, cross-time attention | Code |
| AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | 2023 | NeurIPS | Audio, Diffusion | Code |
| InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following | 2023 | arXiv | Refinement prior, instruction tuning | Code |
| Learning to Follow Object-Centric Image Editing Instructions Faithfully | 2023 | EMNLP | Diffusion, additional supervision | Code |
| StableVideo: Text-driven Consistency-aware Diffusion Video Editing | 2023 | ICCV | Diffusion, Video | Code |
| Vox-E: Text-Guided Voxel Editing of 3D Objects | 2023 | ICCV | Diffusion, 3D | Code |
| FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion | 2023 | arXiv | GAN, fashion images | Code |
| NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models | 2023 | CVPR | Null-text embedding, Diffusion, CLIP | Code |
| Imagic: Text-based real image editing with diffusion models | 2023 | CVPR | Diffusion, embedding interpolation | Code |
| PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models | 2023 | arXiv | Diffusion, dual-branch concept | Code |
| InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | 2023 | arXiv | Diffusion, LLM-powered | Code |
| InstructDiffusion: A generalist modeling interface for vision tasks | 2023 | arXiv | Multi-task, multi-turn, Diffusion, LLM | Code |
| Emu Edit: Precise Image Editing via Recognition and Generation Tasks | 2023 | arXiv | Diffusion, multi-task, multi-turn | Code |
| DialogPaint: A dialog-based image editing model | 2023 | arXiv | Dialog-based | |
| Inst-Inpaint: Instructing to Remove Objects with Diffusion Models | 2023 | arXiv | Scene Editing | Code |
| ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation | 2023 | NeurIPS | Example-based instruction | |
| SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models | 2023 | arXiv | MLLM, Diffusion | Code |
| ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation | 2023 | arXiv | LLM, Diffusion | Code |
| iEdit: Localised Text-guided Image Editing with Weak Supervision | 2023 | arXiv | Localized diffusion | |
| Prompt-to-Prompt Image Editing with Cross Attention Control | 2023 | ICLR | Diffusion, Cross Attention | Code |
| Target-Free Text-Guided Image Manipulation | 2023 | AAAI | 3D Editing | Code |
| Paint by example: Exemplar-based image editing with diffusion models | 2023 | CVPR | Diffusion, example-based | Code |
| De-net: Dynamic text-guided image editing adversarial networks | 2023 | AAAI | GAN, multi-task | Code |
| Imagen editor and editbench: Advancing and evaluating text-guided image inpainting | 2023 | CVPR | Diffusion, benchmark, CLIP | Code |
| Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation | 2023 | CVPR | Diffusion, feature injection | Code |
| MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing | 2023 | ICCV | Diffusion, mutual self-attention | Code |
| UniTune: Text-driven image editing by fine tuning a diffusion model on a single image | 2023 | TOG | Diffusion, fine-tuning | Code |
| Dreamix: Video Diffusion Models are General Video Editors | 2023 | arXiv | Cascaded diffusion, video | Code |
| LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models | 2022 | BMVC | Latent diffusion | |
| StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation | 2022 | WACV | GAN, CLIP | Code |
| Blended Diffusion for Text-Driven Editing of Natural Images | 2022 | CVPR | Diffusion, CLIP, Blend | Code |
| VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance | 2022 | ECCV | GAN, CLIP | Code |
| StyleGAN-NADA: CLIP-guided domain adaptation of image generators | 2022 | TOG | GAN, CLIP | Code |
| DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation | 2022 | CVPR | Diffusion, CLIP, Noise combination | Code |
| GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models | 2022 | ICML | Diffusion, CLIP, Classifier-free guidance | Code |
| DiffEdit: Diffusion-based semantic image editing with mask guidance | 2022 | ICLR | Diffusion, DDIM, Mask generation | Code |
| Text2Mesh: Text-driven neural stylization for meshes | 2022 | CVPR | 3D Editing | Code |
| ManiTrans: Entity-level text-guided image manipulation via token-wise semantic alignment and generation | 2022 | CVPR | GAN, multi-entities | Code |
| Text2LIVE: Text-driven layered image and video editing | 2022 | ECCV | GAN, CLIP, Video editing | Code |
| SpeechPainter: Text-Conditioned Speech Inpainting | 2022 | Interspeech | Speech editing | Code |
| Talk-to-Edit: Fine-Grained Facial Editing via Dialog | 2021 | ICCV | GAN, dialog, semantic field | Code |
| ManiGAN: Text-guided image manipulation | 2020 | CVPR | GAN, affine combination | Code |
| SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning | 2020 | EMNLP | GAN, Cross-task consistency | Code |
| Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions | 2020 | ECCV | GAN | Code |
| Sequential Attention GAN for Interactive Image Editing | 2020 | MM | GAN, Dialog, Sequential Attention | |
| Lightweight generative adversarial networks for text-guided image manipulation | 2020 | NeurIPS | Light-weight GAN | Code |
| Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction | 2019 | ICCV | GAN | Code |
| Language-Based Image Editing With Recurrent Attentive Models | 2018 | CVPR | GAN, Recurrent Attention | Code |
| Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language | 2018 | NeurIPS | GAN, simple | Code |
| FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction | 2024 | arXiv | Diffusion, instruction-driven editing | Code |
| Revealing Directions for Text-guided 3D Face Editing | 2024 | arXiv | Text-guided 3D face editing | |
| Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing | 2024 | arXiv | Text-to-image, editing, diffusion | |
| Hyper-parameter tuning for text guided image editing | 2024 | arXiv | Text Editing | Code |
| Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models | 2024 | arXiv | Text-guided Object Insertion | Code |

Other types of Editing

| Title | Year | Venue | Category | Code |
|---|---|---|---|---|
| SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing | 2024 | SIGGRAPH Asia | Diffusion, scene graph, image editing | Code |
| Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition | 2024 | arXiv | Text-to-Audio, Multimodal | |
| AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework | 2024 | arXiv | Diffusion-based text-to-audio | Code |
| Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis | 2024 | BMVC | Diffusion-based local image manipulation | Code |
| Steer-by-prior Editing of Symbolic Music Loops | 2024 | MML | Masked Language Modelling, music instruments | Code |
| Audio Prompt Adapter: Unleashing Music Editing Abilities for Text-to-Music with Lightweight Finetuning | 2024 | ISMIR | Diffusion-based text-to-audio | Code |
| GroupDiff: Diffusion-based Group Portrait Editing | 2024 | ECCV | Diffusion-based image editing | Code |
| RegionDrag: Fast Region-Based Image Editing with Diffusion Models | 2024 | ECCV | Diffusion-based image editing | Code |
| SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing | 2024 | arXiv | Multi-view consistency | |
| DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation | 2024 | arXiv | Diffusion-based editing | Code |
| MEDIC: Zero-shot Music Editing with Disentangled Inversion Control | 2024 | arXiv | Audio editing | |
| 3DEgo: 3D Editing on the Go! | 2024 | ECCV | Monocular 3D Scene Synthesis | Code |
| MedEdit: Counterfactual Diffusion-based Image Editing on Brain MRI | 2024 | SASHIMI | Biomedical editing | |
| FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing | 2024 | ECCV | Image editing | |
| LEMON: Localized Editing with Mesh Optimization and Neural Shaders | 2024 | arXiv | Mesh editing | |
| Diffusion Brush: A Latent Diffusion Model-based Editing Tool for AI-generated Images | 2024 | arXiv | Image editing | |
| Streamlining Image Editing with Layered Diffusion Brushes | 2024 | arXiv | Image editing | |
| SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | 2024 | arXiv | Image Editing Dataset | Code |
| Environment Maps Editing using Inverse Rendering and Adversarial Implicit Functions | 2024 | arXiv | Inverse rendering, HDR editing | |
| HairDiffusion: Vivid Multi-Colored Hair Editing via Latent Diffusion | 2024 | arXiv | Hair editing, Diffusion models | |
| DiffuMask-Editor: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing to Improve Segmentation Ability | 2024 | arXiv | Synthetic Data Generation | |
| Taming Rectified Flow for Inversion and Editing | 2024 | arXiv | Image Inversion | Code |

Datasets

Type: General

| Dataset | #Items | #Papers Used | Link |
|---|---|---|---|
| Reason-Edit | 12.4M+ | 1 | Link |
| MagicBrush | 10K | 1 | Link |
| InstructPix2Pix | 500K | 1 | Link |
| EditBench | 240 | 1 | Link |

Type: Image Captioning

| Dataset | #Items | #Papers Used | Link |
|---|---|---|---|
| Conceptual Captions | 3.3M | 1 | Link |
| CoSaL | 22K+ | 1 | Link |
| ReferIt | 19K+ | 1 | Link |
| Oxford-102 Flowers | 8K+ | 1 | Link |
| LAION-5B | 5.85B+ | 1 | Link |
| MS-COCO | 330K | 2 | Link |
| DeepFashion | 800K | 2 | Link |
| Fashion-IQ | 77K+ | 1 | Link |
| Fashion200k | 200K | 1 | Link |
| MIT-States | 63K+ | 1 | Link |
| CIRR | 36K+ | 1 | Link |

Type: ClipArt

| Dataset | #Items | #Papers Used | Link |
|---|---|---|---|
| CoDraw | 58K+ | 1 | Link |

Type: VQA

| Dataset | #Items | #Papers Used | Link |
|---|---|---|---|
| i-CLEVR | 70K+ | 1 | Link |

Type: Semantic Segmentation

| Dataset | #Items | #Papers Used | Link |
|---|---|---|---|
| ADE20K | 27K+ | 1 | Link |

Type: Object Classification

| Dataset | #Items | #Papers Used | Link |
|---|---|---|---|
| Oxford-IIIT Pets | 7K+ | 1 | Link |

Type: Depth Estimation

| Dataset | #Items | #Papers Used | Link |
|---|---|---|---|
| NYUv2 | 408K+ | 1 | Link |

Type: Aesthetic-Based Editing

| Dataset | #Items | #Papers Used | Link |
|---|---|---|---|
| LAION-Aesthetics V2 | 2.4B+ | 1 | Link |

Type: Dialog-Based Editing

| Dataset | #Items | #Papers Used | Link |
|---|---|---|---|
| CelebA-Dialog | 202K+ | 1 | Link |
| Flickr-Faces-HQ | 70K | 2 | Link |

Evaluation Metrics

| Category | Evaluation Metric | Formula | Usage |
|---|---|---|---|
| Perceptual Quality | Learned Perceptual Image Patch Similarity (LPIPS) | $\text{LPIPS}(x, x') = \sum_l \lVert \phi_l(x) - \phi_l(x') \rVert^2$ | Measures perceptual similarity between images, with lower scores indicating higher similarity. |
| | Structural Similarity Index (SSIM) | $\text{SSIM}(x, x') = \frac{(2\mu_x\mu_{x'} + C_1)(2\sigma_{xx'} + C_2)}{(\mu_x^2 + \mu_{x'}^2 + C_1)(\sigma_x^2 + \sigma_{x'}^2 + C_2)}$ | Measures visual similarity based on luminance, contrast, and structure. |
| | Fréchet Inception Distance (FID) | $\text{FID} = \lVert \mu_r - \mu_g \rVert^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$ | Measures the distance between the real and generated image feature distributions. |
| | Inception Score (IS) | $\text{IS} = \exp(\mathbb{E}_x D_{KL}(p(y \mid x) \,\Vert\, p(y)))$ | Evaluates image quality and diversity based on label distribution consistency. |
| Structural Integrity | Peak Signal-to-Noise Ratio (PSNR) | $\text{PSNR} = 10 \log_{10} \left( \frac{\text{MAX}^2}{\text{MSE}} \right)$ | Measures image quality based on pixel-wise errors, with higher values indicating better quality. |
| | Mean Intersection over Union (mIoU) | $\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert A_i \cap B_i \rvert}{\lvert A_i \cup B_i \rvert}$ | Assesses segmentation accuracy by comparing predicted and ground-truth masks. |
| | Mask Accuracy | $\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$ | Evaluates the accuracy of generated masks. |
| | Boundary Adherence | $\text{BA} = \frac{\lvert B_{\text{edit}} \cap B_{\text{target}} \rvert}{\lvert B_{\text{target}} \rvert}$ | Measures how well edits preserve object boundaries. |
| Semantic Alignment | Edit Consistency | $\text{EC} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{E_i = E_{\text{ref}}\}$ | Measures the consistency of edits across similar prompts. |
| | Target Grounding Accuracy | $\text{TGA} = \frac{\text{Correct Targets}}{\text{Total Targets}}$ | Evaluates how well edits align with targets specified in the prompt. |
| | Embedding Space Similarity | $\text{CosSim}(v_x, v_{x'}) = \frac{v_x \cdot v_{x'}}{\lVert v_x \rVert \, \lVert v_{x'} \rVert}$ | Measures similarity between the edited and reference images in feature space. |
| | Decomposed Requirements Following Ratio (DRFR) | $\text{DRFR} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{Requirements Followed}}{\text{Total Requirements}}$ | Assesses how closely the model follows decomposed instructions. |
| User-Based Metrics | User Study Ratings | | Captures user feedback through ratings of image quality. |
| | Human Visual Turing Test (HVTT) | $\text{HVTT} = \frac{\text{Real Judgements}}{\text{Total Judgements}}$ | Measures the ability of users to distinguish between real and generated images. |
| | Click-through Rate (CTR) | $\text{CTR} = \frac{\text{Clicks}}{\text{Total Impressions}}$ | Tracks user engagement by measuring image clicks. |
| Diversity and Fidelity | Edit Diversity | $\text{Diversity} = \frac{1}{N} \sum_{i=1}^{N} D_{KL}(p_i \,\Vert\, p_{\text{mean}})$ | Measures the variability of generated images. |
| | GAN Discriminator Score | $\text{GDS} = \frac{1}{N} \sum_{i=1}^{N} D_{\text{GAN}}(x_i)$ | Assesses the authenticity of generated images using a GAN discriminator. |
| | Reconstruction Error | $\text{RE} = \lVert x - \hat{x} \rVert$ | Measures the error between the original and generated images. |
| | Edit Success Rate | $\text{ESR} = \frac{\text{Successful Edits}}{\text{Total Edits}}$ | Quantifies the success of applied edits. |
| Consistency and Cohesion | Scene Consistency | $\text{SC} = \frac{1}{N} \sum_{i=1}^{N} \text{Sim}(I_{\text{edit}}, I_{\text{orig}})$ | Measures how edits maintain overall scene structure. |
| | Color Consistency | $\text{CC} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert C_{\text{edit}} \cap C_{\text{orig}} \rvert}{\lvert C_{\text{orig}} \rvert}$ | Measures color preservation between edited and original regions. |
| | Shape Consistency | $\text{ShapeSim} = \frac{1}{N} \sum_{i=1}^{N} \text{IoU}(S_{\text{edit}}, S_{\text{orig}})$ | Quantifies how well shapes are preserved during edits. |
| | Pose Matching Score | $\text{PMS} = \frac{1}{N} \sum_{i=1}^{N} \text{Sim}(\theta_{\text{edit}}, \theta_{\text{orig}})$ | Assesses pose consistency between original and edited images. |
| Robustness | Noise Robustness | $\text{NR} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - x_{i,\text{noisy}} \rVert$ | Evaluates model robustness to noise. |
| | Perceptual Quality | $\text{PQ} = \frac{1}{N} \sum_{i=1}^{N} \text{Score}(x_i)$ | A subjective quality metric based on human judgment. |
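Several of the simpler metrics above can be computed directly from pixel or feature arrays. The following minimal Python sketch (function names and the `max_val` default are our own choices, not taken from any referenced paper) implements PSNR, cosine similarity, and single-pair IoU exactly as the table formulas define them:

```python
import math

def psnr(x, y, max_val=255.0):
    """PSNR = 10 * log10(MAX^2 / MSE) over two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return 10 * math.log10(max_val ** 2 / mse)

def cos_sim(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def iou(a, b):
    """IoU between two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union
```

In practice, papers in the table compute these over batches with NumPy or torchmetrics; the sketch above only illustrates the definitions.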

Disclaimer

Feel free to contact us if you have any queries or exciting news. We also welcome all researchers to contribute to this repository and to the knowledge of this field.

If you have other related references, please feel free to create a GitHub issue with the paper information. We will gladly update the repository according to your suggestions. (You can also create pull requests, but it may take some time for us to merge them.)
