Awesome Instruction Editing
A Survey of Instruction-Guided Image and Media Editing in LLM Era
<img src="https://img.shields.io/badge/Contributions-Welcome-278ea5" alt="Contrib"/>
A collection of academic articles, published methodologies, and datasets on instruction-guided image and media editing.
A sortable version is available here: https://awesome-instruction-editing.github.io/
News
We are actively tracking the latest research and welcome contributions to our repository and survey paper. If your studies are relevant, please feel free to create an issue or a pull request.
2024-11-15: Our paper Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM Era has been revised to version 1 with new methods and discussions.
Citation
If you find this work helpful in your research, please consider citing the paper and giving the repository a ⭐.
Please read and cite our paper:
Nguyen, T.T., Ren, Z., Pham, T., Huynh, T.T., Nguyen, P.L., Yin, H., and Nguyen, Q.V.H., 2024. Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM Era. arXiv preprint arXiv:2411.09955.
```bibtex
@article{nguyen2024instruction,
  title={Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era},
  author={Thanh Tam Nguyen and Zhao Ren and Trinh Pham and Thanh Trung Huynh and Phi Le Nguyen and Hongzhi Yin and Quoc Viet Hung Nguyen},
  journal={arXiv preprint arXiv:2411.09955},
  year={2024}
}
```
Existing Surveys
Paper Title | Venue | Year | Focus |
---|---|---|---|
A Survey of Multimodal Composite Editing and Retrieval | arXiv | 2024 | Media Retrieval |
INFOBENCH: Evaluating Instruction Following Ability in Large Language Models | arXiv | 2024 | Text Editing |
Multimodal Image Synthesis and Editing: The Generative AI Era | TPAMI | 2023 | X-to-Image Generation |
LLM-driven Instruction Following: Progresses and Concerns | EMNLP | 2023 | Text Editing |
Pipeline
Approaches
Other Types of Editing
Datasets
Type: General
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
Reason-Edit | 12.4M+ | 1 | Link |
MagicBrush | 10K | 1 | Link |
InstructPix2Pix | 500K | 1 | Link |
EditBench | 240 | 1 | Link |
Type: Image Captioning
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
Conceptual Captions | 3.3M | 1 | Link |
CoSaL | 22K+ | 1 | Link |
ReferIt | 19K+ | 1 | Link |
Oxford-102 Flowers | 8K+ | 1 | Link |
LAION-5B | 5.85B+ | 1 | Link |
MS-COCO | 330K | 2 | Link |
DeepFashion | 800K | 2 | Link |
Fashion-IQ | 77K+ | 1 | Link |
Fashion200k | 200K | 1 | Link |
MIT-States | 63K+ | 1 | Link |
CIRR | 36K+ | 1 | Link |
Type: ClipArt
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
CoDraw | 58K+ | 1 | Link |
Type: VQA
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
i-CLEVR | 70K+ | 1 | Link |
Type: Semantic Segmentation
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
ADE20K | 27K+ | 1 | Link |
Type: Object Classification
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
Oxford-III-Pets | 7K+ | 1 | Link |
Type: Depth Estimation
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
NYUv2 | 408K+ | 1 | Link |
Type: Aesthetic-Based Editing
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
LAION-Aesthetics V2 | 2.4B+ | 1 | Link |
Type: Dialog-Based Editing
Dataset | #Items | #Papers Used | Link |
---|---|---|---|
CelebA-Dialog | 202K+ | 1 | Link |
Flickr-Faces-HQ | 70K | 2 | Link |
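Many of the datasets above are distributed through the Hugging Face Hub, so a quick way to inspect one is the `datasets` library. Below is a minimal sketch, assuming MagicBrush is hosted under the hub ID `osunlp/MagicBrush`; the ID and column layout are assumptions, so verify them on the dataset's page before relying on them.

```python
# Minimal sketch: loading an instruction-editing dataset for inspection.
# Assumes `pip install datasets` and that MagicBrush is hosted under the
# hub ID "osunlp/MagicBrush" -- check the dataset page to confirm.
from datasets import load_dataset

ds = load_dataset("osunlp/MagicBrush", split="train")

# Editing datasets generally pair a source image with a textual instruction
# and an edited target image; column names differ per dataset, so list them
# before wiring the data into any training or evaluation code.
print(ds.column_names)
print(ds[0])
```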
Evaluation Metrics
| Category | Evaluation Metrics | Formula | Usage |
|---|---|---|---|
| Perceptual Quality | Learned Perceptual Image Patch Similarity (LPIPS) | $\text{LPIPS}(x, x') = \sum_l \Vert \phi_l(x) - \phi_l(x') \Vert_2^2$ | Measures perceptual similarity between images, with lower scores indicating higher similarity. |
| | Structural Similarity Index (SSIM) | $\text{SSIM}(x, x') = \frac{(2\mu_x\mu_{x'} + C_1)(2\sigma_{xx'} + C_2)}{(\mu_x^2 + \mu_{x'}^2 + C_1)(\sigma_x^2 + \sigma_{x'}^2 + C_2)}$ | Measures visual similarity based on luminance, contrast, and structure. |
| | Fréchet Inception Distance (FID) | $\text{FID} = \Vert \mu_r - \mu_g \Vert^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$ | Measures the distance between the real and generated image feature distributions. |
| | Inception Score (IS) | $\text{IS} = \exp\big(\mathbb{E}_x\, D_{KL}(p(y \mid x) \,\Vert\, p(y))\big)$ | Evaluates image quality and diversity based on label distribution consistency. |
| Structural Integrity | Peak Signal-to-Noise Ratio (PSNR) | $\text{PSNR} = 10 \log_{10} \left( \frac{\text{MAX}^2}{\text{MSE}} \right)$ | Measures image quality based on pixel-wise errors, with higher values indicating better quality. |
| | Mean Intersection over Union (mIoU) | $\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{\vert A_i \cap B_i \vert}{\vert A_i \cup B_i \vert}$ | Assesses segmentation accuracy by comparing predicted and ground truth masks. |
| | Mask Accuracy | $\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$ | Evaluates the accuracy of generated masks. |
| | Boundary Adherence | $\text{BA} = \frac{\vert B_{\text{edit}} \cap B_{\text{target}} \vert}{\vert B_{\text{target}} \vert}$ | Measures how well edits preserve object boundaries. |
| Semantic Alignment | Edit Consistency | $\text{EC} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{E_i = E_{\text{ref}}\}$ | Measures the consistency of edits across similar prompts. |
| | Target Grounding Accuracy | $\text{TGA} = \frac{\text{Correct Targets}}{\text{Total Targets}}$ | Evaluates how well edits align with specified targets in the prompt. |
| | Embedding Space Similarity | $\text{CosSim}(v_x, v_{x'}) = \frac{v_x \cdot v_{x'}}{\Vert v_x \Vert \, \Vert v_{x'} \Vert}$ | Measures similarity between the edited and reference images in feature space. |
| | Decomposed Requirements Following Ratio (DRFR) | $\text{DRFR} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{Requirements Followed}}{\text{Total Requirements}}$ | Assesses how closely the model follows decomposed instructions. |
| User-Based Metrics | User Study Ratings | — | Captures user feedback through ratings of image quality. |
| | Human Visual Turing Test (HVTT) | $\text{HVTT} = \frac{\text{Real Judgements}}{\text{Total Judgements}}$ | Measures the ability of users to distinguish between real and generated images. |
| | Click-through Rate (CTR) | $\text{CTR} = \frac{\text{Clicks}}{\text{Total Impressions}}$ | Tracks user engagement by measuring image clicks. |
| Diversity and Fidelity | Edit Diversity | $\text{Diversity} = \frac{1}{N} \sum_{i=1}^{N} D_{KL}(p_i \,\Vert\, p_{\text{mean}})$ | Measures the variability of generated images. |
| | GAN Discriminator Score | $\text{GDS} = \frac{1}{N} \sum_{i=1}^{N} D_{\text{GAN}}(x_i)$ | Assesses the authenticity of generated images using a GAN discriminator. |
| | Reconstruction Error | $\text{RE} = \Vert x - \hat{x} \Vert$ | Measures the error between the original and generated images. |
| | Edit Success Rate | $\text{ESR} = \frac{\text{Successful Edits}}{\text{Total Edits}}$ | Quantifies the success of applied edits. |
| Consistency and Cohesion | Scene Consistency | $\text{SC} = \frac{1}{N} \sum_{i=1}^{N} \text{Sim}(I_{\text{edit}}, I_{\text{orig}})$ | Measures how edits maintain overall scene structure. |
| | Color Consistency | $\text{CC} = \frac{1}{N} \sum_{i=1}^{N} \frac{\vert C_{\text{edit}} \cap C_{\text{orig}} \vert}{\vert C_{\text{orig}} \vert}$ | Measures color preservation between edited and original regions. |
| | Shape Consistency | $\text{ShapeSim} = \frac{1}{N} \sum_{i=1}^{N} \text{IoU}(S_{\text{edit}}, S_{\text{orig}})$ | Quantifies how well shapes are preserved during edits. |
| | Pose Matching Score | $\text{PMS} = \frac{1}{N} \sum_{i=1}^{N} \text{Sim}(\theta_{\text{edit}}, \theta_{\text{orig}})$ | Assesses pose consistency between original and edited images. |
| Robustness | Noise Robustness | $\text{NR} = \frac{1}{N} \sum_{i=1}^{N} \Vert x_i - x_{i,\text{noisy}} \Vert$ | Evaluates model robustness to noise. |
| | Perceptual Quality | $\text{PQ} = \frac{1}{N} \sum_{i=1}^{N} \text{Score}(x_i)$ | A subjective quality metric based on human judgment. |
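To make the formulas above concrete, here is a minimal NumPy sketch of three of the simpler metrics (PSNR, embedding-space cosine similarity, and mIoU), written directly from the table's definitions. The function names are ours, and this is illustrative reference code, not the implementation used by any surveyed paper.

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE); higher means closer to the reference."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def cosine_similarity(v: np.ndarray, w: np.ndarray) -> float:
    """Embedding-space similarity: (v . w) / (||v|| ||w||)."""
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

def miou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean IoU over a batch of boolean masks of shape (N, H, W).

    A pair of empty masks contributes 0 here (union clamped to 1 to avoid 0/0).
    """
    inter = np.logical_and(pred, gt).sum(axis=(1, 2))
    union = np.logical_or(pred, gt).sum(axis=(1, 2))
    return float(np.mean(inter / np.maximum(union, 1)))

# Toy check: identical inputs give infinite PSNR and an mIoU of 1.0.
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
print(psnr(img, img))            # inf
masks = np.random.rand(4, 32, 32) > 0.5
print(miou(masks, masks))        # 1.0
```

Learned metrics such as LPIPS and FID depend on pretrained networks, so in practice they are computed with existing packages (e.g., `lpips` or `torchmetrics`) rather than re-implemented from the formulas.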
Disclaimer
Feel free to contact us if you have any queries or exciting news. In addition, we welcome all researchers to contribute to this repository and further the knowledge of this field.
If you have other related references, please feel free to create a GitHub issue with the paper information. We will gladly update the repository according to your suggestions. (You can also create pull requests, but it might take some time for us to merge them.)