EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

News

June 23, 2024

Overview

This repository contains the official implementation of EditWorld. In this work, we introduce a new task, world-instructed image editing, which defines and categorizes editing instructions grounded in various world scenarios. We curate a new image editing dataset with world instructions using a set of large pretrained models (e.g., GPT-3.5, Video-LLaVA, and SDXL), and we propose a new post-edit method for world-instructed image editing.

World Instruction vs. Traditional Instruction

[figure: world instruction vs. traditional instruction]

Generated Results of Our EditWorld:

[figure: generated results of EditWorld]

Planning

Codebase

Text-to-image generation branch

First, we employ GPT-3.5 to generate textual quadruples:

python gpt_script/text_img_gen_aigcbest_full.py --define_json gpt_script/define_sample_history/define_sample.json --output_path gpt_script/gen_sample_history/ --output_json text_gen.json
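
Roughly, each textual quadruple couples a description of the input scene, a world instruction, a description of the output scene, and the keywords that drive the edit. The entry below is only an illustration; the field names are assumptions, and the exact schema of text_gen.json is defined by gpt_script/text_img_gen_aigcbest_full.py.

# Illustrative quadruple; the field names are assumptions, not the official schema.
example_quadruple = {
    "input_text": "A lit candle standing on a wooden table in a dark room.",
    "instruction": "What will the scene look like after the candle burns for hours?",
    "output_text": "A melted candle stub with wax pooled on the wooden table.",
    "keywords": "candle burns down",
}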

Then, we convert the text prompts produced by GPT into a dictionary:

python tools/deal_text2json.py --input_json gpt_script/gen_sample_history/text_gen.json --output_json text_gen_full.json
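
The GPT output is plain text, so this step only reorganizes it into structured entries. Below is a minimal sketch of the idea, assuming quadruples shaped like example_quadruple above; the real parsing lives in tools/deal_text2json.py and may differ.

import json

def quadruples_to_json(quadruples, output_json):
    # Write the structured quadruples to the JSON file consumed by the T2I branch.
    with open(output_json, "w", encoding="utf-8") as f:
        json.dump(quadruples, f, ensure_ascii=False, indent=2)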

Finally, we obtain the input-instruction-output triples based on the generated textual quadruples:

python t2i_branch_base.py --text_json text_gen_full.json --save_path datasets/editworld/generated_img/
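
Conceptually, the basic branch renders the input and output texts of each quadruple with a text-to-image model to obtain the image pair. Below is a minimal sketch of that idea using SDXL through diffusers, with a hypothetical render_pair helper; t2i_branch_base.py itself may apply further controls to keep the two images consistent.

import torch
from diffusers import StableDiffusionXLPipeline

# Sketch only: generate the "before" and "after" images of one quadruple with SDXL,
# reusing the same seed so the two images stay visually close.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def render_pair(quadruple, save_prefix, seed=0):
    ori = pipe(quadruple["input_text"],
               generator=torch.Generator(device="cuda").manual_seed(seed)).images[0]
    tar = pipe(quadruple["output_text"],
               generator=torch.Generator(device="cuda").manual_seed(seed)).images[0]
    ori.save(f"{save_prefix}_ori.png")
    tar.save(f"{save_prefix}_tar.png")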

Note that t2i_branch_base.py is a fast, basic version of the text-to-image generation branch; we will improve this part in the future.

Video branch

The video_script directory contains the code for downloading videos from InternVid.
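
InternVid annotations reference source videos by YouTube id together with clip time spans, so a typical download loop fetches the video and cuts out the annotated span. The helper below is a hypothetical illustration using yt-dlp and ffmpeg; the actual code in video_script may differ.

import subprocess

def download_clip(youtube_id, start, end, out_path):
    # Fetch the full video, then cut the annotated span without re-encoding.
    url = f"https://www.youtube.com/watch?v={youtube_id}"
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", "full_video.mp4", url], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", "full_video.mp4", "-ss", start, "-to", end,
                    "-c", "copy", out_path], check=True)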

Dataset

Dataset structure

To obtain the training dataset file train.json, use the script at tools/obtain_datasetjson.py (a sketch of this step follows the structure listing below). The dataset is organized in the following structure:

datasets/
├── editworld/
│   ├── generated_img/
│   │   ├── group_0/
│   │   │   ├── sample0_ori.png
│   │   │   ├── sample0_tar.png
│   │   │   ...
│   │   │   └── img_txt.json
│   │   └── group_1/
│   │   ...
│   ├── video_img/
│   │   ├── group_0/
│   │   │   ├── sample0_ori.png
│   │   │   ├── sample0_tar.png
│   │   │   ...
│   │   │   └── img_txt.json
│   │   └── group_1/
│   │   ...
│   └── human_select_img/
│       ├── group_0/
│       │   ├── sample0_ori.png
│       │   ├── sample0_tar.png
│       │   ...
│       │   └── img_txt.json
│       └── group_1/
│       ...
└── train.json
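
The sketch below shows one way the groups above could be collected into train.json; tools/obtain_datasetjson.py is the authoritative implementation, and both the layout of img_txt.json and the entry fields used here ("ori", "tar", "instruction") are assumptions.

import json
from pathlib import Path

def build_train_json(root="datasets/editworld", out_file="datasets/train.json"):
    # Walk every branch and group, pair the _ori/_tar images with their
    # instruction from img_txt.json, and dump the list as train.json.
    entries = []
    for branch in ["generated_img", "video_img", "human_select_img"]:
        for group in sorted((Path(root) / branch).glob("group_*")):
            meta = json.loads((group / "img_txt.json").read_text())
            for name, info in meta.items():  # assumed: {"sample0": {"instruction": ...}, ...}
                entries.append({
                    "ori": str(group / f"{name}_ori.png"),
                    "tar": str(group / f"{name}_tar.png"),
                    "instruction": info.get("instruction", ""),
                })
    Path(out_file).write_text(json.dumps(entries, indent=2))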

Evaluation dataset link

Our evaluation dataset is available at editworld_test.

Quantitative Comparison of CLIP Score and MLLM Score

IP2P: InstructPix2Pix; MB: MagicBrush. Bold results are the best.

CLIP Score of Text-to-image Branch

Category        | IP2P   | MB     | EditWorld | w/o post-edit
----------------|--------|--------|-----------|--------------
Long-Term       | 0.2140 | 0.1870 | 0.2244    | 0.2294
Physical-Trans  | 0.2186 | 0.2101 | 0.2385    | 0.2467
Implicit-Logic  | 0.2390 | 0.2432 | 0.2542    | 0.2440
Story-Type      | 0.2063 | 0.2070 | 0.2534    | 0.2354
Real-to-Virtual | 0.2285 | 0.2344 | 0.2524    | 0.2435

CLIP Score of Video Branch

Category       | IP2P   | MB     | EditWorld | w/o post-edit
---------------|--------|--------|-----------|--------------
Spatial-Trans  | 0.2175 | 0.1997 | 0.2420    | 0.2286
Physical-Trans | 0.2315 | 0.2278 | 0.2467    | 0.2483
Story-Type     | 0.2318 | 0.2262 | 0.2365    | 0.2399
Exaggeration   | 0.2416 | 0.2328 | 0.2443    | 0.2433

MLLM Score of Both Branches

Category      | IP2P   | MB     | EditWorld | w/o post-edit
--------------|--------|--------|-----------|--------------
Text-to-image | 0.8763 | 0.8455 | 0.8958    | 0.9060
Video         | 0.9493 | 0.9715 | 0.9920    | 0.9891
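
For reference, the CLIP scores above roughly measure how well an edited image matches its target description. The snippet below shows a generic way to compute such a text-image similarity with the transformers CLIP model; the checkpoint and evaluation protocol behind the reported numbers follow our paper and may differ from this sketch.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(image_path, text):
    # Cosine similarity between the CLIP image and text embeddings.
    inputs = processor(text=[text], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()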

Citation

@article{yang2024editworld,
  title={EditWorld: Simulating World Dynamics for Instruction-Following Image Editing},
  author={Yang, Ling and Zeng, Bohan and Liu, Jiaming and Li, Hong and Xu, Minghao and Zhang, Wentao and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2405.14785},
  year={2024}
}