
FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

PyTorch implementation of FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction.

Runze He, Kai Ma, Linjiang Huang, Shaofei Huang, Jialin Gao, Xiaoming Wei, Jiao Dai, Jizhong Han, Si Liu

[arXiv](https://arxiv.org/abs/2409.18071) | Project page


Introduction

<img src="./assets/pipeline.png">

FreeEdit consists of three components: (a) a multi-modal instruction encoder, (b) a detail extractor, and (c) a denoising U-Net. The text instruction and the reference image are first fed into the multi-modal instruction encoder to produce a multi-modal instruction embedding. The reference image is additionally fed into the detail extractor to obtain fine-grained features. The original image latent is concatenated with the noise latent to introduce the original-image condition. The denoising U-Net accepts the resulting 8-channel input and interacts with the multi-modal instruction embedding through cross-attention. The DRRA modules, which connect the detail extractor and the denoising U-Net, integrate the extractor's fine-grained features to promote identity consistency with the reference image. (d) Editing examples obtained with FreeEdit.
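To make the forward pass above concrete, here is a minimal PyTorch-style sketch of one conditioned denoising step. The function name, tensor shapes, and the diffusers-like `unet(...)` / `scheduler.step(...)` calls are illustrative assumptions, not the repository's actual API.

```python
import torch

def denoising_step(unet, scheduler, noisy_latent, orig_latent, instr_embed, t):
    """One denoising step (illustrative sketch, not the repo's API).

    noisy_latent: (B, 4, H, W) latent being denoised
    orig_latent:  (B, 4, H, W) VAE latent of the original image
    instr_embed:  (B, L, D)    multi-modal instruction embedding (text + reference)
    """
    # Channel-wise concatenation forms the 8-channel U-Net input,
    # so the original image conditions every denoising step.
    unet_input = torch.cat([noisy_latent, orig_latent], dim=1)  # (B, 8, H, W)

    # The multi-modal instruction embedding enters through cross-attention,
    # the same way text embeddings condition a standard latent diffusion U-Net.
    noise_pred = unet(unet_input, t, encoder_hidden_states=instr_embed).sample

    # Standard scheduler update to get the latent for the next timestep.
    return scheduler.step(noise_pred, t, noisy_latent).prev_sample
```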

Introducing user-specified visual concepts in image editing is highly practical as these concepts convey the user's intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based image editing, which can accurately reproduce the visual concept from the reference image based on user-friendly language instructions. Our approach leverages the multi-modal instruction encoder to encode language instructions to guide the editing process. This implicit way of locating the editing area eliminates the need for manual editing masks. To enhance the reconstruction of reference details, we introduce the Decoupled Residual ReferAttention (DRRA) module. This module is designed to integrate fine-grained reference features extracted by a detail extractor into the image editing process in a residual way without interfering with the original self-attention. Given that existing datasets are unsuitable for reference-based image editing tasks, particularly due to the difficulty in constructing image triplets that include a reference image, we curate a high-quality dataset, FreeBench, using a newly developed twice-repainting scheme. FreeBench comprises the images before and after editing, detailed editing instructions, as well as a reference image that maintains the identity of the edited object, encompassing tasks such as object addition, replacement, and deletion. By conducting phased training on FreeBench followed by quality tuning, FreeEdit achieves high-quality zero-shot editing through convenient language instructions. We conduct extensive experiments to evaluate the effectiveness of FreeEdit across multiple task types, demonstrating its superiority over existing methods.
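The residual design of DRRA can be pictured with the following hedged sketch: the class name, layer choices, and zero-initialized gate are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DRRABlock(nn.Module):
    """Sketch of a Decoupled Residual ReferAttention block (names illustrative)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.refer_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Zero-initialized gate: reference details are blended in gradually
        # during training, leaving the pretrained self-attention untouched at start.
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, x, ref_feat):
        # Original self-attention branch is left unchanged.
        h, _ = self.self_attn(x, x, x)
        # Reference attention queries the U-Net features against the
        # fine-grained features from the detail extractor.
        r, _ = self.refer_attn(x, ref_feat, ref_feat)
        # Residual fusion: reference details are added on top of the
        # self-attention output instead of replacing it.
        return h + self.scale * r
```

Keeping the reference branch as a gated residual is what lets the module add fine-grained identity cues without interfering with the original self-attention path.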

Citation

@misc{he2024freeedit,
      title={FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction}, 
      author={Runze He and Kai Ma and Linjiang Huang and Shaofei Huang and Jialin Gao and Xiaoming Wei and Jiao Dai and Jizhong Han and Si Liu},
      year={2024},
      eprint={2409.18071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.18071}, 
}

Contact

If you have any comments or questions, feel free to contact Runze He.