
<div align="center"> <h2> LLMs Meet Multimodal Generation and Editing: A Survey </h2> <a href='https://arxiv.org/abs/2405.19334'><img src='https://img.shields.io/badge/ArXiv-2405.19334-red'></a> </div>

🤗 Introduction


📋 Contents

💘 Tips

📍 Multimodal Generation

Image Generation

🔅 LLM-based


Non-LLM-based (CLIP/T5)

Datasets

Video Generation

🔅 LLM-based

Non-LLM-based

Datasets

3D Generation

🔅 LLM-based

Non-LLM-based (CLIP/T5)

Datasets

Audio Generation

🔅 LLM-based

Non-LLM-based

Datasets

Generation with Multiple Modalities

🔅 LLM-based

Non-LLM-based

📍 Multimodal Editing

Image Editing

🔅 LLM-based

Non-LLM-based (CLIP/T5)


Video Editing

🔅 LLM-based


Non-LLM-based (CLIP/T5)


3D Editing

🔅 LLM-based

Non-LLM-based (CLIP/T5)

Audio Editing

🔅 LLM-based

Non-LLM-based (CLIP/T5)

📍 Multimodal Agents

📍 Multimodal Understanding with LLMs

Multiple modalities

Image Understanding

Video Understanding

3D Understanding

Audio Understanding

📍 Multimodal LLM Safety

Attack

Defense and Detection

Alignment

Datasets

3D, Video and Audio Safety

📍 Related Surveys

LLM

Vision

👨‍💻 Team

Here is the list of contributors for each modality covered in this repository.

| Modality/Task | Contributors |
| --- | --- |
| Image Generation | Jingye Chen, Xiaowei Chi, Yingqing He |
| Video Generation | Yingqing He, Xiaowei Chi, Jingye Chen |
| Image and Video Editing | Yazhou Xing |
| 3D Generation and Editing | Hongyu Liu |
| Audio Generation and Editing | Zeyue Tian, Ruibin Yuan |
| LLM Agent | Zhaoyang Liu |
| Safety | Runtao Liu |
| Leaders | Yingqing He, Zhaoyang Liu |

😉 Citation

If you find this work useful in your research, please cite the paper as follows:

```bibtex
@article{he2024llms,
    title={LLMs Meet Multimodal Generation and Editing: A Survey},
    author={He, Yingqing and Liu, Zhaoyang and Chen, Jingye and Tian, Zeyue and Liu, Hongyu and Chi, Xiaowei and Liu, Runtao and Yuan, Ruibin and Xing, Yazhou and Wang, Wenhai and Dai, Jifeng and Zhang, Yong and Xue, Wei and Liu, Qifeng and Guo, Yike and Chen, Qifeng},
    journal={arXiv preprint arXiv:2405.19334},
    year={2024},
}
```

⭐️ Star History

Star History Chart