

<p align="center" width="10%"> <img src="imgs/logo.png" style="width: 30%" align=center> </p>

LLMGA: Multimodal Large Language Model-based Generation Assistant (ECCV2024 Oral)

Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, and Jiaya Jia

<a href="https://llmga.github.io/"><img src="https://img.shields.io/badge/Project-Page-Green"></a> <a href="https://arxiv.org/pdf/2311.16500.pdf"><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/binxia'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/datasets/binxia/LLMGA-datasetv2/tree/main'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>


New Version (Accepted by ECCV2024):

Old Version:

Abstract: In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting & outpainting, and instruction-based editing. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during inpainting and outpainting. Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications in an interactive manner.

Why do you need LLMGA?

<div align=center> <img width="100%" src="imgs/github_poster1.png"/> </div> <div align=center> <img width="100%" src="imgs/demo1.png"/> </div> <div align=center> <img width="100%" src="imgs/demo2.png"/> </div>




Please follow the instructions below to install the required packages.

  1. Clone this repository
git clone https://github.com/dvlab-research/LLMGA.git
cd LLMGA
  2. Install the package
conda create -n llmga python=3.9 -y
conda activate llmga
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
cd ./llmga/diffusers
pip install .
  3. Install additional packages for training cases
pip install -e ".[train]"
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install datasets
pip install albumentations
pip install ninja
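
After running the commands above, a quick way to confirm that the extra training packages are visible in the active environment is to query their installed versions. The sketch below is not part of the repository; the helper name `missing_packages` is hypothetical, and the package list simply mirrors the `pip install` lines in step 3.

```python
from importlib import metadata

# Distribution names taken from the "pip install" lines in step 3.
TRAINING_PACKAGES = ["flash-attn", "datasets", "albumentations", "ninja"]


def missing_packages(names):
    """Return the distributions from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Prints an empty list when every training package is installed.
print(missing_packages(TRAINING_PACKAGES))
```

Any name that is printed can be reinstalled with the matching `pip install` command from the list above.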


<div align=center> <img width="100%" src="imgs/method.png"/> </div>


Training Dataset

We provide the data used to train LLMGA.

Please download the LLMGA datasets and the LLaVA pretrain datasets.

In addition, download the LLaVA-1.5 instruction-tuning dataset llava_v1_5_mix665k.json, and download the images from its constituent datasets (COCO, GQA, OCR-VQA, TextVQA, and Visual Genome, as laid out under llava-imgs in Structure).

Please organize the downloaded data as shown in Structure.

The MLP Projector Pretrained Weights

We recommend downloading the pretrained MLP projector weights and placing them in ./checkpoints as shown in Structure.

Inference Pretrained Weights

Please download MLLM Models and SD models from the following links. For example, you can download LLMGA-MLLM7b and LLMGA-SDXL-T2I to realize LLMGA7b-T2I functionality. Please organize them as in Structure.

<table> <tr> <th align="left">MLLM Model (support English)</th> <th align="center">Pretrained Models</th> </tr> <tr> <td align="left">llmga-vicuna 7b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-vicuna-7b-v1.5-full-finetune/tree/main">Download</a></td> </tr> <tr> <td align="left">llmga-mistral 7b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-mistral_instruct-full-finetune/tree/main">Download</a></td> </tr> <tr> <td align="left">llmga-llama3 8b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-llama3-8b-it-full-finetune/tree/main">Download</a></td> </tr> <tr> <td align="left">llmga-qwen2 0.5b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-Qwen2-0.5B-full-finetune/tree/main">Download</a></td> </tr> <tr> <td align="left">llmga-qwen2 1.5b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-Qwen2-1.5B-full-finetune/tree/main">Download</a></td> </tr> <tr> <td align="left">llmga-qwen2 7b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-Qwen2-7B-full-finetune/tree/main">Download</a></td> </tr> <tr> <td align="left">llmga-phi3 3b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-Phi-3-mini-128k-full-finetune/tree/main">Download</a></td> </tr> <tr> <td align="left">llmga-gemma 2b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-gemma-2b-it-full-finetune/tree/main">Download</a></td> </tr> </table> <table> <tr> <th align="left">MLLM Model (further support Chinese and English)</th> <th align="center">Pretrained Models</th> </tr> <tr> <td align="left">llmga-cn-vicuna 7b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-cn-vicuna-7b-v1.5-full-finetune">Download</a></td> </tr> <tr> <td align="left">llmga-cn-llama3 8b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-cn-llama3-8b-it-full-finetune">Download</a></td> </tr> <tr> <td align="left">llmga-cn-gemma 2b</td> <td 
align="center"><a href="https://huggingface.co/binxia/llmga-cn-gemma-2b-it-full-finetune">Download</a></td> </tr> <tr> <td align="left">llmga-cn-qwen2 0.5b</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-cn-Qwen2-0.5B-full-finetune">Download</a></td> </tr> </table> <table> <tr> <th align="left">SD Model</th> <th align="center">Pretrained Models</th> </tr> <tr> <td align="left">LLMGA-SD15-T2I</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-sd15-t2i-v2">Download</a></td> </tr> <tr> <td align="left">LLMGA-SD15-Inpainting</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-sd15-inpainting-v2">Download</a></td> </tr> <tr> <td align="left">LLMGA-SDXL-T2I</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-sdxl-t2i">Download</a></td> </tr> <tr> <td align="left">LLMGA-SDXL-Inpainting</td> <td align="center"><a href="https://huggingface.co/binxia/llmga-sdxl-inpainting-v2">Download</a></td> </tr> </table>
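
As one possible way to fetch the weights listed above, the sketch below uses `snapshot_download` from the `huggingface_hub` package (assumed to be installed separately; it is not listed in the install steps). The helper names `local_checkpoint_dir` and `download` are hypothetical. The repository IDs come from the download links in the tables, and the target folders follow the ./checkpoints layout in Structure.

```python
from pathlib import Path


def local_checkpoint_dir(repo_id: str, root: str = "./checkpoints") -> Path:
    """Map a Hub repo ID such as 'binxia/llmga-sdxl-t2i' to a folder under root."""
    return Path(root) / repo_id.split("/")[-1]


def download(repo_id: str, root: str = "./checkpoints") -> Path:
    """Download every file of `repo_id` into its folder under `root`."""
    # Imported here so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_checkpoint_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Example for the LLMGA7b-T2I setup described above (large downloads):
# download("binxia/llmga-vicuna-7b-v1.5-full-finetune")
# download("binxia/llmga-sdxl-t2i")
```

The download calls are left commented out because each one transfers a full model.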


The folder structure should be organized as follows before training.

├── llmga
├── scripts
├── work_dirs
├── checkpoints
│   ├── llmga-Phi-3-mini-128k-pretrain
│   ├── llmga-Qwen2-0.5B-pretrain
│   ├── llmga-llama3-8b-pretrain
│   ├── llmga-mistral-pretrain
│   ├── llmga-vicuna-7b-v1.5-pretrain
│   ├── llmga-Phi-3-mini-128k-full-finetune
│   ├── llmga-Qwen2-0.5B-full-finetune
│   ├── llmga-llama3-8b-it-full-finetune
│   ├── llmga-mistral_instruct-full-finetune
│   ├── llmga-vicuna-7b-v1.5-full-finetune
│   ├── llmga-cn-vicuna-7b-v1.5-full-finetune
│   ├── llmga-cn-Qwen2-0.5B-full-finetune
│   ├── llmga-sdxl-t2i
│   ├── llmga-sd15-inpainting-v2
│   ├── llmga-sd15-t2i-v2
├── data
│   ├── jsons
│   │   ├── llmga-data
│   │   │   ├── Edit/train.json
│   │   │   ├── inpainting/train.json
│   │   │   ├── SG/train.json
│   │   │   ├── T2I/train.json
│   │   ├── text-data
│   │   │   ├── alpaca_gpt4_sharegpt_en_clean2.json
│   │   │   ├── lima.json
│   │   │   ├── oasst2.json
│   │   ├── llava_v1_5_mix665k.json
│   ├── llmga-imgs
│   │   ├── COCO
│   │   ├── LAION
│   │   ├── JourneyDB
│   ├── llava_pretrain
│   │   ├── images
│   ├── llava-imgs
│   │   ├── coco
│   │   │   ├── train2017
│   │   ├── gqa
│   │   │   ├── images
│   │   ├── ocr_vqa
│   │   │   ├── images
│   │   ├── textvqa
│   │   │   ├── train_images
│   │   ├── vg
│   │   │   ├── VG_100K
│   │   │   ├── VG_100K_2
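
Before launching training, it can help to check the layout programmatically. The following is a minimal sketch, not a script shipped with the repository: the function name `missing_paths` is hypothetical, and the path lists are copied from the data part of the tree above (checkpoint folders are left out because they depend on which models you use).

```python
from pathlib import Path

# Directories and files copied from the "data" part of the tree above.
EXPECTED_DIRS = [
    "data/llmga-imgs/COCO",
    "data/llmga-imgs/LAION",
    "data/llmga-imgs/JourneyDB",
    "data/llava_pretrain/images",
    "data/llava-imgs/coco/train2017",
    "data/llava-imgs/gqa/images",
    "data/llava-imgs/ocr_vqa/images",
    "data/llava-imgs/textvqa/train_images",
    "data/llava-imgs/vg/VG_100K",
    "data/llava-imgs/vg/VG_100K_2",
]
EXPECTED_FILES = [
    "data/jsons/llmga-data/Edit/train.json",
    "data/jsons/llmga-data/inpainting/train.json",
    "data/jsons/llmga-data/SG/train.json",
    "data/jsons/llmga-data/T2I/train.json",
    "data/jsons/text-data/alpaca_gpt4_sharegpt_en_clean2.json",
    "data/jsons/text-data/lima.json",
    "data/jsons/text-data/oasst2.json",
    "data/jsons/llava_v1_5_mix665k.json",
]


def missing_paths(root="."):
    """Return the expected directories and files that do not exist under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Run from the repository root; prints nothing when the layout is complete.
    for path in missing_paths("."):
        print("missing:", path)
```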


LLMGA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
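
The batch-size rule above can be turned into a small calculation. This is an illustrative sketch only: the function name `accumulation_steps` is hypothetical, and the numbers (16 samples per device on 8 GPUs) are made-up example values, not the settings used in the provided scripts.

```python
def accumulation_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Return the gradient_accumulation_steps that keeps the global batch size fixed.

    global_batch = per_device_batch * gradient_accumulation_steps * num_gpus
    """
    samples_per_step = per_device_batch * num_gpus
    if global_batch % samples_per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"{per_device_batch} x {num_gpus}"
        )
    return global_batch // samples_per_step


# Hypothetical reference run: 16 per device x 1 accumulation step x 8 GPUs.
reference_global_batch = 16 * 1 * 8

# The same global batch on 2 GPUs with 8 samples per device.
steps = accumulation_steps(reference_global_batch, per_device_batch=8, num_gpus=2)
print(steps)  # 128 // (8 * 2) = 8
```

Read the actual per_device_train_batch_size, gradient_accumulation_steps, and GPU count from the script you are adapting, compute the global batch size once, and then solve for the new accumulation steps.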

Before training, please make sure you have downloaded and organized the data as described in Preparation. The commands below use llmga-vicuna 7b as an example. For training scripts for other models, please check the ./scripts folder.


Pretraining

bash scripts/pretrain_vicuna_7b.sh

First Stage Training

bash scripts/train_llmga_s1_7b_vicuna.sh

Second Stage Training

Train LLMGA based on SD1.5-T2I:

bash scripts/train_llmga_s2_sd15_t2i.sh

Train LLMGA based on SD1.5-Inpainting:

bash scripts/train_llmga_s2_sd15_inpaint.sh


CLI Inference

Use LLMGA without the Gradio interface. The CLI also supports multi-GPU inference as well as 4-bit and 8-bit quantized inference. The examples below cover T2I, inpainting, and instruction-based editing. For more model inference scripts, please check the ./scripts folder.

For the T2I generation task:

bash scripts/test-llmga-sdxl-t2i.sh

For the inpainting or outpainting task:

bash scripts/test-llmga-sd15-inpainting.sh

For the instruction-based editing task:

bash scripts/test-llmga-sd15-editing.sh

Gradio Inference

bash scripts/run_gradio_t2i.sh


If you find this repo useful for your research, please consider citing the paper:

@inproceedings{xia2024llmga,
  title={LLMGA: Multimodal Large Language Model based Generation Assistant},
  author={Xia, Bin and Wang, Shiyin and Tao, Yingfan and Wang, Yitong and Jia, Jiaya},
  booktitle={ECCV},
  year={2024}
}


We would like to thank the following repos for their great work: