Home

Awesome

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Long Lian, Boyi Li, Adam Yala, Trevor Darrell at UC Berkeley/UCSF.

Transactions on Machine Learning Research (TMLR), with Featured Certification

Paper | Project Page | 5-minute Blog Post | HuggingFace Demo (updated!) | Citation | LLM-grounded Video Diffusion Models

TL;DR: Text Prompt -> LLM as a Request Parser -> Intermediate Representation (such as an image layout) -> Stable Diffusion -> Image.

Main Image Visualizations: Enhanced Prompt Understanding

Updates

[2024.1] Added a result with self-hosted Mixtral-8x7B-Instruct-v0.1 (see our reference benchmark results section). Surprisingly, the Mixtral model's performance is comparable with GPT-3.5. This shows that it's possible to self-host LMD/LMD+ without external API calls to LLMs to achieve good results.

[2023.11] Our LLM-grounded Diffusion (LMD+) has been officially integrated to upstream diffusers v0.24.0! This is an example colab that shows using our pipeline with official diffusers. The implementation in upstream diffusers is a simplified LMD+, and we recommend using the current full repo to reproduce our results.

<details> <summary>Using our pipeline with only a few lines of code with official diffusers</summary>
# Requires diffusers >= 0.24.0

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "longlian/lmd_plus",
    custom_pipeline="llm_grounded_diffusion",
    custom_revision="main",
    variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# An example prompt with LLM response
prompt = "a waterfall and a modern high speed train in a beautiful forest with fall foliage"
llm_response = """
[('a waterfall', [71, 105, 148, 258]), ('a modern high speed train', [255, 223, 181, 149])]
Background prompt: A beautiful forest with fall foliage
Negative prompt:
"""

phrases, boxes, bg_prompt, neg_prompt = pipe.parse_llm_response(llm_response)

# Use `LLMGroundedDiffusionPipeline` to generate an image
images = pipe(
    prompt=prompt,
    negative_prompt=neg_prompt,
    phrases=phrases,
    boxes=boxes,
    gligen_scheduled_sampling_beta=0.4,
    output_type="pil",
    num_inference_steps=50,
    lmd_guidance_kwargs={}
).images

# PIL Image:
images[0]
</details>

[2023.10] Our repo now supports using SDXL for high-quality generation with SDXL Refiner! Simply add --sdxl to generation command to use it. You can also use --sdxl-step-ratio to control the strength of the refinement (use 0.5 for stronger refinement and 0.1 for weaker refinement). See examples above.

[2023.10] Please also check out our new work <a href='https://llm-grounded-video-diffusion.github.io/'>LLM-grounded Video Diffusion Models (LVD)</a>, which shows that LLMs have knowledge in their weights that can ground video diffusion models 🔥🔥🔥!

[2023.8] Our repo has been largely improved: now we have a repo with many methods implemented, including our training-free LMD and LMD+ (LMD with GLIGEN adapters).

[2023.6] Our huggingface WebUI demo for stage 1 and 2 is updated: now we support enabling each of the guidance components to get a taste of contributions! Check it out here.

Our WebUI is also available to run locally. The instructions to run our WebUI locally to get faster generation without queues are here.

Our repo implements the following layout-to-image methods (stage 2)

These methods can be freely combined with our proposed LLM-based box-to-layout method (stage 1) also implemented in this repo.

Feel free to contact me / submit a pull request to add your methods!

Our repo's features

<details> <summary> And more exciting features! Expand to see. </summary> </details>

LLM-grounded Diffusion (LMD)

We provide instructions to run our code in this section.

Installation

pip install -r requirements.txt

Stage 1: Text-to-Layout Generation

Note that we have uploaded the layout caches into this repo so that you can skip this step if you don't need layouts for new prompts.

Since we have cached the layout generation (which will be downloaded when you clone the repo), you need to remove the cache in cache directory if you want to re-generate the layout with the same prompts.

Our layout generation format: The LLM takes in a text prompt describing the image and outputs three elements: 1. captioned boxes, 2. a background prompt, 3. a negative prompt (useful if the LLM wants to express negation). The template and examples are in prompt.py. You can edit the template and the parsing function to ask the LLM to generate additional things or even perform chain-of-thought for better generation.

Option 1 (automated): Use an OpenAI API key

If you have an OpenAI API key, you can put the API key in utils/api_key.py or set OPENAI_API_KEY environment variable. Then you can use OpenAI's API for batch text-to-layout generation by querying an LLM, with GPT-4 as an example:

python prompt_batch.py --prompt-type demo --model gpt-4 --auto-query --always-save --template_version v0.1

--prompt-type demo includes a few prompts for demonstrations. The layout generation will be cached so it does not query the LLM again with the same prompt (lowers the cost).

You can visualize the bounding boxes in img_generations/imgs_demo_templatev0.1.

Option 2 (free): Manually copy and paste to ChatGPT

python prompt_batch.py --prompt-type demo --model gpt-4 --always-save --template_version v0.1

Then copy and paste the template to ChatGPT. Note that you want to use GPT-4 or change the --model to gpt-3.5 in order to match the cache file name. Then copy the response back. The generation will be cached.

If you want to visualize before deciding to save or not, you don't need to pass in --always-save.

Run our benchmark on text-to-layout generation evaluation

We provide a benchmark that applies both to stage 1 and stage 2. This benchmarks includes a set of prompts with four tasks (negation, numeracy, attribute binding, and spatial relationships) as well as unified benchmarking code for all implemented methods and both stages.

This will generate layouts from the prompts in the benchmark (with --prompt-type lmd) and evaluate the results:

python prompt_batch.py --prompt-type lmd --model gpt-3.5 --auto-query --always-save --template_version v0.1
python scripts/eval_stage_one.py --prompt-type lmd --model gpt-3.5 --template_version v0.1
<details> <summary>Our reference benchmark results (stage 1, evaluating the generated layouts only)</summary>
MethodNegationNumeracyAttributionSpatialOverall
GPT-3.5100971009999.0%
GPT-4100100100100100.0%
<!-- * GPT-3.5: ``` Eval type: negation, success: 100/100, rate: 1.00 Eval type: numeracy, success: 97/100, rate: 0.97 Eval type: attribution, success: 100/100, rate: 1.00 Eval type: spatial, success: 99/100, rate: 0.99 Overall: success: 396/400, rate: 0.99 ``` * GPT-4: ``` Eval type: negation, success: 100/100, rate: 1.00 Eval type: numeracy, success: 100/100, rate: 1.00 Eval type: attribution, success: 100/100, rate: 1.00 Eval type: spatial, success: 100/100, rate: 1.00 Overall: success: 400/400, rate: 1.00 ``` --> </details>

Stage 2: Layout-to-Image Generation

Note that since we provide caches for stage 1, you don't need to run stage 1 on your own for cached prompts that we provide (i.e., you don't need an OpenAI API key or to query an LLM).

Run layout-to-image generation using the gpt-4 cache and LMD+:

python generate.py --prompt-type demo --model gpt-4 --save-suffix "gpt-4" --repeats 5 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd_plus --no-scale-boxes-default --template_version v0.1

--save-suffix is the suffix added to the name of the run. You can change that if you change the args to mark the setting in the runs. --run-model specifies the method to run. You can set to LMD/LMD+ or the implemented baselines (with examples below). Use --use-sdv2 to enable SDv2.

Run our benchmark on layout-to-image generation evaluation

We use a unified evaluation metric as stage 1 in stage 2 (--prompt-type lmd). Since we have layout boxes for stage 1 but only images for stage 2, we use OWL-ViT in order to detect the objects and ensure they are generated (or not generated in negation) in the right number, with the right attributes, and in the right place.

This runs generation with LMD+ and evaluate the generation:

# Use GPT-3.5 layouts
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd_plus --no-scale-boxes-default --template_version v0.1
python scripts/owl_vit_eval.py --model gpt-3.5 --run_base_path img_generations/img_generations_templatev0.1_lmd_plus_lmd_gpt-3.5/run0 --skip_first_prompts 0 --prompt_start_ind 0 --verbose --detection_score_threshold 0.15 --nms_threshold 0.15 --class-aware-nms
# Use GPT-4 layouts
python generate.py --prompt-type lmd --model gpt-4 --save-suffix "gpt-4" --repeats 1 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd_plus --no-scale-boxes-default --template_version v0.1
python scripts/owl_vit_eval.py --model gpt-4 --run_base_path img_generations/img_generations_templatev0.1_lmd_plus_lmd_gpt-4/run0 --skip_first_prompts 0 --prompt_start_ind 0 --verbose --detection_score_threshold 0.15 --nms_threshold 0.15 --class-aware-nms
Our reference benchmark results
MethodNegationNumeracyAttributionSpatialOverall
SD v1.52839522836.8%
LMD+ (GPT-3.5)10086696780.5%
LMD+ (GPT-4)10084798286.3%
LMD+ (StableBeluga2*)8860566467.0%
LMD+ (Mixtral-8x7B-Instruct-v0.1*)9872627877.5%

* StableBeluga2 is an open-sourced model based on Llama 2. Mixtral-8x7B-Instruct-v0.1 is an open-sourced MoE model that can be served on 1x A100 if quantized. We discover that the fact that LLMs' spatial reasoning ability is also applicable to open-sourced models. Surprisingly, the Mixtral model's performance is close to the one with GPT-3.5. This shows that it's possible to self-host LMD/LMD+ without external API calls to LLMs to achieve good results. However, it can still be improved, compared to proprietary model GPT-4. We leave LLM fine-tuning for better layout generation in stage 1 to future research.

To run generation with LMD with original SD weights and evaluate the generation:

<details> <summary>Generate and evaluate samples with LMD</summary>
# Use GPT-3.5 layouts
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd --no-scale-boxes-default --template_version v0.1
python scripts/owl_vit_eval.py --model gpt-3.5 --run_base_path img_generations/img_generations_templatev0.1_lmd_lmd_gpt-3.5/run0 --skip_first_prompts 0 --prompt_start_ind 0 --verbose --detection_score_threshold 0.15 --nms_threshold 0.15 --class-aware-nms

Note: You can enable autocast (mixed precision) to reduce the memory used in generation with --use_autocast 1 with potentially slightly lower generation quality.

</details> <details> <summary>Generate samples with other stage 2 baseline methods</summary>
# SD v1.5
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model sd --no-scale-boxes-default --template_version v0.1 --ignore-negative-prompt
# MultiDiffusion (training-free)
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model multidiffusion --no-scale-boxes-default --template_version v0.1 --multidiffusion_bootstrapping 10
# Backward Guidance (training-free)
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model backward_guidance --no-scale-boxes-default --template_version v0.1
# Boxdiff (training-free, our reimplementation)
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model boxdiff --no-scale-boxes-default --template_version v0.1
# GLIGEN (training-based)
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model gligen --no-scale-boxes-default --template_version v0.1

Note: we set --ignore-negative-prompt in SD v1.5 so that SD generation does not depend on the LLM and follows a text-to-image generation baseline (otherwise we take the LLM-generated negative prompts and put them into the negative prompt). For other baselines, you can feel free to generate. Evaluation is similar to LMD+, except you need to change the image path in the evaluation command.

</details> <details> <summary>Our reference benchmark results (stage 2, LMD, without autocast)</summary>
MethodNegationNumeracyAttributionSpatialOverall
SD v1.52839522836.8%
LMD (GPT-3.5)10062657976.5%
</details>
Ablation: Our reference benchmark results by combining LMD stage 1 with various layout-to-image baselines as stage 2</summary>

The stage 1 in this table is LMD (GPT-3.5) unless stated otherwise. We keep stage 1 in LMD the same and replace the stage 2 by other layout-to-image methods.

Stage 1 / Stage 2 MethodNegation*NumeracyAttributionSpatialOverall
None / SD v1.52839522836.8%
Training-free: <br/> (uses SD weights out-of-the-box)
LMD / MultiDiffusion10030423652.0%
LMD / Backward Guidance10042366159.8%
LMD / BoxDiff10032556262.3%
LMD / LMD10062657976.5%
Training-based:
LMD / GLIGEN10057574564.8%
LMD / LMD+**10086696780.5%
LMD / LMD+ (GPT-4)10084798286.3%

* All methods equipped with LMD stage 1 understand negation well because LMD stage 1 generates the negative prompts, which is applicable to all methods that use classifier-free guidance on SD.

** Note that LMD+ uses attention control that we proposed in addition to GLIGEN, which has much better generation compared to using only GLIGEN, showing that our proposed training-free control is orthogonal to training-based methods such as GLIGEN.

FAQs

How do I use open-source LLMs (e.g., Mixtral, LLaMA-2, StableBeluga2, Vicuna)?

You can install fastchat and start a LLM server (note that the server does not have to be the same one as this repo). This requires running three terminals (e.g., three tmux windows). Using Mixtral-8x7B-Instruct-v0.1 as an example (which performs the best among all open-source LLMs from our experience):

pip install fschat

export FASTCHAT_WORKER_API_TIMEOUT=600
# Run this in window 1
python3 -m fastchat.serve.controller

# Run this in window 2
CUDA_VISIBLE_DEVICES=0,1 python3 -m fastchat.serve.model_worker --model-path mistralai/Mixtral-8x7B-Instruct-v0.1 --num-gpus 2 --max-gpu-memory 48GiB
# Command for StableBeluga2:
# CUDA_VISIBLE_DEVICES=0,1 python3 -m fastchat.serve.model_worker --model-path stabilityai/StableBeluga2 --num-gpus 2

# Run this in window 3
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

StableBeluga2 is a 70b model so you need at least 2 GPUs, and Mixtral-8x7B-Instruct-v0.1 also requires 2 GPUs (with 80GB memory), but you can run smaller models with only 1 GPU. Simply replace the model path to the huggingface model key (e.g., meta-llama/Llama-2-70b-hf, lmsys/vicuna-33b-v1.3). Note that you probably want models without RLHF (e.g., not Llama-2-70b-chat-hf), as we use text completion endpoints for layout generation, although Mixtral with instruction tuning seems to perform slightly better than Mixtral without instruction tuning. Then change the --model argument to the intended model.

If your LLM server is not on localhost:8000, you can change the API endpoint URL in utils/llm.py. If you model name is not in the list in utils/llm.py, you can add it to the model_names list. We created this list to prevent typos in the command.

My LLM queries finished very quickly, why?

Check whether you have a lot of cache hits in the output. If so, you might want to use the cache (you are all set) or remove the cache in cache directory to regenerate.

Note that we allows different versions of templates so that you can manage several templates easily without cache overwrites.

Contact us

Please contact Long (Tony) Lian if you have any questions: longlian@berkeley.edu.

Acknowledgements

This repo uses code from diffusers, GLIGEN, and layout-guidance. This code also has an implementation of boxdiff and MultiDiffusion (region control). Using their code means adhering to their license. The code that is not from other repos adhere to MIT License, with an additional note:

THIS SOFTWARE AND/OR DATA WAS DEPOSITED IN THE BAIR OPEN RESEARCH COMMONS REPOSITORY ON 9/9/24.

Citation

If you use our work or our implementation in this repo, or find them helpful, please consider giving a citation.

@article{lian2023llmgrounded,
    title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models}, 
    author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},
    journal={arXiv preprint arXiv:2305.13655},
    year={2023}
}