
UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image

Unofficial implementation of UniTune on top of Stable Diffusion (not the official code), based on https://github.com/JoePenna/Dreambooth-Stable-Diffusion.

Colab links

Simple version, based on https://github.com/ShivamShrirao/diffusers/blob/main/examples/dreambooth/DreamBooth_Stable_Diffusion.ipynb: https://colab.research.google.com/drive/1V6nM-iGHQi7TDqMDEn4ohO6-MmiQMm6L?usp=sharing

Version with support for the t0 parameter: https://colab.research.google.com/drive/1WA1CDQrKDbaLjvfxvQXyUGsZ5ldx9IIe?usp=sharing

Train

python3 -u main.py --base configs/stable-diffusion/v1-finetune_unfrozen.yaml -t --actual_resume ./sd-v1-4-full-ema.ckpt -n "dog" --gpus 0, --max_training_steps 100 --token "mmdd111" --data_root ./training_images/dog --token_only --n_accumulate 4
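This fine-tunes the checkpoint on the images in --data_root under the rare token passed via --token ("mmdd111"), which the inference prompts below reference; --max_training_steps keeps the fine-tuning short, in the spirit of UniTune. Stable Diffusion v1 works at 512x512, so it can help to crop and resize the training image(s) first. The snippet below is an optional, hypothetical preprocessing helper, not part of this repo; the folder path is just the example from the command above.

```python
# Optional helper (not part of this repo): center-crop and resize the
# image(s) in --data_root to 512x512 before fine-tuning.
from pathlib import Path
from PIL import Image

data_root = Path("./training_images/dog")  # same folder passed via --data_root

for path in sorted(data_root.iterdir()):
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
        continue
    img = Image.open(path).convert("RGB")
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((512, 512), Image.LANCZOS)
    img.save(path)
```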

Inference

python3 stable_txt2img.py --ddim_eta 0.0 --n_samples 4 --n_iter 1 --ddim_steps 50 --ckpt logs/dog2022-11-09T03-22-15_dog/checkpoints/last.ckpt --prompt "mmdd111 dog in pixar style" --blendmodel result

python3 stable_txt2img.py --ddim_eta 0.0 --n_samples 4 --n_iter 1 --ddim_steps 50 --ckpt logs/dog2022-11-09T03-22-15_dog/checkpoints/last.ckpt --prompt "mmdd111 dog sitting down" result

python3 stable_txt2img.py --ddim_eta 0.0 --n_samples 4 --n_iter 1 --ddim_steps 50 --ckpt logs/dog2022-11-09T03-22-15_dog/checkpoints/last.ckpt --prompt "mmdd111 lion" result

python3 stable_txt2img.py --ddim_eta 0.0 --n_samples 4 --n_iter 1 --ddim_steps 50 --ckpt logs/dog2022-11-09T03-22-15_dog/checkpoints/last.ckpt --prompt "mmdd111 dog in Simpsons style" --blendmodel result

python3 stable_txt2img.py --ddim_eta 0.0 --n_samples 4 --n_iter 1 --ddim_steps 50 --ckpt logs/dog2022-11-09T03-22-15_dog/checkpoints/last.ckpt --prompt "mmdd111 dog in Simpsons style" --blendmodel --scale 16 result
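The second Colab above exposes a t0 parameter: the UniTune paper describes optionally starting sampling from a noised version of the base image at an intermediate timestep rather than from pure noise, which keeps the edit closer to the original. The repo's scripts handle this internally; purely as an illustration of the idea, here is a rough sketch using Hugging Face Diffusers' img2img pipeline, where strength plays a role similar to t0. The model and image paths are placeholders, and this is not this repo's code path.

```python
# Illustration only: t0-style editing via Diffusers img2img. `strength` controls
# how far the source image is noised before denoising resumes -- lower values
# stay closer to the original image.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "path/to/finetuned-model",   # placeholder: a fine-tuned model in Diffusers format
    torch_dtype=torch.float16,
).to("cuda")

init = Image.open("training_images/dog/0.jpg").convert("RGB").resize((512, 512))  # placeholder path
result = pipe(
    prompt="mmdd111 dog in pixar style",
    image=init,
    strength=0.7,        # roughly analogous to t0: fraction of the diffusion that is re-run
    guidance_scale=7.5,
).images[0]
result.save("edited.png")
```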

Implementation details

Index

The Repo Formerly Known As "Dreambooth"


<a name="notes-by-joe-penna"></a> Notes by Joe Penna

INTRODUCTIONS!

Hi! My name is Joe Penna.

You might have seen a few YouTube videos of mine under MysteryGuitarMan. I'm now a feature film director. You might have seen ARCTIC or STOWAWAY.

For my movies, I need to be able to train specific actors, props, locations, etc. So, I made a bunch of changes to @XavierXiao's repo in order to train people's faces.

I can't release all the tests for the movie I'm working on, but when I test with my own face, I release those on my Twitter page - @MysteryGuitarM.

Lots of these tests were done with a buddy of mine -- Niko from CorridorDigital. It might be how you found this repo!

I'm not really a coder. I'm just stubborn, and I'm not afraid of googling. So, eventually, some really smart folks joined in and have been contributing. In this repo, specifically: @djbielejeski @gammagec @MrSaad –– but so many others in our Discord!

This is no longer my repo. This is the people-who-wanna-see-Dreambooth-on-SD-working-well's repo!

Now, if you wanna try to do this... please read the warnings below first:

WARNING!

<a name="setup"></a> Setup

<a name="easy-runpod-instructions"></a> Easy RunPod Instructions

VIDEO INSTRUCTIONS

<a name="vast-ai-setup"></a> Vast.AI Instructions

<a name="text-vs-dreamb"></a> Textual Inversion vs. Dreambooth

The majority of the code in this repo was written by Rinon Gal et al., the authors of the Textual Inversion research paper. Though a few ideas about regularization images and prior-preservation loss (ideas from "Dreambooth") were added in, out of respect to both the MIT team and the Google researchers, I'm renaming this fork to: "The Repo Formerly Known As "Dreambooth"".
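For reference, the prior-preservation idea from the Dreambooth paper adds a second denoising term computed on regularization images generated with the bare class prompt, so the class concept does not collapse onto the new token. Roughly, with notation simplified (this is a paraphrase of the paper, not code from this repo):

$$
\mathcal{L} = \mathbb{E}\left[\lVert \epsilon - \epsilon_\theta(z_t, c) \rVert_2^2\right] + \lambda\, \mathbb{E}\left[\lVert \epsilon' - \epsilon_\theta(z'_{t'}, c_{\text{pr}}) \rVert_2^2\right]
$$

where c conditions on "[token] [class]", c_pr conditions on the plain class prompt used for the regularization images, and λ weights the prior term.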

For an alternate implementation, please see "Alternate Option" below.

<a name="using-the-generated-model"></a> Using the generated model

The ground truth (real picture, caution: very beautiful woman) <br><img src="https://user-images.githubusercontent.com/100188076/192403948-8d1d0e50-3e9f-495f-b8ba-1bcb6b536fc8.png" width="200">

Same prompt for all of these images below:

sks person | woman person | Natalie Portman person | Kate Mara person
<img src="https://user-images.githubusercontent.com/100188076/192403506-ab96c652-f7d0-47b0-98fa-267defa1e511.png" width="200"><img src="https://user-images.githubusercontent.com/100188076/192403491-cb258777-5091-4492-a6cc-82305fa729f4.png" width="200"><img src="https://user-images.githubusercontent.com/100188076/192403437-f9a93720-d41c-4334-8901-fa2d2a10fe36.png" width="200"><img src="https://user-images.githubusercontent.com/100188076/192403461-1f6972d9-64d0-46b0-b2ed-737e47aae31e.png" width="200">

<a name="debugging-your-results"></a> Debugging your results

❗❗ THE NUMBER ONE MISTAKE PEOPLE MAKE ❗❗

Prompting with just your token, i.e. "joepenna" instead of "joepenna person".

If you trained with joepenna under the class person, the model should only know your face as:

joepenna person

Example Prompts:

🚫 Incorrect (missing person following joepenna)

portrait photograph of joepenna 35mm film vintage glass

✅ This is right (person is included after joepenna)

portrait photograph of joepenna person 35mm film vintage glass

You might sometimes get someone who kinda looks like you from just joepenna (especially if you trained for too many steps), but that's only because this current iteration of Dreambooth overtrains the token so much that it bleeds into the class.


☢ Be careful with the types of images you train

While training, Stable doesn't know that you're a person. It's just going to mimic what it sees.

So, if your training images look like this:

You're only going to get generations of you outside next to a spiky tree, wearing a white-and-gray shirt, in the style of... well, a selfie photograph.

Instead, this training set is much better:

The only thing that is consistent between images is the subject. So, Stable will look through the images and learn only your face, which will make "editing" it into other styles possible.

Oh no! You're not getting good generations!

<a name="they-dont-look-like-you"></a> OPTION 1: They're not looking like you at all! (Train longer, or get better training images)

Are you sure you're prompting it right?

It should be <token> <class>, not just <token>. For example:

JoePenna person, portrait photograph, 85mm medium format photo

If it still doesn't look like you, you didn't train long enough.


<a name="they-sorta-look-like-you-but-exactly-like-your-training-images"></a> OPTION 2: They're looking like you, but are all looking like your training images. (Train for less steps, get better training images, fix with prompting)

Okay, a few reasons why: you might have trained too long... or your images were too similar... or you didn't train with enough images.

No problem. We can fix that with the prompt. Stable Diffusion puts a LOT of weight on whatever you type first, so save your token for later in the prompt:

an exquisite portrait photograph, 85mm medium format photo of JoePenna person with a classic haircut


<a name="they-look-like-you-but-not-when-you-try-different-styles"></a> OPTION 3: They're looking like you, but not when you try different styles. (Train longer, get better training images)

You didn't train long enough...

No problem. We can fix that with the prompt:

JoePenna person in a portrait photograph, JoePenna person in a 85mm medium format photo of JoePenna person

More tips and help here: Stable Diffusion Dreambooth Discord

<a name="hugging-face-diffusers"></a> Hugging Face Diffusers - Alternate Option

Dreambooth is now supported in HuggingFace Diffusers for training with Stable Diffusion.

Try it out here:

Open In Colab
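A model trained with the Diffusers Dreambooth script can then be used with the standard text-to-image pipeline. Here is a minimal sketch; the model path is a placeholder for wherever your fine-tuned weights were saved in Diffusers format.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Dreambooth fine-tuned model saved in Diffusers format (path is a placeholder).
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/dreambooth-model",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "joepenna person, portrait photograph, 85mm medium format photo",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("sample.png")
```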