minDALL-E on Conceptual Captions

minDALL-E, named after minGPT, is a 1.3B-parameter text-to-image generation model trained on 14 million image-text pairs for non-commercial purposes.

a painting of a bird in the style of asian painting
a photo of san francisco's golden gate bridge in black and white tone

Environment Setup

PyTorch == 1.8.0
CUDA >= 10.1
pip install -r requirements.txt
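
The version requirements above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: `parse_version` and `check_environment` are hypothetical helper names, and the check assumes that `PyTorch == 1.8.0` means an exact match on the release number (ignoring build suffixes such as `+cu111`).

```python
import re


def parse_version(text):
    """Turn a string such as '1.8.0+cu111' or '10.1' into a 3-tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing patch component (e.g. '10.1') is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def check_environment(torch_version, cuda_version):
    """Return a list of problems; an empty list means the requirements look satisfied."""
    problems = []
    if parse_version(torch_version) != (1, 8, 0):
        problems.append(f"PyTorch 1.8.0 is required, found {torch_version}")
    if cuda_version is None:
        problems.append("this PyTorch build has no CUDA support")
    elif parse_version(cuda_version)[:2] < (10, 1):
        problems.append(f"CUDA >= 10.1 is required, found {cuda_version}")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        # torch.version.cuda is None for CPU-only builds.
        for problem in check_environment(torch.__version__, torch.version.cuda):
            print(problem)
```

Running the script prints one line per unmet requirement and nothing when both requirements are met.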

Model Checkpoint

Sampling

import numpy as np  # needed for np.transpose below
from matplotlib import pyplot as plt
import clip
from dalle.models import Dalle
from dalle.utils.utils import set_seed, clip_score

device = 'cuda:0'
set_seed(0)

prompt = "A painting of a monkey with sunglasses in the frame"
model = Dalle.from_pretrained('minDALL-E/1.3B')  # This will automatically download the pretrained model.
model.to(device=device)

# Sampling
images = model.sampling(prompt=prompt,
                        top_k=256, # It is recommended that top_k is set lower than 256.
                        top_p=None,
                        softmax_temperature=1.0,
                        num_candidates=96,
                        device=device).cpu().numpy()
images = np.transpose(images, (0, 2, 3, 1))

# CLIP Re-ranking
model_clip, preprocess_clip = clip.load("ViT-B/32", device=device)
model_clip.to(device=device)
rank = clip_score(prompt=prompt,
                  images=images,
                  model_clip=model_clip,
                  preprocess_clip=preprocess_clip,
                  device=device)

# Plot images
images = images[rank]
plt.imshow(images[0])
plt.show()
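
The `top_k`, `top_p`, and `softmax_temperature` arguments above control how each image token is drawn. The sketch below illustrates the general technique in plain Python. It is not the repository's implementation: `filter_logits` and `sample_token` are hypothetical names, and the order of operations (temperature, then top-k, then top-p) is an assumption.

```python
import math
import random


def filter_logits(logits, top_k=None, top_p=None, temperature=1.0):
    """Return sampling probabilities; tokens outside the top-k / top-p set get 0."""
    # Temperature below 1.0 sharpens the distribution, above 1.0 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices from most to least likely.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        keep = keep[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, mass = [], 0.0
        for i in keep:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        keep = kept

    keep_set = set(keep)
    kept_mass = sum(probs[i] for i in keep)
    # Renormalize so the surviving tokens sum to 1.
    return [probs[i] / kept_mass if i in keep_set else 0.0 for i in range(len(probs))]


def sample_token(logits, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filter_logits(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # toy values, not model output
    print(filter_logits(toy_logits, top_k=2))
    print(sample_token(toy_logits, random.Random(0), top_k=2))
```

A smaller `top_k` restricts sampling to fewer high-probability tokens, which generally trades diversity for fidelity.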

Samples (Top-K=256, Temperature=1.0)

<p float="left"> <img src="/assets/a painting of a cat with sunglasses in the frame_0.png" width="128" /> <img src="/assets/a painting of a cat with sunglasses in the frame_1.png" width="128" /> <img src="/assets/a painting of a cat with sunglasses in the frame_2.png" width="128" /> <img src="/assets/a painting of a cat with sunglasses in the frame_3.png" width="128" /> <img src="/assets/a painting of a cat with sunglasses in the frame_4.png" width="128" /> <img src="/assets/a painting of a cat with sunglasses in the frame_5.png" width="128" /> </p>
<p float="left"> <img src="/assets/a painting of a dog with sunglasses in the frame_0.png" width="128" /> <img src="/assets/a painting of a dog with sunglasses in the frame_1.png" width="128" /> <img src="/assets/a painting of a dog with sunglasses in the frame_2.png" width="128" /> <img src="/assets/a painting of a dog with sunglasses in the frame_3.png" width="128" /> <img src="/assets/a painting of a dog with sunglasses in the frame_4.png" width="128" /> <img src="/assets/a painting of a dog with sunglasses in the frame_5.png" width="128" /> </p>
<p float="left"> <img src="/assets/A large pink elephant walking on the beach_0.png" width="128" /> <img src="/assets/A large pink elephant walking on the beach_1.png" width="128" /> <img src="/assets/A large pink elephant walking on the beach_2.png" width="128" /> <img src="/assets/A large pink elephant walking on the beach_3.png" width="128" /> <img src="/assets/A large pink elephant walking on the beach_4.png" width="128" /> <img src="/assets/A large pink elephant walking on the beach_5.png" width="128" /> </p>
<p float="left"> <img src="/assets/A large black elephant walking on the beach_0.png" width="128" /> <img src="/assets/A large black elephant walking on the beach_1.png" width="128" /> <img src="/assets/A large black elephant walking on the beach_2.png" width="128" /> <img src="/assets/A large black elephant walking on the beach_3.png" width="128" /> <img src="/assets/A large black elephant walking on the beach_4.png" width="128" /> <img src="/assets/A large black elephant walking on the beach_5.png" width="128" /> </p>
<p float="left"> <img src="/assets/Eiffel tower on a desert_0.png" width="128" /> <img src="/assets/Eiffel tower on a desert_1.png" width="128" /> <img src="/assets/Eiffel tower on a desert_2.png" width="128" /> <img src="/assets/Eiffel tower on a desert_3.png" width="128" /> <img src="/assets/Eiffel tower on a desert_4.png" width="128" /> <img src="/assets/Eiffel tower on a desert_5.png" width="128" /> </p>
<p float="left"> <img src="/assets/Eiffel tower on a mountain_0.png" width="128" /> <img src="/assets/Eiffel tower on a mountain_1.png" width="128" /> <img src="/assets/Eiffel tower on a mountain_2.png" width="128" /> <img src="/assets/Eiffel tower on a mountain_3.png" width="128" /> <img src="/assets/Eiffel tower on a mountain_4.png" width="128" /> <img src="/assets/Eiffel tower on a mountain_5.png" width="128" /> </p>

Quantitative Results

| Model | CC3M: CLIP-score (higher is better) | MS-COCO: FID-30K (lower is better) |
|---|---|---|
| VQGAN [2] | 0.20 | - |
| ImageBART [7] | 0.23 | - |
| DALL-E [1] | - | 27.5 |
| minDALL-E | 0.26 | 14.7 |
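
For intuition about the CLIP-score column, a common formulation scores an image against a prompt by the cosine similarity of their CLIP embeddings, and the same idea can order candidates for re-ranking. The sketch below assumes that formulation; the exact evaluation protocol behind the table and the internals of the repository's `clip_score` are not shown here. `cosine_similarity` and `rank_by_similarity` are hypothetical names, and the vectors are toy values rather than real CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(text_embedding, image_embeddings):
    """Return candidate indices ordered from best to worst match with the text."""
    scores = [cosine_similarity(text_embedding, emb) for emb in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    text = [1.0, 0.0]                              # toy text embedding
    images = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # toy image embeddings
    print(rank_by_similarity(text, images))
```

Indexing the candidate array with the returned order puts the best-matching image first, which mirrors the `images = images[rank]` step in the sampling example.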

Transfer Learning Examples

# unconditional image generation for ImageNet (256x256)
python examples/transfer_learning_ex.py -d=configs/transfer-imagenet-uncond-gen.yaml \
                                        -u=[MODEL_CKPT] \
                                        -r=[RESULT_PATH] \
                                        --n-gpus=[NUM_GPUS]

# class-conditional image generation for ImageNet (256x256)
python examples/transfer_learning_ex.py -d=configs/transfer-imagenet-clscond-gen.yaml \
                                        -u=[MODEL_CKPT] \
                                        -r=[RESULT_PATH] \
                                        --n-gpus=[NUM_GPUS]

| Model | Params | FID-50K (class-cond.) | FID-50K (uncond.) |
|---|---|---|---|
| VQ-GAN | 1.4B | 15.78 | - |
| ImageBART | 3.5B | 21.19 | - |
| minDALL-E | 1.3B | 15.55 | 37.58 |
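
The two transfer-learning commands differ only in their config file, so they can be assembled programmatically when scripting several runs. This is a minimal sketch: `build_transfer_command` is a hypothetical helper, and the checkpoint and result paths in the usage example are placeholders, not paths shipped with the repository. Only the flags shown in the commands above are used.

```python
import shlex


def build_transfer_command(config, model_ckpt, result_path, n_gpus):
    """Build the argument list for examples/transfer_learning_ex.py."""
    return [
        "python",
        "examples/transfer_learning_ex.py",
        f"-d={config}",        # config file for the downstream task
        f"-u={model_ckpt}",    # pretrained model checkpoint
        f"-r={result_path}",   # result path
        f"--n-gpus={n_gpus}",  # number of GPUs
    ]


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    cmd = build_transfer_command(
        config="configs/transfer-imagenet-uncond-gen.yaml",
        model_ckpt="path/to/model_ckpt",
        result_path="path/to/results",
        n_gpus=1,
    )
    print(shlex.join(cmd))
    # To actually launch the run: subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues when paths contain spaces.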

BibTeX

If you find this repository useful in your research, please cite:

@misc{kakaobrain2021minDALL-E,
  title         = {minDALL-E on Conceptual Captions},
  author        = {Saehoon Kim and Sanghun Cho and Chiheon Kim and Doyup Lee and Woonhyuk Baek},
  year          = {2021},
  howpublished  = {\url{https://github.com/kakaobrain/minDALL-E}},
}

References

Licenses

Contact

We hope that minDALL-E helps various projects in research-oriented institutes and startups. If you would like to collaborate with us or share feedback, please e-mail us at contact@kakaobrain.com.

Limitations

Although minDALL-E is trained on a relatively small dataset (14M image-text pairs), it may still be vulnerable to malicious prompt engineering that makes it generate socially unacceptable images. If you observe such images, please report the "prompt" and the "generated images" to us.