


Official Implementation for "DALL-E for Detection: Language-driven Compositional Image Synthesis for Object Detection"

Extended version: "Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation"

<div align="center"> <span><a href="https://gyhandy.github.io/"><strong>Yunhao Ge*</strong></a>,&nbsp;&nbsp;</span> <span><a href="https://cnut1648.github.io/"><strong>Jiashu Xu*</strong></a>,&nbsp;&nbsp;</span> <span><a href="https://scholar.google.com/citations?user=IhqFMeUAAAAJ&hl=en"><strong>Brian Nlong Zhao</strong></a>,&nbsp;&nbsp;</span> <span><a href="https://neelj.com/"><strong>Neel Joshi</strong></a>,&nbsp;&nbsp;</span> <span><a href="http://ilab.usc.edu/itti/"><strong>Laurent Itti</strong></a>,&nbsp;&nbsp;</span> <span><a href="https://vibhav-vineet.github.io/"><strong>Vibhav Vineet</strong></a></span> </div>

Contact: yunhaoge@usc.edu; jxu1@g.harvard.edu


This project was developed with Python 3.10 and PyTorch 1.10.1 under CUDA 11.3. We recommend using the same versions of Python and PyTorch.

pip install -r requirements.txt

Our method

<p align="center"> <img src="./assets/overview.png" alt="Arch"/> </p>

We propose a novel approach for generating diverse, large-scale pseudo-labeled training datasets tailored to downstream object detection and segmentation models. We use text-to-image models (e.g. your favourite diffusion model) to generate foregrounds and backgrounds independently. We then composite the foregrounds onto the backgrounds; this step also yields the bounding boxes or segmentation masks of the foregrounds, which serve as labels for the downstream models.
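
To make the compositing step concrete, here is a minimal NumPy sketch of pasting one foreground onto a background and reading the labels off the paste location. It is an illustration only, not the code in cutpaste/: the function name `paste_foreground` and its arguments are hypothetical, and it assumes the foreground fits entirely inside the background and has a non-empty alpha mask.

```python
import numpy as np

def paste_foreground(background, foreground, alpha, top, left):
    """Alpha-blend `foreground` onto a copy of `background` at (top, left).

    background: (H, W, 3) uint8, foreground: (h, w, 3) uint8,
    alpha: (h, w) float in [0, 1]. Hypothetical helper for illustration.
    Returns the composite image, a full-size boolean mask, and the tight
    bounding box (x_min, y_min, x_max, y_max) of the pasted object.
    """
    out = background.astype(np.float32)  # astype returns a new array
    h, w = alpha.shape
    region = out[top:top + h, left:left + w]  # view into `out`
    a = alpha[..., None].astype(np.float32)
    region[:] = a * foreground + (1.0 - a) * region

    # The segmentation mask is the thresholded alpha at the paste location.
    mask = np.zeros(background.shape[:2], dtype=bool)
    mask[top:top + h, left:left + w] = alpha > 0.5

    # The bounding box is the extent of the mask.
    ys, xs = np.nonzero(mask)
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return out.astype(np.uint8), mask, bbox
```

Because the paste position and the alpha mask are known, no human annotation is needed to obtain the box or the mask.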




In this project we use Pascal VOC in a low-resource regime.

You should first download the original dataset, e.g. Pascal VOC. Note that for Pascal VOC we use the train and val sets from the nsrom repo. The data structure will be:

├── COCO2017 
└── voc2012
    ├── labels.txt
    ├── train_aug.txt
    ├── ...
    └── VOC2012
        ├── Annotations
        ├── ImageSets

We provide k-shot selections in data/voc2012: 1-shot and 10-shot.
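
Before running anything, it can help to verify that the data folder matches the layout listed above. The following is a small sketch of such a check. The helper `missing_entries` is hypothetical and not part of this repo, and the expected paths are only the entries named in the tree (elided entries are not checked).

```python
from pathlib import Path

# Entries taken from the directory tree above, relative to the data root.
EXPECTED = [
    "COCO2017",
    "voc2012/labels.txt",
    "voc2012/train_aug.txt",
    "voc2012/VOC2012/Annotations",
    "voc2012/VOC2012/ImageSets",
]

def missing_entries(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    import sys
    missing = missing_entries(sys.argv[1] if len(sys.argv) > 1 else "data")
    print("layout OK" if not missing else f"missing: {missing}")
```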

Diffusion Generation

The code to generate foregrounds and backgrounds is in the t2i_generate/ folder. First, generate captions for the foregrounds and backgrounds. Then use Stable Diffusion 2 to generate images via python stable_diffusion2.py.
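
As a rough illustration of the captioning step, the sketch below builds foreground captions from class names and background captions from scene descriptions using fixed templates. This is an assumption about one simple way to produce captions, not the procedure implemented in t2i_generate/. The templates, the context list, and all function names are hypothetical.

```python
def with_article(noun):
    """Prefix `noun` with 'a' or 'an' based on its first letter (rough heuristic)."""
    return ("an " if noun[0].lower() in "aeiou" else "a ") + noun

# Hypothetical templates and contexts, chosen only for illustration.
FOREGROUND_TEMPLATES = ("a photo of {}", "a realistic photo of {}")
BACKGROUND_CONTEXTS = ("an empty living room", "a grass field", "a city street")

def foreground_captions(class_names):
    """One caption per (class, template) pair, describing a single object."""
    return [t.format(with_article(c)) for c in class_names for t in FOREGROUND_TEMPLATES]

def background_captions():
    """One caption per context, describing a scene with no target objects."""
    return ["a photo of {}".format(ctx) for ctx in BACKGROUND_CONTEXTS]

if __name__ == "__main__":
    for caption in foreground_captions(["dog", "aeroplane"]) + background_captions():
        print(caption)
```

Keeping foreground and background captions separate mirrors the independent generation of the two image types described earlier.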

Cut Paste

The code to paste foregrounds onto backgrounds is in the cutpaste/ folder. We use Hydra and PyTorch Lightning to run different variants. Example config files are in the configs/ folder, and we include a test dataset in the data/test_data/ folder. For example, you can launch the script with python paste.py exp=<exp>, where <exp> is one of the experiment configs in configs/.

You can also use viz/ to visualize the generated datasets. Simply run

python viz/viz.py <cut paste dataset dir>

This will generate 30 randomly sampled annotated images in viz/ folder.
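
For readers who want to inspect a dataset without the script, the sampling idea can be sketched as follows. This is not the code in viz/viz.py. The helper `sample_images`, the file extensions, and the recursive search are assumptions made for illustration.

```python
import random
from pathlib import Path

def sample_images(dataset_dir, k=30, seed=None, exts=(".jpg", ".jpeg", ".png")):
    """Pick up to `k` image files at random from `dataset_dir` (searched recursively).

    Hypothetical helper: the real dataset layout may differ.
    """
    files = sorted(p for p in Path(dataset_dir).rglob("*") if p.suffix.lower() in exts)
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(files, min(k, len(files)))
```

Sorting before sampling keeps the result independent of filesystem ordering when a seed is given.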

Readers are welcome to check the config files for more parameters that control the process.

Model Training

Once the dataset is created, you can train an object detection model using detection/ and an instance segmentation model using instance_seg/. Both are based on the battle-tested detectron2.

For example, on VOC 2012 with 4 GPUs, you can run

# object detection
#   -s syn      train on synthetic data
#   -t voc_val  test on VOC val
#   --test_dir  e.g. data/voc2012/VOC2012; the val set is looked up in this folder
#   -g 4        use 4 GPUs on 1 machine
#   the last line sets the hyperparameters
python detection/train.py -s syn \
    --syn_dir <cut paste dataset dir> \
    -t voc_val \
    --test_dir <voc dir> \
    -g 4 \
    --freeze --data_aug --bsz 32 --epoch 200 --resnet 50 --lr 0.01

For instance segmentation, use instance_seg/seg.py instead of detection/train.py. The flags are the same.

For inference, add the flag --eval_checkpoint <your path to the ckpt>.


Our method yields a significant improvement over the baseline on Pascal VOC and MS COCO, especially in the low-resource regime. See the paper for details.

<p align="center"> <img src="./assets/results.png" alt="Results"/> </p>