Text2Image-for-Detection

Official Implementation for "DALL-E for Detection: Language-driven Compositional Image Synthesis for Object Detection"

Extended version: "Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation"

<div align="center"> <span><a href="https://gyhandy.github.io/"><strong>Yunhao Ge*</strong></a>,&nbsp;&nbsp;</span> <span><a href="https://cnut1648.github.io/"><strong>Jiashu Xu*</strong></a>,&nbsp;&nbsp;</span> <span><a href="https://scholar.google.com/citations?user=IhqFMeUAAAAJ&hl=en"><strong>Brian Nlong Zhao</strong></a>,&nbsp;&nbsp;</span> <span><a href="https://neelj.com/"><strong>Neel Joshi</strong></a>,&nbsp;&nbsp;</span> <span><a href="http://ilab.usc.edu/itti/"><strong>Laurent Itti</strong></a>,&nbsp;&nbsp;</span> <span><a href="https://vibhav-vineet.github.io/"><strong>Vibhav Vineet</strong></a></span> </div>

Contact: yunhaoge@usc.edu; jxu1@g.harvard.edu

Install

This project is developed with Python 3.10 and PyTorch 1.10.1 under CUDA 11.3. We recommend using the same versions of Python and PyTorch.

pip install -r requirements.txt

Our Method

<p align="center"> <img src="./assets/overview.png" alt="Arch"/> </p>

We propose a novel approach for generating diverse, large-scale pseudo-labeled training datasets tailored to enhance downstream object detection and segmentation models. We leverage text-to-image models (e.g., your favourite diffusion model) to independently generate foregrounds and backgrounds. We then composite the foregrounds onto the backgrounds; because we control the compositing, we obtain the bounding boxes or segmentation masks of the foregrounds for free, which serve as pseudo-labels for the downstream models.

Specifically, we (1) generate captions for foregrounds and backgrounds, (2) synthesize the corresponding images with a text-to-image model such as Stable Diffusion 2, (3) cut-paste the foregrounds onto the backgrounds to obtain pseudo-labels, and (4) train detection or segmentation models on the resulting dataset.

Usage

Data

In this project we use Pascal VOC and MS COCO in a low-resource regime.

You should first download the original datasets, e.g., Pascal VOC. Note that for Pascal we use the train & val splits from the nsrom repo. The data structure will be

data
├── COCO2017 
└── voc2012
    ├── labels.txt
    ├── train_aug.txt
    ├── ...
    └── VOC2012
        ├── Annotations
        ├── ImageSets
        ...

We provide k-shot selections under data/voc2012: 1-shot and 10-shot.

Diffusion Generation

The code to generate foregrounds and backgrounds is in the t2i_generate/ folder. First, generate captions for the foregrounds and backgrounds. Then use Stable Diffusion 2 to generate images via python stable_diffusion2.py.
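
A minimal sketch of this step (not the repo's stable_diffusion2.py) using Hugging Face diffusers; the model id and the two captions below are illustrative placeholders:

# Generate a foreground and a background image from captions with diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Hypothetical captions; the real ones come from the caption-generation step.
foreground_prompt = "a photo of a single dog, centered, plain white background"
background_prompt = "an empty city park, no animals, photorealistic"

foreground = pipe(foreground_prompt, num_inference_steps=50).images[0]
background = pipe(background_prompt, num_inference_steps=50).images[0]
foreground.save("fg_dog.png")
background.save("bg_park.png")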

Cut Paste

The code to paste foregrounds onto backgrounds is in the cutpaste/ folder. We use Hydra + PyTorch Lightning to run different variants. Example config files are in the configs/ folder, and we include a test dataset in the data/test_data/ folder. For example, you can use python paste.py exp=<exp> to launch the script, where <exp> is one of the experiment configs under configs/.
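
A minimal sketch of the core cut-paste idea (not the repo's paste.py), assuming the foreground has already been segmented into an RGBA image: paste it onto a background at a random location and read the bounding box directly off the paste coordinates. File names are placeholders.

import random
from PIL import Image

def paste_with_bbox(fg_path, bg_path, scale=0.4):
    fg = Image.open(fg_path).convert("RGBA")   # foreground with alpha mask
    bg = Image.open(bg_path).convert("RGB")

    # Resize the foreground relative to the background width.
    w = int(bg.width * scale)
    h = int(fg.height * w / fg.width)
    fg = fg.resize((w, h))

    # Random location that keeps the foreground fully inside the background.
    x = random.randint(0, bg.width - w)
    y = random.randint(0, bg.height - h)
    bg.paste(fg, (x, y), mask=fg)              # alpha channel acts as the paste mask

    bbox = (x, y, x + w, y + h)                # pseudo-label box around the pasted region
    return bg, bbox

image, bbox = paste_with_bbox("fg_dog.png", "bg_park.png")
image.save("composite.png")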

You can also use viz/ to visualize the generated datasets. Simply run

python viz/viz.py <cut paste dataset dir>

This will generate 30 randomly sampled annotated images in the viz/ folder.
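
If you only want a quick look at a single pseudo-label, a hypothetical one-off alternative (not viz/viz.py) is to draw the box with PIL:

from PIL import Image, ImageDraw

image = Image.open("composite.png")
bbox = (120, 80, 420, 360)                    # hypothetical pseudo-label box
draw = ImageDraw.Draw(image)
draw.rectangle(bbox, outline="red", width=3)  # draw the box
draw.text((bbox[0], bbox[1] - 12), "dog", fill="red")
image.save("composite_annotated.png")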

Readers are welcome to check the config files for more parameters to control the process. Some notable mentions:

Model Training

Once the dataset is created, you can train an object detection model using detection/ and an instance segmentation model using instance_seg/. Both are based on the battle-tested detectron2.
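
Under the hood this follows detectron2's standard workflow. A minimal sketch (not the repo's detection/train.py; the dataset name and paths below are hypothetical) of registering a synthetic COCO-format dataset and fine-tuning a Faster R-CNN:

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the cut-paste dataset (assumes COCO-style annotations were exported).
register_coco_instances("syn_train", {}, "cutpaste_out/annotations.json", "cutpaste_out/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("syn_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 20   # Pascal VOC has 20 classes
cfg.SOLVER.IMS_PER_BATCH = 32
cfg.SOLVER.BASE_LR = 0.01

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()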

For example, on VOC 2012 with 4 GPUs, you can run

# object detection
# -s syn: train on synthetic data; -t voc_val: test on the VOC val set
# --test_dir: e.g. data/voc2012/VOC2012 (the val set is located in this folder)
# -g 4: use 4 GPUs on 1 machine; the remaining flags are hyperparameters
python detection/train.py -s syn \
    --syn_dir <cut paste dataset dir> \
    -t voc_val \
    --test_dir <voc dir> \
    -g 4 \
    --freeze --data_aug --bsz 32 --epoch 200 --resnet 50 --lr 0.01

For instance segmentation, use instance_seg/seg.py instead of detection/train.py. The flags are the same.

For inference, simply pass the additional flag --eval_checkpoint <your path to the ckpt>.
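
If you prefer to run a trained checkpoint directly in Python, here is a hedged sketch using detectron2's DefaultPredictor (the checkpoint and image paths are placeholders):

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 20
cfg.MODEL.WEIGHTS = "output/model_final.pth"        # <your path to the ckpt>
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)

image = cv2.imread("path/to/some_image.jpg")        # placeholder test image
outputs = predictor(image)
print(outputs["instances"].pred_boxes)
print(outputs["instances"].pred_classes)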

Results

Our method yields significant improvements over the baselines on Pascal VOC and MS COCO, especially in the low-resource regime. We refer readers to the paper for details.

<p align="center"> <img src="./assets/results.png" alt="Results"/> </p>