Home

Awesome

[ECCV'24 Oral] CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

[Paper] [Poster] [Video Presentation]

📌 This is the official PyTorch implementation of CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection.

CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection<br> Wuyang Li<sup>1</sup>, Xinyu Liu<sup>1</sup>, Jiayi Ma<sup>2</sup>, Yixuan Yuan<sup>1</sup><br><sup>1</sup> The Chinese University of Hong Kong; <sup>2</sup> Wuhan University

<div align="center"> <img width="100%" alt="CLIFF overview" src="assets/cliff_main.png"/> </div>

Contact: wymanbest@outlook.com

📒 News

✨ Highlight

CLIFF is a probabilistic pipeline that models the distribution transition among the object, CLIP image, and text subspaces with continual diffusion. Our contributions can be divided into the following aspects:

πŸ› οΈ Installation

# clone the repo
git clone https://github.com/CUHK-AIM-Group/CLIFF.git

# conda envs
conda create -n cliff python=3.9 -y
conda activate cliff

# [Optional] check your CUDA version and modify the paths accordingly
export CUDA_HOME=/usr/local/cuda-11.3
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

# install pre-built detectron2
python -m pip install detectron2 -f \
  https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html

# install dependencies
pip install -r requirements.txt

📂 Dataset Preparation

Please follow the steps in DATASETS.md to prepare the dataset.

Then, set the dataset root (`_root=$Your dataset root`) in `ovd/datasets/coco_zeroshot.py`, and update the paths in all YAML files (`configs/xxx.yaml`) accordingly.

🚀 Minimum Viable Product Version: Classification with CLIFF

We release an MVP version of CLIFF for CIFAR-10 classification in the `cifar10` folder, which is much more user-friendly and cute. The general idea is to generate CLIP text embeddings with the same diffusor used in CLIFF, then measure the Euclidean distance between the generated embeddings and the CLIP embedding of the image to make the class decision. You can directly transfer this simple version to your own project. The code is based on https://github.com/kuangliu/pytorch-cifar.
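The distance-based class decision described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the exact CLIFF code; the function name and tensor shapes are hypothetical:

```python
import torch

def classify_by_distance(image_feats: torch.Tensor, class_embeds: torch.Tensor) -> torch.Tensor:
    """Assign each image to the class whose generated embedding is nearest.

    image_feats:  [B, D] image features from the backbone (e.g. ResNet-18)
    class_embeds: [C, D] per-class embeddings produced by the diffusor
    Returns:      [B] predicted class indices
    """
    dists = torch.cdist(image_feats, class_embeds)  # [B, C] pairwise Euclidean distances
    return dists.argmin(dim=1)                      # smallest distance wins
```

For CIFAR-10, `class_embeds` would hold ten generated text embeddings, one per class.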

| Feature Extractor | Baseline | Ours   |
| ----------------- | -------- | ------ |
| ResNet-18         | 93.02%   | 95.00% |

Follow these steps to train and evaluate the MVP code:

cd cifar10

python train_net_diffusion.py \
    --lr 0.1 \
    --dir ./experiments

📊 Evaluation: Detection with CLIFF

It's worth noting that there may be around a 1.0 $\text{mAP}_n$ (IoU@50) difference between evaluations with different random seeds. This is because the noise in reverse diffusion is randomly sampled and slightly affects class decision-making. The phenomenon is analogous to generative tasks, where different seeds produce different outputs. The model link for OVD COCO can be found below:

<table> <tr> <th>SEED </th> <th>mAP<sub>n</sub></th> <th>mAP<sub>b</sub></th> <th>mAP</th> <th>Link</th> </tr> <tr> <td>9</td> <td>43.07</td> <td>54.53</td> <td>51.54</td> <td rowspan="3"><a href="https://1drv.ms/u/s!AphTD4EdTQo1hPZMDDcyvOwvhJWkrQ?e=CBpvqE">Onedrive</a></td> </tr> <tr> <td>99</td> <td>43.36</td> <td>54.39</td> <td>51.51</td> </tr> <tr> <td>999</td> <td>43.44</td> <td>54.44</td> <td>51.51</td> </tr> </table>
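The role of `SEED` can be illustrated in isolation: fixing the global seed fixes the Gaussian noise drawn during reverse diffusion, so evaluations with the same seed are reproducible. This is a generic PyTorch illustration, not CLIFF code:

```python
import torch

# Same seed -> identical noise draw, hence identical class decisions downstream.
torch.manual_seed(999)
noise_a = torch.randn(3)

torch.manual_seed(999)
noise_b = torch.randn(3)

# A different seed draws different noise, which is what shifts mAP slightly.
torch.manual_seed(9)
noise_c = torch.randn(3)

print(torch.equal(noise_a, noise_b))  # True
```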

To evaluate the model with 4 GPUs, use the following commands. You can change CUDA_VISIBLE_DEVICES and num-gpus to use a different number of GPUs:

CUDA_VISIBLE_DEVICES=1,2,3,4 python train_net_diffusion.py \
    --num-gpus 4 \
    --config-file /path/to/config/name.yaml \
    --eval-only \
    MODEL.WEIGHTS /path/to/weight.pth \
    SEED 999

For example:

CUDA_VISIBLE_DEVICES=1,2 python train_net_diffusion.py \
    --num-gpus 2 \
    --config-file configs/CLIFF_COCO_RCNN-C4_obj2img2txt_stage2.yaml \
    --eval-only \
    MODEL.WEIGHTS ./cliff_model_coco_ovd.pth \
    SEED 999

πŸ‹οΈ Train: Detection with CLIFF

Since I recently changed my working position, I no longer have access to the original GPU resources. We are currently double-checking the training process on other machines after cleaning up the code, and we will release the verified training code as soon as possible. In the meantime, we provide the (unverified) training script train.sh.

πŸ™ Acknowledgement

We greatly appreciate the tremendous efforts behind the following projects!

📋 TODO List

📚 Citation

If you find our work helpful for your project, please consider citing it:

@inproceedings{Li2024cliff,
    title={CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection},
    author={Li, Wuyang and Liu, Xinyu and Ma, Jiayi and Yuan, Yixuan},
    booktitle={ECCV},
    year={2024}
}