Disentangled Pre-training for Human-Object Interaction Detection

Zhuolong Li<sup>*</sup>, Xingao Li<sup>*</sup>, Changxing Ding, Xiangmin Xu

The paper has been accepted to CVPR 2024.

<div align="center"> <img src="paper_images/overview_dphoi.png" width="900px" /> </div>

Preparation

Environment

  1. Install the dependencies.
pip install -r requirements.txt
  2. Clone and build CLIP.
git clone https://github.com/openai/CLIP.git && cd CLIP && python setup.py develop && cd ..
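
To quickly verify that the environment and the locally built CLIP work, you can run an optional sanity check such as the one below (the `RN50` model name is just an example; any CLIP variant will do):

```python
# Optional sanity check: PyTorch and the locally built CLIP should both import and run.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # downloads the CLIP weights on first use
tokens = clip.tokenize(["a person riding a bicycle"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # RN50 produces 1024-dimensional text embeddings
```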

Dataset

  1. HAA500 dataset

  Download the Haa500_v1_1 dataset and unzip it to the DP-HOI/data/action folder.

  Run pre_haa500.py (an illustrative frame-extraction sketch is shown after this dataset list).

python ./pre_datasets/pre_haa500.py
  2. Kinetics700 dataset

  Download the Kinetics700 dataset and unzip it to the DP-HOI/data/action folder.

  Run pre_kinetics700.py.

python ./pre_datasets/pre_kinetics700.py
  3. Flickr30k dataset

  Download the Flickr30k dataset from the following URL and unzip it directly into the DP-HOI/data/caption folder.

  4. VG dataset

  Download the VG dataset from the following URL and unzip it directly into the DP-HOI/data/caption folder.

  Download and unzip the processed annotations.zip to the DP-HOI/data/caption/annotations folder.

  5. Objects365 dataset

  Download the Objects365 dataset from the following URL and unzip it directly into the DP-HOI/data/detection folder.

  6. COCO dataset

  Download the COCO dataset from the following URL and unzip it directly into the DP-HOI/data/detection folder.

  Download and move the processed coco_objects365_200k.json to the DP-HOI/data/detection/annotations folder.
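
The pre_haa500.py and pre_kinetics700.py scripts above are the authoritative preprocessing for the two video datasets. Purely as an illustration of the kind of frame extraction such preprocessing involves (this is not the actual script, and the sampling stride and paths are assumptions), a minimal OpenCV sketch:

```python
# Illustration only: sample frames from a video at a fixed stride with OpenCV.
import os
import cv2

def extract_frames(video_path, out_dir, stride=30):
    """Save every `stride`-th frame of `video_path` as a JPEG in `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Hypothetical usage:
# extract_frames("data/action/haa500/videos/clip.mp4", "data/action/haa500/images/clip")
```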

When you have completed the above steps, the pre-training data should be organized as follows:

DP-HOI
 |─ data
 |   |─ action
 |   |   |─ haa500
 |   |   |   |─ annotations
 |   |   |   |   └─ train_haa500.json
 |   |   |   |─ images
 |   |   |   └─ videos
 |   |   └─ kinetics-700
 |   |       |─ annotations
 |   |       |   └─ train_kinetics700.json
 |   |       |─ images
 |   |       └─ videos
 |   |─ caption
 |   |   |─ annotations
 |   |   |   |─ Flickr30k_VG_cluster_dphoi.json
 |   |   |   |─ triplets_category.txt
 |   |   |   └─ triplets_features.pth
 |   |   |─ Flickr30k
 |   |   |   └─ images
 |   |   └─ VG
 |   |       └─ images
 |   └─ detection
 |       |─ annotations
 |       |   └─ coco_objects365_200k.json
 |       |─ coco
 |       |   |─ images
 |       |   └─ annotations
 |       |       └─ instances_val2017.json
 |       └─ objects365
 |           └─ images
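
Optionally, you can confirm that the key files from the layout above are in place with a short script run from the DP-HOI root (the paths are taken directly from the tree above; adjust them if you store the data elsewhere):

```python
# Optional: verify that the key pre-training files and folders described above exist.
import os

EXPECTED = [
    "data/action/haa500/annotations/train_haa500.json",
    "data/action/kinetics-700/annotations/train_kinetics700.json",
    "data/caption/annotations/Flickr30k_VG_cluster_dphoi.json",
    "data/caption/annotations/triplets_category.txt",
    "data/caption/annotations/triplets_features.pth",
    "data/caption/Flickr30k/images",
    "data/caption/VG/images",
    "data/detection/annotations/coco_objects365_200k.json",
    "data/detection/coco/images",
    "data/detection/coco/annotations/instances_val2017.json",
    "data/detection/objects365/images",
]

missing = [p for p in EXPECTED if not os.path.exists(p)]
print("All expected paths found." if not missing else f"Missing paths: {missing}")
```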

Initial parameters

To speed up pre-training, you can initialize the model with DETR's pre-trained weights. Download the pretrained DETR detector with a ResNet-50 backbone and put it in the params directory.
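
If you want to confirm that the downloaded checkpoint is readable before launching pre-training, a quick optional check is shown below (the filename follows the official DETR release and is an assumption here; use whatever name your downloaded file has):

```python
# Optional: load the DETR ResNet-50 checkpoint and print a few parameter names/shapes.
import torch

ckpt = torch.load("params/detr-r50-e632da11.pth", map_location="cpu")  # filename is an assumption
state_dict = ckpt.get("model", ckpt)  # official DETR checkpoints keep weights under the "model" key
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
print(f"Total parameter tensors: {len(state_dict)}")
```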

Pre-training

After the preparation, you can start pre-training with the following command.

sh ./scripts/pretrain/train.sh

Fine-tuning

After pre-training, you can start fine-tuning with the following commands. An example of fine-tuning HOICLIP is provided below.

python ./tools/convert_parameters.py \
        --finetune_model hoiclip \
        --load_path params/dphoi_res50_3layers.pth \
        --save_path params/dphoi_res50_hico_hoiclip.pth \
        --dataset hico \
        --num_queries 64 
sh ./scripts/finetune/hoiclip/train_hico.sh
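
Here, convert_parameters.py adapts the DP-HOI pre-trained weights to the downstream detector; the script shipped in this repository is the authoritative version. Purely to illustrate the general idea of such a conversion step (the key names and query-embedding handling below are hypothetical, not the actual logic), a minimal sketch:

```python
# Illustration only: the general shape of a checkpoint-conversion step.
# The real logic lives in tools/convert_parameters.py; key names here are hypothetical.
import argparse
import torch

def convert(load_path, save_path, num_queries):
    ckpt = torch.load(load_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)

    new_state = {}
    for name, tensor in state_dict.items():
        # Hypothetical example: trim learned query embeddings to the number of
        # queries used by the downstream HOI detector.
        if name.endswith("query_embed.weight"):
            tensor = tensor[:num_queries]
        new_state[name] = tensor

    torch.save({"model": new_state}, save_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--load_path", required=True)
    parser.add_argument("--save_path", required=True)
    parser.add_argument("--num_queries", type=int, default=64)
    args = parser.parse_args()
    convert(args.load_path, args.save_path, args.num_queries)
```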

Pre-trained model

You can also directly download the pre-trained DP-HOI model with a ResNet-50 backbone.

Results

HICO-DET

| Method | Full (D) | Rare (D) | Non-rare (D) | Model | Config |
| --- | --- | --- | --- | --- | --- |
| ours (UPT) | 33.36 | 28.74 | 34.75 | model | config |
| ours (PViC) | 35.77 | 32.26 | 36.81 | model | config |
| ours (CDN-S<sup>†</sup>) | 35.00 | 32.38 | 35.78 | model | config |
| ours (CDN-S<sup>†</sup>+CCS<sup>*</sup>) | 35.38 | 34.61 | 35.61 | model | config |
| ours (HOICLIP) | 36.56 | 34.36 | 37.22 | model | config |

D: Default, †: DN strategy from DN-DETR, *: data augmentation strategy from DOQ. The weights fine-tuned on HICO-DET for two-stage methods (e.g., UPT and PViC) can be downloaded here.

V-COCO

| Method | Scenario 1 | Model | Config |
| --- | --- | --- | --- |
| ours (GEN<sub>s</sub>) | 66.6 | model | config |

Zero-shot HOI Detection Results

| Method | Type | Unseen | Seen | Full | Model | Config |
| --- | --- | --- | --- | --- | --- | --- |
| ours (HOICLIP) | UV | 26.30 | 34.49 | 33.34 | model | config |
| ours (HOICLIP) | RF-UC | 30.49 | 36.17 | 35.03 | model | config |
| ours (HOICLIP) | NF-UC | 28.87 | 29.98 | 29.76 | model | config |

Citation

Please consider citing our paper if it helps your research.

@inproceedings{disentangled_cvpr2024,
  author    = {Li, Zhuolong and Li, Xingao and Ding, Changxing and Xu, Xiangmin},
  title     = {Disentangled Pre-training for Human-Object Interaction Detection},
  booktitle = {CVPR},
  year      = {2024},
}

Acknowledgement

Our code is built on DETR, DN-DETR, and CLIP. We thank the authors for their contributions.