# Disentangled Pre-training for Human-Object Interaction Detection
Zhuolong Li<sup>*</sup>, Xingao Li<sup>*</sup>, Changxing Ding, Xiangmin Xu
The paper has been accepted by CVPR 2024.
<div align="center"> <img src="paper_images/overview_dphoi.png" width="900px" /> </div>

## Preparation
### Environment

- Install the dependencies.

  ```bash
  pip install -r requirements.txt
  ```

- Clone and build CLIP.

  ```bash
  git clone https://github.com/openai/CLIP.git && cd CLIP && python setup.py develop && cd ..
  ```
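A quick way to confirm the CLIP build succeeded is to import it and list its available models. The snippet below is only a sanity check and is not part of this repository.

```python
# Sanity check (not part of the repository): confirm the locally built CLIP
# package imports and can list its released model names.
import torch
import clip

print(clip.available_models())  # e.g. ['RN50', 'RN101', ..., 'ViT-B/32', ...]

# Optionally load a small model to verify that weights can be downloaded.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
print("CLIP loaded on", device)
```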
### Dataset

- HAA500 dataset

  Download the Haa500_v1_1 dataset and unzip it to the `DP-HOI/data/action` folder. Then run `pre_haa500.py`:

  ```bash
  python ./pre_datasets/pre_haa500.py
  ```

- Kinetics700 dataset

  Download the Kinetics700 dataset and unzip it to the `DP-HOI/data/action` folder. Then run `pre_kinetics700.py`:

  ```bash
  python ./pre_datasets/pre_kinetics700.py
  ```

- Flickr30k dataset

  Download the Flickr30k dataset and unzip it directly to the `DP-HOI/data/caption` folder.

- VG dataset

  Download the VG dataset and unzip it directly to the `DP-HOI/data/caption` folder. Then download the processed `annotations.zip` and unzip it to the `DP-HOI/data/caption/annotations` folder.

- Objects365 dataset

  Download the Objects365 dataset and unzip it directly to the `DP-HOI/data/detection` folder.

- COCO dataset

  Download the COCO dataset and unzip it directly to the `DP-HOI/data/detection` folder. Then download the processed `coco_objects365_200k.json` and move it to the `DP-HOI/data/detection/annotations` folder.
When you have completed the above steps, the pre-training dataset structure is:

```
DP-HOI
 |─ data
 |   |─ action
 |   |   |─ haa500
 |   |   |   |─ annotations
 |   |   |   |   └─ train_haa500.json
 |   |   |   |─ images
 |   |   |   └─ videos
 |   |   └─ kinetics-700
 |   |       |─ annotations
 |   |       |   └─ train_kinetics700.json
 |   |       |─ images
 |   |       └─ videos
 |   |─ caption
 |   |   |─ annotations
 |   |   |   |─ Flickr30k_VG_cluster_dphoi.json
 |   |   |   |─ triplets_category.txt
 |   |   |   └─ triplets_features.pth
 |   |   |─ Flickr30k
 |   |   |   └─ images
 |   |   └─ VG
 |   |       └─ images
 |   └─ detection
 |       |─ annotations
 |       |   └─ coco_objects365_200k.json
 |       |─ coco
 |       |   |─ images
 |       |   └─ annotations
 |       |       └─ instances_val2017.json
 |       └─ objects365
 |           └─ images
```
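Before launching pre-training, it can help to confirm this layout is in place. The helper below is an illustrative sketch (not part of the repository); it only checks that the directories and annotation files listed above exist under `DP-HOI/data`.

```python
# Illustrative layout check (not part of the repository): verify that the
# directories and annotation files listed above exist under DP-HOI/data.
from pathlib import Path

EXPECTED = [
    "action/haa500/annotations/train_haa500.json",
    "action/kinetics-700/annotations/train_kinetics700.json",
    "caption/annotations/Flickr30k_VG_cluster_dphoi.json",
    "caption/annotations/triplets_category.txt",
    "caption/annotations/triplets_features.pth",
    "caption/Flickr30k/images",
    "caption/VG/images",
    "detection/annotations/coco_objects365_200k.json",
    "detection/coco/images",
    "detection/coco/annotations/instances_val2017.json",
    "detection/objects365/images",
]

def check_layout(data_root: str = "data") -> bool:
    root = Path(data_root)
    missing = [p for p in EXPECTED if not (root / p).exists()]
    for p in missing:
        print(f"missing: {root / p}")
    return not missing

if __name__ == "__main__":
    print("dataset layout OK" if check_layout() else "dataset layout incomplete")
```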
### Initial parameters

To speed up the pre-training process, consider initializing from DETR's pre-trained weights. Download the pre-trained DETR detector weights for ResNet-50 and put them in the `params` directory.
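For reference, the ResNet-50 checkpoint released by the official DETR repository can be fetched as shown below. This is a minimal sketch: the URL is the standard facebookresearch/detr release, but the exact filename that the pre-training script expects in `params` is an assumption.

```python
# Minimal sketch: download the official DETR ResNet-50 checkpoint into params/.
# The URL is the standard facebookresearch/detr release; the filename expected
# by the pre-training script is an assumption and may need to be adjusted.
from pathlib import Path

import torch

URL = "https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth"

Path("params").mkdir(exist_ok=True)
checkpoint = torch.hub.load_state_dict_from_url(URL, model_dir="params", map_location="cpu")
print("checkpoint entries:", list(checkpoint.keys()))  # typically includes 'model'
```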
## Pre-training

After the preparation, you can start pre-training with the following command:

```bash
sh ./scripts/pretrain/train.sh
```
## Fine-tuning

After pre-training, you can start fine-tuning with the following commands. An example of fine-tuning with HOICLIP is provided below. First, convert the pre-trained parameters:

```bash
python ./tools/convert_parameters.py \
        --finetune_model hoiclip \
        --load_path params/dphoi_res50_3layers.pth \
        --save_path params/dphoi_res50_hico_hoiclip.pth \
        --dataset hico \
        --num_queries 64
```

Then launch the fine-tuning script:

```bash
sh ./scripts/finetune/hoiclip/train_hico.sh
```
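The conversion step above adapts the DP-HOI pre-trained weights to the layout expected by the downstream HOI detector. The sketch below only illustrates the general idea (load a checkpoint, drop pre-training-only heads, resize the query embedding to `--num_queries`, save); it is not the actual `tools/convert_parameters.py`, and all key names in it are hypothetical.

```python
# Illustrative sketch only (NOT the actual tools/convert_parameters.py):
# load a pre-trained checkpoint, drop heads that are not transferred,
# resize the query embedding to num_queries, and save the result.
# All key names below are hypothetical and for illustration only.
import torch

def convert(load_path: str, save_path: str, num_queries: int = 64) -> None:
    ckpt = torch.load(load_path, map_location="cpu")
    state = ckpt.get("model", ckpt)

    converted = {}
    for name, weight in state.items():
        # Hypothetical rule: skip pre-training-only branches.
        if name.startswith(("action_head.", "caption_head.")):
            continue
        # Hypothetical rule: truncate the query embedding to num_queries rows.
        if name == "query_embed.weight":
            weight = weight[:num_queries]
        converted[name] = weight

    torch.save({"model": converted}, save_path)

if __name__ == "__main__":
    convert("params/dphoi_res50_3layers.pth", "params/dphoi_res50_hico_hoiclip.pth")
```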
## Pre-trained model

You can also directly download the pre-trained DP-HOI model for ResNet-50.
## Results

### HICO-DET

| Method | Full (D) | Rare (D) | Non-rare (D) | Model | Config |
| --- | --- | --- | --- | --- | --- |
| ours (UPT) | 33.36 | 28.74 | 34.75 | model | config |
| ours (PViC) | 35.77 | 32.26 | 36.81 | model | config |
| ours (CDN-S<sup>†</sup>) | 35.00 | 32.38 | 35.78 | model | config |
| ours (CDN-S<sup>†</sup>+CCS<sup>*</sup>) | 35.38 | 34.61 | 35.61 | model | config |
| ours (HOICLIP) | 36.56 | 34.36 | 37.22 | model | config |

D: Default setting. †: DN strategy from DN-DETR. *: data augmentation strategy from DOQ. The weights fine-tuned on HICO-DET for the two-stage methods (e.g., UPT and PViC) can be downloaded here.
### V-COCO

| Method | Scenario 1 | Model | Config |
| --- | --- | --- | --- |
| ours (GEN<sub>s</sub>) | 66.6 | model | config |
Zero-shot HOI Detection Results
Type | Unseen | Seen | Full | Model | Config | |
---|---|---|---|---|---|---|
ours (HOICLIP) | UV | 26.30 | 34.49 | 33.34 | model | config |
ours (HOICLIP) | RF-UC | 30.49 | 36.17 | 35.03 | model | config |
ours (HOICLIP) | NF-UC | 28.87 | 29.98 | 29.76 | model | config |
## Citation

Please consider citing our paper if it helps your research.

```
@inproceedings{disentangled_cvpr2024,
  author    = {Zhuolong Li and Xingao Li and Changxing Ding and Xiangmin Xu},
  title     = {Disentangled Pre-training for Human-Object Interaction Detection},
  booktitle = {CVPR},
  year      = {2024},
}
```
## Acknowledgement

The code is built upon DETR, DN-DETR, and CLIP. We thank them for their contributions.