Awesome

Compositional Learning for Human-Object Interaction Detection

This repository includes the code of

Visual Compositional Learning for Human-Object Interaction Detection (ECCV2020)

Detecting Human-Object Interaction via Fabricated Compositional Learning (CVPR2021)

Affordance Transfer Learning for Human-Object Interaction Detection (CVPR2021)

Discovering Human-Object Interaction Concepts via Self-Compositional Learning (ECCV2022)

avatar

If you have any questions, feel free to create issues or contact zhou9878 [at] uni dot sydney dot edu dot au.

Prerequisites

This codebase was developed and tested with Python3.7, Tensorflow 1.14.0, Matlab (for evaluation), CUDA 10.0 and Centos 7

Installation

Download HICO-DET dataset. Setup V-COCO and COCO API. Setup HICO-DET evaluation code.
```
chmod +x ./misc/download_dataset.sh 
./misc/download_dataset.sh 
```
Install packages by pip.
```
pip install -r requirements.txt
```
Download COCO pre-trained weights and training data
```
chmod +x ./misc/download_training_data.sh 
./misc/download_training_data.sh
```
Due to the limitation of space in my google drive, Additional files for ATL are provided in CloudStor.

VCL

See GETTING_STARTED_VCL.md. You can also find details in https://github.com/zhihou7/VCL.

FCL

See GETTING_STARTED_FCL.md

ATL

See GETTING_STARTED_ATL.md. We try to use a HOI model to recognize object affordance, i.e. what actions can be applied to an object.

SCL

See GETTING_STARTED_SCL.md. We propose a novel task, HOI Concept Discovery, to discover reasonable concepts/categories for HOI detection.

Experiment Results

Long-tailed HOI detection

mAP on HICO-DET (Default)

Model	Full	Rare	Non-Rare
VCL	23.63	17.21	25.55
FCL	24.68	20.03	26.07
VCL+FCL	25.27	20.57	26.67
VCL <sup>GT</sup>	43.09	32.56	46.24
FCL<sup>GT</sup>	44.26	35.46	46.88
(VCL+FCL)<sup>GT</sup>	45.25	36.27	47.94

Here, VCL+FCL is the fusion of the two model predictions, which illustrates the complementary between the two lines of work. Table 12 in ATL also illustrates the differences between VCL and FCL. We also tried to directly train the network with the two methods. However, it is worse than the score fusion result.

Zero-Shot HOI detection

Compositional Zero-Shot

Model	Unseen	Seen	Full
VCL(rare first)	10.06	24.28	21.43
FCL(rare first)	13.16	24.23	22.01
ATL (rare first)	9.18	24.67	21.57
Qpic	15.24	30.44	27.40
SCL+Qpic (rare first)	19.07	30.39	28.08
VCL(non-rare first)	16.22	18.52	18.06
FCL(non-rare first)	18.66	19.55	19.37
ATL (non-rare first)	18.25	18.78	18.67
Qpic	21.03	23.73	23.19
SCL+Qpic (rare first)	21.73	25.00	24.34

Novel Object Zero-Shot

Model	Unseen	Seen	Full
ATL (HICO-DET)	11.35	20.96	19.36
FCL	15.54	20.74	19.87
ATL (COCO)	15.11	21.54	20.47
ATL (HICO-DET)*	0.00	13.67	11.39
FCL*	0.00	13.71	11.43
ATL (COCO)*	5.05	14.69	13.08

* means we remove the object identity information from the detector and only use the boxes.

UC is compositional zero-shot HOI detection. UO means novel object zero-shot HOI detection.

Experimently, FCL achieves better zero-shot performance on compositional zero-shot HOI detection. Under novel object zero-shot HOI detection, ATL is more suitable.

Object Affordance Recognition

Here, we provides the result reported by AP.

Method	HOI Data	Object	Val2017	Object365	HICO-DET	Novel classes
Baseline	HOI	-	31.91	26.16	44.00	14.27
FCL	HOI	-	41.89	32.20	55.95	18.84
VCL	HOI	HOI	76.43	69.04	86.89	32.36
ATL	HOI	HOI	76.52	69.27	87.20	34.20
ATL	HOI	COCO	90.84	85.83	92.79	36.28

Baseline	HICO	-	19.71	17.86	23.18	6.80
FCL	HICO	-	25.11	25.21	37.32	6.80
VCL	HICO	HICO	36.74	35.73	43.15	12.05
ATL	HICO	HICO	52.01	50.94	59.44	15.64
ATL	HICO	COCO	56.05	40.83	57.41	8.52
SCL	HICO	HICO	72.08	57.53	82.47	18.55

ATL<sup>ZS</sup>	HICO	HICO	24.21	20.88	28.56	12.26
ATL<sup>ZS</sup>	HICO	COCO	35.55	31.77	39.45	13.25

Data & Model

Data

We present the differences between different detector in our paper and analyze the effect of object boxes on HOI detection. VCL detector and DRG detector can be download from the corresponding paper. Due to the space limitation of Google Drive, there are many files provided in CloudStor. Many thanks to CloudStor and The University of Sydney. Here, we provide the GT boxes.

GT boxes annotation: https://drive.google.com/file/d/15UXbsoverISJ9wNO-84uI4kQEbRjyRa8/view?usp=sharing

FCL was finished about 10 months ago. In the first submission, we compare the difference among COCO detector, Fine-tuned Detector and GT boxes. We further find DRG object detector largely increases the baseline. All these comparisons illustrate the significant effect of object detector on HOI. That's really necessary to provide the performance of object detector.

HOI-COCO training data: https://cloudstor.aarnet.edu.au/plus/s/6NzReMWHblQVpht

Please notice train2017 might contain part of V-COCO test data. Thus, we just use train2014 in our experiment. If we use train2017, the result might be better (improve about 0.5%). We think that is the case: we have localized objects, but we do not know the interaction.

See DATA.md to obtain more test and training data.

Pre-trained Models

See MODEL.md

Qustion & Answer

Thanks for all reviewer's comments.

FCL inspires a lot to combine the object feature from object dataset and verb feature (i.e. ATL on HOI detection). ATL further gives a lot of insight to us in HOI understanding. We are still working on the relationship between HOI understanding and object understanding, the compositionality of HOI.

Different Object Detectors

We illustrate the difference between DRG boxes and VCL boxes in Table 16 in FCL. The recall of FCL with DRG box is nearly similar to GT boxes (62.07 (VCL), 82.81(DRG), 86.08(GT) respectively).

Human Box Verb

Following VCL, We extract the verb representation from union box in ATL. However, we find the verb representation (union box vs human box) has an important effect on affordance recognition. Please see https://github.com/zhihou7/HOI-CL/issues/1. Thanks for this issue. We'll update appendix in the version of arxiv.

Verb vs HOI prediction.

Currently, some approaches predict verbs for HOI detection, while here we directly predict HOI categories. Experimentally, we find the performance of object affordance recognition apparently drops on HOI-COCO, while it achieves a similar result when we directly predict HOI categories or verb categories.

Citations

If you find this series of work are useful for you, please consider citing:

@inproceedings{hou2022scl,
  title={Discovering Human-Object Interaction Concepts via Self-Compositional Learning},
  author={Hou, Zhi and Yu, Baosheng and Tao, Dacheng},
  booktitle={ECCV},
  year={2022}
}

@inproceedings{hou2021fcl,
  title={Detecting Human-Object Interaction via Fabricated Compositional Learning},
  author={Hou, Zhi and Yu, Baosheng and Qiao, Yu and Peng, Xiaojiang and Tao, Dacheng},
  booktitle={CVPR},
  year={2021}
}

@inproceedings{hou2021vcl,
  title={Visual Compositional Learning for Human-Object Interaction Detection},
  author={Hou, Zhi and Peng, Xiaojiang and Qiao, Yu  and Tao, Dacheng},
  booktitle={ECCV},
  year={2020}
}

@inproceedings{hou2021atl,
  title={Affordance Transfer Learning for Human-Object Interaction Detection},
  author={Hou, Zhi and Yu, Baosheng and Qiao, Yu and Peng, Xiaojiang and Tao, Dacheng},
  booktitle={CVPR},
  year={2021}
}

Acknowledgement

Codes are built upon Visual Compositional Learning for Human-Object Interaction Detection, iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection, Transferable Interactiveness Network, tf-faster-rcnn.

Thanks for all reviewer's comments, e.g. object feature illustration by t-SNE figure, which shows the fabricated objects are a bit different from real object features.