Awesome

The implementation of "Discovering Human Interactions with Novel Objects via Zero-Shot Learning", in CVPR, 2020.

ZSHOI

To update.

Getting Started

Prerequisites

Linux or macOS with Python ≥ 3.6
PyTorch ≥ 1.4, torchvision that matches the PyTorch installation.
Detectron2
Other packages listed in reuirements.txt

Installation

Please follow the instructions to install detectron2 first.
Install other dependencies by pip install -r requirements.txt or conda install --file requirements.txt
Download and prepare the data by cd datasets; sh prepare_data.sh.
- The HICO-DET dataset and V-COCO dataset.
  - If you already have, please comment out the corresponding lines in prepare_data.sh and hard-code the dataset path using your custom path in lib/data/datasets/builtin.py.
- COCO's format annotations for HICO-DET and VCOCO dataset.
- Glove semantic embeddings.

Demo Inference with Pre-trained Model

Download our pre-trained model on HICO-DET dataset or V-COCO dataset. Note: HICO-DET dataset allows 116 (excluding "no_interaction") actions and V-COCO allows 25 actions.

cd demo
# Download pre-trained model on HICO-DET dataset
sh download_pretrained_hicodet.sh
# Or download pre-trained model on V-COCO dataset
sh download_pretrained_vcoco.sh

Run demo with pre-trained model (for example, pretrained model on HICO-DET)
```
python demo.py --config-file ./configs/HICO-DET/interaction_R_50_FPN.yaml \
  --input ./demo/HICO_test2015_00003124.jpg \
  --opts MODEL.WEIGHTS ./output/hico_det_pretrained.pkl
```
- If to run demo for images in directory/*.jpg, replace --input input1.jpg input2.jpg with --input directory/*.jpg.
- If to run demo on a video, please replace --input input1.jpg input2.jpg with --video-input video.mp4.
- To save outputs to a directory (for images) or a file (for webcam or video), use --output, by default ./output/

Run demo to discover human interactions with novel (zero-shot) objects. Please incidate the interested novel objects (categories out of 80 MS-COCO objects) via command line arguments.

python demo.py --config-file ./configs/HICO-DET/interaction_zero_shot_R_50_FPN.yaml \
  --novel-object microphone paddle \
  --input ./demo/HICO_test2015_00003124.jpg \
  --opts MODEL.WEIGHTS ./output/hico_det_pretrained_agnostic.pkl

Training a model and running inference

1. Human-Object Region Proposals Network (HORPN) only

This example is provided for training the human-object region proposals network (note: not for the interacting object detection or HOI detection). HORPN is used as the first stage of our full model to generate region proposals of interacting objects. This example will train HORPN on vcoco_train_known set which includes only the images and annotations of known objects. Please hard-code the path to images and annotation files in lib/data/datasets/builtin.py before runing the code.

# To train HORPN
python train_net.py --num-gpus 2 \
  --config-file configs/VCOCO/horpn_only_R_50_FPN.yaml \
  OUTPUT_DIR ./output/vcoco_horpn_only

To run inference on vcoco_val which includes both known and novel objects.

# To run inference to evaluate HORPN. Using multiple GPUs can reduce the total inference time.
python train_net.py --eval-only --num-gpus 2 \
  --config-file configs/VCOCO/horpn_only_R_50_FPN.yaml \
  MODEL.WEIGHTS ./output/vcoco_horpn_only/model_final.pth \
  OUTPUT_DIR ./output/vcoco_horpn_only

Expected results

Inference time should around 0.069s/image (on V100 GPU)
The evaluation results of generated proposals will be listed, e.g, AR, Recall
Expected results Recall(IoU=0.5)@100 Recall(IoU=0.5)@500
Known objects 92.34 96.53
Novel objects 81.64 92.42

Expected results	Recall(IoU=0.5)@100	Recall(IoU=0.5)@500
Known objects	92.34	96.53
Novel objects	81.64	92.42

2. Interacting Object Detection

The following examples train a model to detect interacting objects. In this case, we aim to detect objects which are interacting with humans. We train the model on hico-det_train set using all 80 MS-COCO object categories.

# Interacting object detection
python train_net.py --num-gpus 2 \
  --config-file configs/HICO-DET/interacting_objects_R_50_FPN.yaml OUTPUT_DIR ./output/interacting_objects

To run inference on hico-det_test. We use COCO's metrics and APIs to conduct evaluation. Note that the ground-truth only includes interacting objects (non-interacting objects will be seen as background).

# To run inference. Using multiple GPUs can reduce the total inference time.
python train_net.py --eval-only --num-gpus 2 \
  --config-file configs/HICO-DET/interacting_objects_R_50_FPN.yaml \
  MODEL.WEIGHTS ./output/HICO_interacting_objects/model_final.pth \
  OUTPUT_DIR ./output/HICO_interacting_objects

Expected results

Inference time should around 0.074s/image (on V100 GPU)
The results of COCO's metrics will be listed, e.g, per-class Average Precision (AP)
Expected results AP AP50 AP75
Interacting objects 25.623 44.765 25.768

Expected results	AP	AP50	AP75
Interacting objects	25.623	44.765	25.768

3. HOI Detection

The following examples train a model to detect human-object interactions using hico-det_train set. Here we use all 80 MS-COCO object categories.

# Interacting object detection
python train_net.py --num-gpus 2 \
  --config-file configs/HICO-DET/interaction_R_50_FPN.yaml OUTPUT_DIR ./output/HICO_interaction

To run inference on hico-det_test. This code will trigger the official HICO-DET MATLAB evaluation. Please make sure MATLAB is available in your machine and check the hard-coded path cfg.TEST.HICO_OFFICIAL_ANNO_FILE and cfg.TEST.HICO_OFFICIAL_BBOX_FILE can direct to the original HICO-DET annotation files.

# To run inference. Using multiple GPUs can reduce the total inference time.
python train_net.py --eval-only --num-gpus 2 \
  --config-file configs/HICO-DET/interaction_R_50_FPN.yaml \
  MODEL.WEIGHTS ./output/interaction_R_50_FPN.yaml/model_final.pth \
  OUTPUT_DIR ./output/interaction_R_50_FPN.yaml

Expected results

Inference time should around 0.0766s/image (on V100 GPU).
It will list the results of COCO's metrics on interacting object detection as above.
The results of HICO-DET's metrics will be listed, e.g,

Expected results full rare non-rare
Default mAP 20.096 14.969 21.628
Default mAP (excluding "no_interaction") 23.188 15.650 25.753

Note: This result is better than the result reported in our paper due to some code optimization.

Expected results	full	rare	non-rare
Default mAP	20.096	14.969	21.628
Default mAP (excluding "no_interaction")	23.188	15.650	25.753

3. Zero-Shot HOI Detection

The following examples train a model to detect human interactions with novel objects using hico-det_train set. The only difference from the above is to use a class agnostic bbox regressor here.

# Interacting object detection
python train_net.py --num-gpus 2 \
  --config-file configs/HICO-DET/interaction_zero_shot_R_50_FPN.yaml \
  OUTPUT_DIR ./output/HICO_interaction_zero_shot

To run inference, you can specify the interested novel object categories by --novel-object object1 object2 object3.

  python demo.py \
    --config-file ./configs/HICO-DET/interaction_zero_shot_R_50_FPN.yaml \
    --novel-object microphone paddle \
    --input ./demo/HICO_test2015_00003124.jpg \
    --opts MODEL.WEIGHTS ./output/hico_det_pretrained.pkl

Known/Novel Split - 80 MS-COCO Objects

To simulate the zero-shot cases, we split the 80 object categories into a known and novel set based on their occurrence frequency in HICO-DET and VCOCO datasets. The split can be found at datasets/known_novel_split.py.

Citing

If you use this code in your research or wish to refer to the baseline results published, please use the following BibTeX.

@InProceedings{Wang_2020_CVPR,
author = {Wang, Suchen and Yap, Kim-Hui and Yuan, Junsong and Tan, Yap-Peng},
title = {Discovering Human Interactions with Novel Objects via Zero-Shot Learning},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}

License

This project is licensed under the MIT License - see the LICENSE.md file for details