large_vocabulary_hoi_detection

Code for ICCV 2021: Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection [paper]

Assembled Datasets

SWiG-HOI

Note: The assembled SWiG-HOI is based on the SWiG dataset. Please refer to this repo to download the dataset. The processed annotations can be downloaded from here.

More information can be found in assembled_datasets.md.
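If the processed annotations are distributed as JSON (check the file extension after downloading), it is easy to inspect them before training. A minimal sketch; the path below is a placeholder, not the repo's actual layout:

import json

# Placeholder path: point this at the annotation file you downloaded.
ANN_FILE = "data/swig_hoi/annotations.json"

with open(ANN_FILE) as f:
    annotations = json.load(f)

# Inspect the top-level structure before writing any loading code.
if isinstance(annotations, dict):
    print("keys:", list(annotations.keys())[:10])
else:
    print("entries:", len(annotations))
    print("first entry:", annotations[0])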

Getting Started

Installation

This code is built on Detectron2. Note that you don't need to build Detectron2 separately, as this codebase is self-contained. Follow the instructions below to install the dependencies and build this repo.

conda create -n choir python=3.7
conda activate choir
# Install PyTorch >= 1.7
conda install pytorch==1.7.0 torchvision==0.8.0 cudatoolkit=10.2 -c pytorch
# Timm (https://github.com/rwightman/pytorch-image-models)
pip install timm
conda install scipy matplotlib
# fvcore
pip install 'git+https://github.com/facebookresearch/fvcore'
# install LVIS API, for large vocabulary object detection evaluation
pip install lvis pycocotools
# Build this repo
python setup.py build develop
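After the build finishes, a quick way to confirm the environment resolved correctly is to import the core dependencies and print their versions (standard PyTorch/Detectron2/timm attributes; since Detectron2 is bundled with this codebase, its import should succeed once the build step completes):

import torch
import torchvision
import detectron2
import timm

print("torch:", torch.__version__)              # expect 1.7.x per the step above
print("torchvision:", torchvision.__version__)  # expect 0.8.x
print("CUDA available:", torch.cuda.is_available())
print("detectron2:", detectron2.__version__)
print("timm:", timm.__version__)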

Inference Demo with Pre-trained Models

Pick a model and its config file from the model zoo, for example configs/swig_hoi/hoir_swin.yaml and this model. Run it with:

python tools/demo.py --config-file configs/swig_hoi/hoir_swin.yaml \
  --input figures/example*.jpg \
  --confidence-threshold 0.2 \
  --opts MODEL.WEIGHTS path-to-model-weights
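The demo script wraps the usual Detectron2 inference flow. If you want to call the model from your own code, here is a hedged sketch following standard Detectron2 conventions; it is a starting point, not this repo's exact API, since the repo's configs may add custom keys that the vanilla config loader rejects:

import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# This repo's YAML may add custom keys; if merge_from_file rejects them,
# build the cfg the way tools/demo.py does instead.
cfg.merge_from_file("configs/swig_hoi/hoir_swin.yaml")
cfg.MODEL.WEIGHTS = "path-to-model-weights"  # same placeholder as the command above

predictor = DefaultPredictor(cfg)
image = cv2.imread("figures/example1.jpg")  # any of the example images
outputs = predictor(image)
print(outputs)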

Training & Evaluation in Command Line

To train a model, run:

python tools/train_net.py --num-gpus 8 \
    --config-file configs/swig_hoi/hoir_swin.yaml

To evaluate a trained model, run:

python tools/test_net.py --num-gpus 1 \
    --config-file configs/swig_hoi/hoir_swin.yaml \
    --eval-only
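The evaluation command above does not name a checkpoint explicitly; in Detectron2-style scripts the weights typically come from MODEL.WEIGHTS in the config, or from the latest checkpoint in the output directory. For reference, a hedged sketch of how such scripts load a trained model, assuming standard Detectron2 APIs:

from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.modeling import build_model

cfg = get_cfg()
cfg.merge_from_file("configs/swig_hoi/hoir_swin.yaml")  # may need this repo's config helpers
model = build_model(cfg)  # assumes the repo's architecture is registered on import
# Load weights from a local path or URL, as the train/test scripts do internally.
DetectionCheckpointer(model).load("path-to-model-weights")
model.eval()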

Citing

If you find this repository useful, please consider citing:

@InProceedings{Wang_2021_ICCV,
    author    = {Wang, Suchen and Yap, Kim-Hui and Ding, Henghui and Wu, Jiyan and Yuan, Junsong and Tan, Yap-Peng},
    title     = {Discovering Human Interactions With Large-Vocabulary Objects via Query and Multi-Scale Detection},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2021},
}

If you find the SWiG and 100DOH datasets useful, please also cite their papers:

@article{Pratt2020Swig,
    title     = {Grounded Situation Recognition},
    author    = {Sarah Pratt and Mark Yatskar and Luca Weihs and Ali Farhadi and Aniruddha Kembhavi},
    journal   = {ArXiv},
    year      = {2020},
    volume    = {abs/2003.12058}
}
@INPROCEEDINGS{Shan20,
    author    = {Shan, Dandan and Geng, Jiaqi and Shu, Michelle and Fouhey, David},
    title     = {Understanding Human Hands in Contact at Internet Scale},
    booktitle = {CVPR},
    year      = {2020}
}

TODO