
Visual-Semantic Graph Attention Network for Human-Object Interaction Detection

<!---------------------------------------------------------------------------------------------------------------->

Official PyTorch implementation of Visual-Semantic Graph Attention Network for Human-Object Interaction Detection.

Preamble

A) Generally, HOI detection includes two steps: <b>object detection</b> and <b>interaction inference</b>.

B) For interaction inference, many previous works <b>focused mainly on features of the human and the directly interacted object</b>.

C) Our insight: not only the <b>primary relations</b> but also the <b>subsidiary relations</b> provide significant cues for interaction inference: <b>contextual information</b>.

<!---------------------------------------------------------------------------------------------------------------->

VS-GATs

We study the disambiguating power of subsidiary scene relations via a <b>double Graph Attention Network</b> that aggregates <b>visual-spatial and semantic information</b> in parallel. The network uses attention to leverage primary and subsidiary contextual cues to gain additional disambiguating power. <img align='center' src='./assets/overview.png' width='1000'>

<b>Visual-Semantic Graph Attention Network</b>: After instance detection, a visual-spatial graph and a semantic graph are created. Edge weights are computed dynamically through attention. We combine these graphs and then perform a readout step on box pairs to infer all possible predicates between one subject and one object.
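The attention-based edge weighting can be illustrated with a minimal NumPy sketch (this is a simplified toy version with a hand-set parameter vector `a`, not the actual learned layers in this repository): each node aggregates its incoming neighbors' features, weighted by a softmax over per-edge attention scores.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(h, edges, a):
    """Aggregate neighbor features with attention weights (toy GAT-style step).

    h:     (n, d) node feature matrix
    edges: list of directed (src, dst) pairs
    a:     (2d,) attention parameter vector (stand-in for a learned layer)
    """
    n, d = h.shape
    out = np.zeros_like(h)
    for j in range(n):
        srcs = [i for i, dst in edges if dst == j]
        if not srcs:
            out[j] = h[j]
            continue
        # Unnormalized attention score per incoming edge, from the
        # concatenated source and destination node features.
        scores = np.array([a @ np.concatenate([h[i], h[j]]) for i in srcs])
        alpha = softmax(scores)  # edge weights sum to 1 per destination node
        out[j] = sum(w * h[i] for w, i in zip(alpha, srcs))
    return out
```

With `a` set to zeros, the scores are uniform and each node simply averages its incoming neighbors; a learned `a` instead lets the network emphasize the most informative (primary or subsidiary) relations.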

<!---------------------------------------------------------------------------------------------------------------->

Graph

Preliminary

A graph $G$ is defined as $G=(V, E)$, consisting of a set of nodes $V$ and a set of edges $E$. Node features and edge features are denoted by $\mathbf{h}_v$ and $\mathbf{h}_e$ respectively. Let $v_i \in V$ be the $i$-th node and $e_{i,j}=(v_i,v_j) \in E$ be the directed edge from $v_i$ to $v_j$.

A graph with $n$ nodes and $m$ edges has a node feature matrix $\mathbf{X}_v \in \mathbf{R}^{n \times d}$ and an edge feature matrix $\mathbf{X}_e \in \mathbf{R}^{m \times c}$, where $\mathbf{h}_{v_i} \in \mathbf{R}^d$ is the feature vector of node $i$ and $\mathbf{h}_{e_{i,j}} \in \mathbf{R}^c$ is the feature vector of edge $(i,j)$. With fully connected directed edges, $e_{i,j} \neq e_{j,i}$.
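These definitions can be sketched in a few lines of Python (the dimensions `n`, `d`, and `c` below are arbitrary illustration values, not the ones used in this project):

```python
import numpy as np

n, d, c = 4, 5, 3  # number of nodes, node-feature dim, edge-feature dim

# All ordered pairs (i, j) with i != j: a fully connected directed graph,
# so (i, j) and (j, i) are distinct edges.
edges = [(i, j) for i in range(n) for j in range(n) if i != j]
m = len(edges)  # m = n * (n - 1)

X_v = np.random.randn(n, d)  # node feature matrix, shape (n, d)
X_e = np.random.randn(m, c)  # edge feature matrix, shape (m, c)
```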

DGL

<a href='https://docs.dgl.ai/en/latest/install/index.html'>DGL</a> is a Python package dedicated to deep learning on graphs, built atop existing tensor DL frameworks (e.g. PyTorch, MXNet), that simplifies the implementation of graph-based neural networks.

<!---------------------------------------------------------------------------------------------------------------->

Code Overview

In this project, we implement our method using PyTorch and the DGL library. There are three main folders:

In the following, we briefly introduce some main scripts.

datasets/

model/

result/

others

<!---------------------------------------------------------------------------------------------------------------->

Getting Started

Prerequisites

Installation

  1. Clone this repository.

    git clone https://github.com/BIGJUN777/VS-GATs.git
    
  2. Install Python dependencies:

    pip install -r requirements.txt
    

Prepare Data

Download original data

  1. Download the original HICO-DET dataset and put it into datasets/hico.
  2. Follow here to prepare the original data of V-COCO dataset in datasets/vcoco folder.
  3. Download the word2vec model pretrained on GoogleNews and put it into datasets/word2vec.

Process Data

  1. You can directly download our processed data for HICO-DET (password: 3rax) and V-COCO (password: bfad) and extract them, keeping the original file names, into datasets/processed.

  2. If you want to process the data yourself, first copy the two files utils/generalized_rcnn.py and utils/roi_heads.py into the folder containing the Faster R-CNN implementation (e.g. ~/anaconda3/envs/py3_test/lib/python3.6/site-packages/torchvision/models/detection/), because we slightly modify the related code to save the desired visual features (we recommend backing up the original files first). Then run the following commands:

    bash datasets/hico_process.sh
    bash datasets/vcoco_process.sh
    

Training

Testing

Results

Acknowledgement

In this project, some of the code for data processing and model evaluation is built upon ECCV2018-Learning Human-Object Interactions by Graph Parsing Neural Networks and ICCV2019-No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques. We thank the authors for their great work.