Large-Scale-VRD.pytorch

<p align="center">Example results from the VG80K dataset.</p>

This is a PyTorch implementation of Large-Scale Visual Relationship Understanding (AAAI 2019).

This code is for the VG200 and VRD datasets only. For results on VG80K, please refer to the Caffe2 implementation.

We borrowed the framework from Detectron.pytorch for this project, so there is a lot of overlap between the two.

Benchmarking on Visual Genome

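| Method | Backbone | SGDET@20 | SGDET@50 | SGDET@100 |
|---|---|---|---|---|
| Frequency [1] | VGG16 | 17.7 | 23.5 | 27.6 |
| Frequency+Overlap [1] | VGG16 | 20.1 | 26.2 | 30.1 |
| MotifNet [1] | VGG16 | 21.4 | 27.2 | 30.3 |
| Graph-RCNN [2] | Res-101 | 19.4 | 25.0 | 28.5 |
| Ours | VGG16 | 20.7 | 27.9 | 32.5 |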

Note: 1) We use the frequency prior in our model by default. 2) Results of "Graph-RCNN" are directly copied from their repo.

[1] Zellers, Rowan, et al. "Neural motifs: Scene graph parsing with global context." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[2] Yang, Jianwei, et al. "Graph r-cnn for scene graph generation." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

Requirements

An easy installation if you already have Python 3 and CUDA 9.0:

conda install pytorch=0.4.1
pip install cython
pip install matplotlib numpy scipy pyyaml packaging pycocotools tensorboardX tqdm pillow scikit-image gensim
conda install opencv
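
Before compiling, it can help to confirm that every package above is importable. The following is a minimal sketch, not part of the repository: `missing_modules` is a hypothetical helper, and the module names (`Cython`, `yaml`, `PIL`, `skimage`, `cv2`, `torch`) are assumed to be the usual import names for the packages listed above.

```python
import importlib.util

# Package name (as installed above) -> module name used at import time.
# The module names are assumptions based on each package's usual import name.
REQUIRED_MODULES = {
    "pytorch": "torch",
    "cython": "Cython",
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "scipy": "scipy",
    "pyyaml": "yaml",
    "packaging": "packaging",
    "pycocotools": "pycocotools",
    "tensorboardX": "tensorboardX",
    "tqdm": "tqdm",
    "pillow": "PIL",
    "scikit-image": "skimage",
    "gensim": "gensim",
    "opencv": "cv2",
}


def missing_modules(modules):
    """Return the package names whose top-level module cannot be found."""
    return [pkg for pkg, mod in modules.items()
            if importlib.util.find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("All packages found." if not missing else f"Missing: {missing}")
```

The check only looks for the modules; it does not verify versions such as PyTorch 0.4.1 or CUDA 9.0.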

Compilation

Compile the CUDA code in the Detectron submodule and in the repo:

cd $ROOT/lib
sh make.sh

Annotations

Create a data folder at the top-level directory of the repository:

# ROOT=path/to/cloned/repository
cd $ROOT
mkdir data

Visual Genome

Download it here. Unzip it under the data folder. You should see a vg folder unzipped there. It contains .json annotations that suit the dataloader used in this repo.

Visual Relation Detection

Download it here. Unzip it under the data folder. You should see a vrd folder unzipped there. It contains .json annotations that suit the dataloader used in this repo.

Word2Vec Vocabulary

Create a folder named word2vec_model under data. Download the Google word2vec vocabulary from here. Unzip it under the word2vec_model folder and you should see GoogleNews-vectors-negative300.bin there.
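
To check that the downloaded file is readable, one option is to load it with gensim, which is in the requirements list. This is only a sketch: `word2vec_path` and `load_word2vec` are hypothetical helpers, and the repository may load the vectors differently. `KeyedVectors.load_word2vec_format` is gensim's loader for the binary word2vec format.

```python
from pathlib import Path


def word2vec_path(root="."):
    """Expected location of the vocabulary file, relative to the repo root."""
    return (Path(root) / "data" / "word2vec_model"
            / "GoogleNews-vectors-negative300.bin")


def load_word2vec(root="."):
    """Load the Google word2vec vectors, failing early if the file is absent."""
    path = word2vec_path(root)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; download and unzip it first")
    # Imported lazily so the path check works even without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(str(path), binary=True)
```

Loading the full file takes a while and several gigabytes of memory, so this is best run once as a sanity check.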

Images

Visual Genome

Create a folder for all images:

# ROOT=path/to/cloned/repository
cd $ROOT/data/vg
mkdir VG_100K

Download Visual Genome images from the official page. Unzip all images (part 1 and part 2) into VG_100K/. There should be a total of 108249 files.
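
The file count can be checked with a few lines of Python. This sketch assumes the images were unzipped flat into VG_100K/ with no subfolders; `count_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

EXPECTED_VG_IMAGES = 108249  # total stated above


def count_files(folder):
    """Count regular files directly inside `folder` (non-recursive)."""
    return sum(1 for p in Path(folder).iterdir() if p.is_file())


if __name__ == "__main__":
    folder = Path("data/vg/VG_100K")  # relative to $ROOT
    if folder.is_dir():
        n = count_files(folder)
        status = "OK" if n == EXPECTED_VG_IMAGES else "unexpected"
        print(f"{n} files found ({status}, expected {EXPECTED_VG_IMAGES})")
    else:
        print(f"{folder} not found; run this from the repository root")
```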

Visual Relation Detection

Download the Visual Relation Detection images from here. Unzip them under the vrd folder and you should see train_images and val_images there. Inside them are images with cleaned file names (the original VRD images use hashes as names, and we convert them to numbers).

Pre-trained Object Detection Models

Download the pre-trained object detection models here. Unzip them under the root directory and you should see a detection_models folder there.

Our Trained Relationship Detection Models

Download our trained models here. Unzip them under the root directory and you should see a trained_models folder there.

Directory Structure

The final directories for data and detection models should look like:

|-- detection_models
|   |-- vg
|   |   |-- VGG16
|   |   |   |-- model_step479999.pth
|   |   |-- X-101-64x4d-FPN
|   |   |   |-- model_step119999.pth
|   |-- vrd
|   |   |-- VGG16
|   |   |   |-- model_step4499.pth
|-- data
|   |-- vg
|   |   |-- VG_100K    <-- (contains all Visual Genome images)
|   |   |-- rel_annotations_train.json
|   |   |-- rel_annotations_val.json
|   |   |-- ...
|   |-- vrd
|   |   |-- train_images    <-- (contains Visual Relation Detection training images)
|   |   |-- val_images    <-- (contains Visual Relation Detection validation images)
|   |   |-- new_annotations_train.json
|   |   |-- new_annotations_val.json
|   |   |-- ...
|   |-- word2vec_model
|   |   |-- GoogleNews-vectors-negative300.bin
|-- trained_models
|   |-- e2e_relcnn_VGG16_8_epochs_vg_y_loss_only
|   |   |-- model_step125445.pth
|   |-- e2e_relcnn_X-101-64x4d-FPN_8_epochs_vg_y_loss_only
|   |   |-- model_step125445.pth
|   |-- e2e_relcnn_VGG16_8_epochs_vrd_y_loss_only
|   |   |-- model_step7559.pth
|   |-- e2e_relcnn_VGG16_8_epochs_vrd_y_loss_only_w_freq_bias
|   |   |-- model_step7559.pth
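
The layout above can be verified before running any evaluation. The following is a sketch under the assumption that it runs from the repository root; `missing_paths` is a hypothetical helper, and the list contains only the paths named explicitly in the tree (the entries shown as `...` are not checked).

```python
from pathlib import Path

# Paths taken from the tree above, relative to the repository root.
EXPECTED_PATHS = [
    "detection_models/vg/VGG16/model_step479999.pth",
    "detection_models/vg/X-101-64x4d-FPN/model_step119999.pth",
    "detection_models/vrd/VGG16/model_step4499.pth",
    "data/vg/VG_100K",
    "data/vg/rel_annotations_train.json",
    "data/vg/rel_annotations_val.json",
    "data/vrd/train_images",
    "data/vrd/val_images",
    "data/vrd/new_annotations_train.json",
    "data/vrd/new_annotations_val.json",
    "data/word2vec_model/GoogleNews-vectors-negative300.bin",
    "trained_models/e2e_relcnn_VGG16_8_epochs_vg_y_loss_only/model_step125445.pth",
    "trained_models/e2e_relcnn_X-101-64x4d-FPN_8_epochs_vg_y_loss_only/model_step125445.pth",
    "trained_models/e2e_relcnn_VGG16_8_epochs_vrd_y_loss_only/model_step7559.pth",
    "trained_models/e2e_relcnn_VGG16_8_epochs_vrd_y_loss_only_w_freq_bias/model_step7559.pth",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Directory structure looks complete.")
```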

Evaluating Pre-trained Relationship Detection models

DO NOT CHANGE anything in the provided config files (configs/xx/xxxx.yaml), even if you want to test with fewer or more than 8 GPUs. Use the environment variable CUDA_VISIBLE_DEVICES to control how many and which GPUs to use. Remove the --multi-gpu-testing flag for single-GPU inference.

Visual Genome

NOTE: May require at least 64GB RAM to evaluate on the Visual Genome test set

We use three evaluation metrics for Visual Genome:

  1. SGDET: predict all three labels (subject, object and predicate) and both boxes
  2. SGCLS: predict subject, object and predicate labels given ground truth subject and object boxes
  3. PRDCLS: predict predicate labels given ground truth subject and object boxes and labels

To test a trained model using a VGG16 backbone with "SGDET", run

python ./tools/test_net_rel.py --dataset vg --cfg configs/vg/e2e_relcnn_VGG16_8_epochs_vg_y_loss_only.yaml --load_ckpt trained_models/e2e_relcnn_VGG16_8_epochs_vg_y_loss_only/model_step125445.pth --output_dir Outputs/e2e_relcnn_VGG16_8_epochs_vg_y_loss_only --multi-gpu-testing --do_val

Use --use_gt_boxes option to test it with "SGCLS"; use --use_gt_boxes --use_gt_labels options to test it with "PRDCLS".
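
The three modes differ only in the extra flags, so the command can be assembled programmatically. This is a sketch, not part of the repository: `build_test_command` and `gpu_env` are hypothetical helpers that reuse the script name and flags from the command above and set CUDA_VISIBLE_DEVICES for the child process.

```python
import os

# Extra flags per evaluation mode, as described above.
MODE_FLAGS = {
    "SGDET": [],
    "SGCLS": ["--use_gt_boxes"],
    "PRDCLS": ["--use_gt_boxes", "--use_gt_labels"],
}


def build_test_command(dataset, cfg, ckpt, output_dir,
                       mode="SGDET", multi_gpu=True):
    """Build the argument list for tools/test_net_rel.py."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["python", "./tools/test_net_rel.py",
           "--dataset", dataset,
           "--cfg", cfg,
           "--load_ckpt", ckpt,
           "--output_dir", output_dir]
    if multi_gpu:
        # Dropped for single-GPU inference.
        cmd.append("--multi-gpu-testing")
    cmd.append("--do_val")
    return cmd + MODE_FLAGS[mode]


def gpu_env(gpu_ids):
    """Copy the current environment and restrict the visible GPUs."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    name = "e2e_relcnn_VGG16_8_epochs_vg_y_loss_only"
    cmd = build_test_command(
        "vg",
        f"configs/vg/{name}.yaml",
        f"trained_models/{name}/model_step125445.pth",
        f"Outputs/{name}",
        mode="SGCLS",
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, env=gpu_env([0, 1]), check=True)` to run the evaluation on GPUs 0 and 1.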

To test a trained model using a ResNeXt-101-64x4d-FPN backbone with "SGDET", run

python ./tools/test_net_rel.py --dataset vg --cfg configs/vg/e2e_relcnn_X-101-64x4d-FPN_8_epochs_vg_y_loss_only.yaml --load_ckpt trained_models/e2e_relcnn_X-101-64x4d-FPN_8_epochs_vg_y_loss_only/model_step125445.pth --output_dir Outputs/e2e_relcnn_X-101-64x4d-FPN_8_epochs_vg_y_loss_only --multi-gpu-testing --do_val

Use --use_gt_boxes option to test it with "SGCLS"; use --use_gt_boxes --use_gt_labels options to test it with "PRDCLS".

Visual Relation Detection

To test a trained model using a VGG16 backbone, run

python ./tools/test_net_rel.py --dataset vrd --cfg configs/vrd/e2e_relcnn_VGG16_8_epochs_vrd_y_loss_only.yaml --load_ckpt trained_models/e2e_relcnn_VGG16_8_epochs_vrd_y_loss_only/model_step7559.pth --output_dir Outputs/e2e_relcnn_VGG16_8_epochs_vrd_y_loss_only --multi-gpu-testing --do_val

Training Relationship Detection Models

This section provides the command-line arguments to train our relationship detection models given the pre-trained object detection models described above.

DO NOT CHANGE anything in the provided config files (configs/xx/xxxx.yaml), even if you want to train with fewer or more than 8 GPUs. Use the environment variable CUDA_VISIBLE_DEVICES to control how many and which GPUs to use.

With the following command lines, the training results (models and logs) should be in $ROOT/Outputs/xxx/ where xxx is the .yaml file name used in the command without the ".yaml" extension. If you want to test with your trained models, simply run the test commands described above by setting --load_ckpt as the path of your trained models.
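
The mapping from config file to output folder, and the lookup of a checkpoint for --load_ckpt, can be sketched as follows. Both helpers are hypothetical. The checkpoint search assumes that training writes files named `model_step<N>.pth` somewhere under the output folder, matching the naming of the provided models; the exact subfolder layout is not specified here, so the search is recursive.

```python
import re
from pathlib import Path

STEP_RE = re.compile(r"model_step(\d+)\.pth$")


def output_dir_for(cfg_path, root="."):
    """Outputs/<yaml file name without the .yaml extension> under `root`."""
    return Path(root) / "Outputs" / Path(cfg_path).stem


def latest_checkpoint(output_dir):
    """Return the model_step<N>.pth file with the largest N, or None."""
    best = None
    for path in Path(output_dir).rglob("model_step*.pth"):
        match = STEP_RE.search(path.name)
        if match is None:
            continue
        step = int(match.group(1))  # compare numerically, not as strings
        if best is None or step > best[0]:
            best = (step, path)
    return None if best is None else best[1]


if __name__ == "__main__":
    cfg = "configs/vg/e2e_relcnn_VGG16_8_epochs_vg_y_loss_only.yaml"
    out = output_dir_for(cfg)
    print(out)
    if out.is_dir():
        print(latest_checkpoint(out))
```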

Visual Genome

To train our relationship network using a VGG16 backbone, run

python tools/train_net_step_rel.py --dataset vg --cfg configs/vg/e2e_relcnn_VGG16_8_epochs_vg_y_loss_only.yaml --nw 8 --use_tfboard

To train our relationship network using a ResNeXt-101-64x4d-FPN backbone, run

python tools/train_net_step_rel.py --dataset vg --cfg configs/vg/e2e_relcnn_X-101-64x4d-FPN_8_epochs_vg_y_loss_only.yaml --nw 8 --use_tfboard

Visual Relation Detection

To train our relationship network using a VGG16 backbone, run

python tools/train_net_step_rel.py --dataset vrd --cfg configs/vrd/e2e_relcnn_VGG16_8_epochs_vrd_y_loss_only.yaml --nw 8 --use_tfboard

(Optional) Training Object Detection Models

This repo provides code for training object detectors for Visual Genome using a ResNeXt-101-64x4d-FPN backbone.

First download weights of ResNeXt-101-64x4d-FPN trained on COCO here. Unzip it under the data directory and you should see a detectron_model folder.

To train the object detector, run

python ./tools/train_net_step.py --dataset vg --cfg configs/e2e_faster_rcnn_X-101-64x4d-FPN_1x_vg.yaml --nw 8 --use_tfboard

The training results (models and logs) should be in $ROOT/Outputs/e2e_faster_rcnn_X-101-64x4d-FPN_1x_vg/.

Acknowledgements

This repository uses code based on the Neural-Motifs source code from Rowan Zellers, as well as code from the Detectron.pytorch repository by Roy Tseng.

Citing

If you use this code in your research, please use the following BibTeX entry.

@conference{zhang2018large,
  title={Large-Scale Visual Relationship Understanding},
  author={Zhang, Ji and Kalantidis, Yannis and Rohrbach, Marcus and Paluri, Manohar and Elgammal, Ahmed and Elhoseiny, Mohamed},
  booktitle={AAAI},
  year={2019}
}