Composite Relationship Fields with Transformers for Scene Graph Generation
This codebase is the official implementation of "Composite Relationship Fields with Transformers for Scene Graph Generation" (accepted at WACV 2023).
Scene graph generation (SGG) methods aim to extract a structured semantic representation of a scene by detecting the objects present and their relationships. While most methods focus on improving top-down approaches, which build a scene graph based on predicted objects from an off-the-shelf object detector, there is a limited amount of work on bottom-up approaches, which directly predict objects and their relationships in a single stage.
In this work, we present a novel bottom-up SGG approach by representing relationships using Composite Relationship Fields (CoRF). CoRF turns relationship detection into a dense regression and classification task, where each cell of the output feature map identifies surrounding objects and their relationships. Furthermore, we propose a refinement head that leverages Transformers for global scene reasoning, resulting in more meaningful relationship predictions. By combining both contributions, our method outperforms previous bottom-up methods on the Visual Genome dataset by 26% while preserving real-time performance.
Requirements
This codebase is based on the publicly available openpifpaf/openpifpaf repository. We modify certain files from OpenPifPaf and add the remaining parts as plugins. We also include a modified version of apex, which Scene-Graph-Benchmark.pytorch relies on for evaluation. The main dependencies of this codebase are:
- Python 3.8.5
- Apex
- Openpifpaf
- Scene-Graph-Benchmark.pytorch
Before installing the requirements, we recommend creating a virtual environment in which all packages will be installed.
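For example, using Python's built-in venv module (the environment name corf-env is just an illustration):
python3 -m venv corf-env
source corf-env/bin/activate
pip install --upgrade pip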
First, make sure that the openpifpaf and apex folders are inside the main folder (SGG-CoRF). Activate the virtual environment (optional).
Then, install the requirements:
cd SGG-CoRF
pip install numpy Cython
cd openpifpaf
pip install --editable '.[dev,train,test]'
pip install tqdm h5py graphviz ninja yacs cython matplotlib opencv-python overrides timm
# Make sure to re-install the correct pytorch version for your GPU from https://pytorch.org/get-started/locally/
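# For example, for CUDA 11.3 (illustrative version numbers; adapt to your setup):
# pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html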
# install apex
cd ../apex
python setup.py install --cuda_ext --cpp_ext
# install PyTorch Detection
cd ../
git clone https://github.com/KaihuaTang/Scene-Graph-Benchmark.pytorch.git
cd Scene-Graph-Benchmark.pytorch
python setup.py build develop
cd ../openpifpaf
Note: if training or evaluation crashes with an error related to torch_six.PY3, follow these steps:
cd Scene-Graph-Benchmark.pytorch
vim maskrcnn_benchmark/utils/imports.py
# change the line torch_six.PY3 to torch_six.PY37
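Alternatively, the same edit can be applied non-interactively from the Scene-Graph-Benchmark.pytorch directory (a one-liner sketch, assuming GNU sed):
sed -i 's/torch_six\.PY3\b/torch_six.PY37/' maskrcnn_benchmark/utils/imports.py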
To perform the following steps, make sure you are in the main openpifpaf directory (SGG-CoRF/openpifpaf).
To train the model, the dataset must first be downloaded and pre-processed (a shell sketch of these steps follows the list below):
- Create a folder called data, and inside it a folder called visual_genome
- Download the images from Visual Genome (parts 1 and 2)
- Place all images into data/visual_genome/VG_100K/
- Create VG-SGG.h5, imdb_512.h5, imdb_1024.h5, and VG-SGG-dicts.json by following the instructions here, and place them in data/visual_genome/. To create imdb_512.h5, you will need to change the 1024 to 512 in create_imdb.sh.
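The following sketch prepares this layout; the download URLs are the commonly used Visual Genome links, so verify them against the official website:
mkdir -p data/visual_genome/VG_100K
# download and unpack both image parts
wget https://cs.stanford.edu/people/rak248/VG_100K/images.zip
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip
unzip images.zip && unzip images2.zip
# merge all images into a single folder (find avoids "argument list too long" with ~100K files)
find VG_100K VG_100K_2 -name '*.jpg' -exec mv {} data/visual_genome/VG_100K/ \;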
Training
To train the model(s) in the paper, run these commands from the main openpifpaf directory:
To train a ResNet-50 model with the transformer modules:
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --basenet=resnet50 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head
To train a Swin-S model with the transformer modules:
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --swin-use-fpn --basenet=swin_s \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head
To train a ResNet-50 model without the transformer modules:
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --basenet=resnet50 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-deep4-head --cf3-deform-bn --cntrnet-deform-bn --cntrnet-deform-deep4-head --adamw
To train a Swin-S model without the transformer modules:
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --swin-use-fpn --basenet=swin_s \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-deep4-head --cf3-deform-bn --cntrnet-deform-bn --cntrnet-deform-deep4-head --adamw
To train a ResNet-50 model with the transformer modules with GT detection tokens as input (PredCls):
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --basenet=resnet50 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head --cntrnet-deform-prior-token predcls --cf3-deform-prior-token prior_vect_detcls \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To train a ResNet-50 model with the transformer modules with GT detection tokens as input (SGCls):
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --basenet=resnet50 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head --cntrnet-deform-prior-token sgcls --cf3-deform-prior-token prior_vect_det \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To train a Swin-S model with the transformer modules with GT detection tokens as input (PredCls):
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --swin-use-fpn --basenet=swin_s \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head --cntrnet-deform-prior-token predcls --cf3-deform-prior-token prior_vect_detcls \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To train a Swin-S model with the transformer modules with GT detection tokens as input (SGCls):
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --swin-use-fpn --basenet=swin_s \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head --cntrnet-deform-prior-token sgcls --cf3-deform-prior-token prior_vect_det \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
Note: to perform distributed training on multiple GPUs, as mentioned in the paper, add the --ddp argument right after openpifpaf.train in the commands above.
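For example, the first ResNet-50 command becomes:
python -m openpifpaf.train --ddp --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 # ...remaining arguments unchanged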
Pre-trained Models
You can download the pretrained models from here:
- Pretrained ResNet-50 Model with transformers trained on Visual Genome.
- Pretrained ResNet-50 Model without transformers trained on Visual Genome.
- Pretrained Swin-S Model with transformers trained on Visual Genome.
- Pretrained Swin-S Model without transformers trained on Visual Genome.
Models with GT detection tokens as input:
- Pretrained ResNet-50 Model with transformers (for PredCls) trained on Visual Genome.
- Pretrained ResNet-50 Model with transformers (for SGCls) trained on Visual Genome.
- Pretrained Swin-S Model with transformers (for PredCls) trained on Visual Genome.
- Pretrained Swin-S Model with transformers (for SGCls) trained on Visual Genome.
Then follow these steps:
- Create the folder outputs inside the main openpifpaf directory (if necessary)
- Place the downloaded models inside the outputs folder
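For example (a sketch, assuming the checkpoints were downloaded to ~/Downloads; the filename shown is the ResNet-50 checkpoint used in the evaluation commands below):
mkdir -p outputs
mv ~/Downloads/resnet50-211112-230017-478190-vg_cn-rsmooth1.0.pkl.epoch060 outputs/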
These models will produce the results reported in the paper.
Evaluation
To evaluate the model on Visual Genome, go to the main openpifpaf directory.
To evaluate the ResNet-50 model with transformers:
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/resnet50-211112-230017-478190-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--vg-cn-single-head --cf3-deform-use-transformer
To evaluate the ResNet-50 model without transformers:
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/resnet50-211112-222522-724409-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--cf3-deform-deep4-head --cf3-deform-bn --cntrnet-deform-bn --cntrnet-deform-deep4-head
To evaluate the Swin-S model with transformers:
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/swin_s-211113-181213-431757-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 --swin-use-fpn \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--vg-cn-single-head --cf3-deform-use-transformer
To evaluate the Swin-S model without transformers:
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/swin_s-211113-040932-535687-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 --swin-use-fpn \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--cf3-deform-deep4-head --cf3-deform-bn --cntrnet-deform-bn --cntrnet-deform-deep4-head
To evaluate the ResNet-50 model with transformers for PredCls (GT detection tokens):
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/resnet50-220130-123437-510938-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--cf3-deform-use-transformer --vg-cn-single-head --cntrnet-deform-prior-token predcls --cf3-deform-prior-token prior_vect_detcls --vg-cn-prior-token predcls_raf \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To evaluate the ResNet-50 model with transformers for SGCls (GT detection tokens):
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/resnet50-220130-123433-862994-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--cf3-deform-use-transformer --vg-cn-single-head --cntrnet-deform-prior-token sgcls --cf3-deform-prior-token prior_vect_det --vg-cn-prior-token predcls_raf \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To evaluate the Swin-S model with transformers for PredCls (GT detection tokens):
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/swin_s-220129-225026-232917-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 --swin-use-fpn \
--dataset vg --decoder cifdetraf_cn --vg-cn-group-deform --vg-cn-single-supervision \
--vg-cn-single-head --cf3-deform-use-transformer --cntrnet-deform-prior-token predcls --cf3-deform-prior-token prior_vect_detcls --vg-cn-prior-token predcls_raf \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --run-metric --vg-cn-use-512 --vg-cn-pairwise
To evaluate the Swin-S model with transformers for SGCls (GT detection tokens):
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/swin_s-220130-054019-006800-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 --swin-use-fpn \
--dataset vg --decoder cifdetraf_cn --vg-cn-group-deform --vg-cn-single-supervision \
--vg-cn-single-head --cf3-deform-use-transformer --cntrnet-deform-prior-token sgcls --cf3-deform-prior-token prior_vect_det --vg-cn-prior-token predcls_raf \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --run-metric --vg-cn-use-512 --vg-cn-pairwise
Results
Our model achieves the following performance on Visual Genome: