Composite Relationship Fields with Transformers for Scene Graph Generation
This codebase is the official implementation of "Composite Relationship Fields with Transformers for Scene Graph Generation" (accepted at WACV 2023).
Scene graph generation (SGG) methods aim to extract a structured semantic representation of a scene by detecting the objects present and their relationships. While most methods focus on improving top-down approaches, which build a scene graph based on predicted objects from an off-the-shelf object detector, there is a limited amount of work on bottom-up approaches, which directly predict objects and their relationships in a single stage.
In this work, we present a novel bottom-up SGG approach by representing relationships using Composite Relationship Fields (CoRF). CoRF turns relationship detection into a dense regression and classification task, where each cell of the output feature map identifies surrounding objects and their relationships. Furthermore, we propose a refinement head that leverages Transformers for global scene reasoning, resulting in more meaningful relationship predictions. By combining both contributions, our method outperforms previous bottom-up methods on the Visual Genome dataset by 26% while preserving real-time performance.
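As a toy illustration of the dense formulation described above, the sketch below reads relationship candidates out of a small grid. It assumes each cell carries one confidence per predicate class plus two regressed offsets that point to the subject and object centers. The field layout, the function name `decode_relation_cells`, and the threshold are assumptions made for illustration only; this is not the repository's decoder.

```python
def decode_relation_cells(conf, subj_offset, obj_offset, threshold=0.5):
    """Turn dense per-cell predictions into relationship candidates.

    Hypothetical layout (nested lists, no external dependencies):
      conf[p][y][x]        -> confidence of predicate class p at cell (x, y)
      subj_offset[p][y][x] -> (dx, dy) from the cell to the subject center
      obj_offset[p][y][x]  -> (dx, dy) from the cell to the object center
    Returns (predicate, score, subject_xy, object_xy) tuples, best first.
    """
    candidates = []
    for pred_id, plane in enumerate(conf):
        for y, row in enumerate(plane):
            for x, score in enumerate(row):
                if score < threshold:
                    continue  # classification part: skip low-confidence cells
                sdx, sdy = subj_offset[pred_id][y][x]
                odx, ody = obj_offset[pred_id][y][x]
                # regression part: the cell location plus the regressed offsets
                candidates.append(
                    (pred_id, score, (x + sdx, y + sdy), (x + odx, y + ody))
                )
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates


# One predicate class on a 2x2 grid (made-up values).
zero = (0.0, 0.0)
conf = [[[0.1, 0.9], [0.2, 0.3]]]
subj = [[[zero, (-1.0, 0.0)], [zero, zero]]]
obj = [[[zero, (0.0, 1.0)], [zero, zero]]]
candidates = decode_relation_cells(conf, subj, obj)
```

Only the cell at `(x=1, y=0)` passes the threshold here, so a single candidate is returned, with its subject and object positions obtained by adding the two offsets to the cell location.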
This codebase is based on the publicly-available repository openpifpaf/openpifpaf. We modify certain files from OpenPifPaf and add other parts as plugins. We also include a modified version of apex that Scene-Graph-Benchmark.pytorch relies on for evaluation. The main dependencies of this codebase are:
- Python 3.8.5
- Apex
- Openpifpaf
- Scene-Graph-Benchmark.pytorch
Before installing the requirements, we recommend creating a virtual environment in which all packages will be installed.
First, make sure that the main folder (SGG-CoRF) contains the openpifpaf and apex folders. Activate the virtual environment (optional).
Then, install the requirements:
pip install numpy Cython
cd openpifpaf
pip install --editable '.[dev,train,test]'
pip install tqdm h5py graphviz ninja yacs cython matplotlib opencv-python overrides timm
# Make sure to re-install the correct pytorch version for your GPU from
# install apex
cd ../apex
python install --cuda_ext --cpp_ext
# install PyTorch Detection
cd ../
git clone
cd Scene-Graph-Benchmark.pytorch
python build develop
cd ../openpifpaf
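After running the installation steps above, a quick sanity check can confirm that the packages are importable from the active environment. The sketch below is a minimal helper, not part of the codebase; the import names in `EXPECTED_MODULES` are assumptions (for example, `opencv-python` is normally imported as `cv2`, and Scene-Graph-Benchmark.pytorch is assumed to install as `maskrcnn_benchmark`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Assumed import names for the dependencies listed in this README.
EXPECTED_MODULES = [
    "numpy", "torch", "h5py", "yacs", "cv2", "timm",
    "apex", "openpifpaf", "maskrcnn_benchmark",
]

if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    print("missing modules:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the modules can be located; it does not verify that the CUDA extensions of apex were built correctly.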
Note: if training or evaluation crashes with an error related to torch_six.PY3, follow these steps:
cd Scene-Graph-Benchmark.pytorch
vim maskrcnn_benchmark/utils/
# change the line torch_six.PY3 to torch_six.PY37
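The same edit can be scripted instead of done by hand in vim. The sketch below is one possible way to do it, assuming you pass the path of the affected file yourself (the file name is not repeated here); the helper names `patch_torch_six` and `patch_file` are hypothetical. The word boundary in the pattern keeps the replacement from being applied twice.

```python
import re
from pathlib import Path

# Match torch_six.PY3 but not an already patched torch_six.PY37.
_PY3_PATTERN = re.compile(r"torch_six\.PY3\b")


def patch_torch_six(source):
    """Return the source text with torch_six.PY3 replaced by torch_six.PY37."""
    return _PY3_PATTERN.sub("torch_six.PY37", source)


def patch_file(path):
    """Patch a file in place; return True if the file was changed."""
    path = Path(path)
    original = path.read_text()
    patched = patch_torch_six(original)
    if patched == original:
        return False
    path.write_text(patched)
    return True
```

Call `patch_file(...)` with the file under `maskrcnn_benchmark/utils/` that triggers the error. Running it a second time leaves the file untouched.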
To perform the following steps, make sure to be in the main openpifpaf directory (SGG-CoRF/openpifpaf).
In order to train the model, the dataset needs to be downloaded and pre-processed:
- Create a folder called
and inside it a folder called visual_genome
- Download images from Visual Genome (parts 1 and 2)
- Place all images into
- Create VG-SGG.h5, imdb_512.h5, imdb_1024.h5, VG-SGG-dicts.json by following here and place them in
. To create imdb_512.h5, you will need to change the 1024 to 512
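Before starting a long training run, it can help to confirm that the pre-processed files listed above are in place. The sketch below is a minimal check under the assumption that you pass the dataset directory explicitly; the helper name `missing_dataset_files` is hypothetical and not part of the codebase.

```python
from pathlib import Path

# File names taken from the dataset preparation steps above.
REQUIRED_FILES = (
    "VG-SGG.h5",
    "imdb_512.h5",
    "imdb_1024.h5",
    "VG-SGG-dicts.json",
)


def missing_dataset_files(dataset_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from dataset_dir."""
    dataset_dir = Path(dataset_dir)
    return [name for name in required if not (dataset_dir / name).is_file()]
```

An empty list means all four files were found. The check looks at file names only and does not validate their contents.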
To train the model(s) in the paper, run these commands from the main openpifpaf directory:
To train a ResNet-50 model with the transformer modules:
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --basenet=resnet50 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head
To train a Swin-S model with the transformer module:
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --swin-use-fpn --basenet=swin_s \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head
To train a ResNet-50 model without the transformer modules:
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --basenet=resnet50 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-deep4-head --cf3-deform-bn --cntrnet-deform-bn --cntrnet-deform-deep4-head --adamw
To train a Swin-S model without the transformer modules:
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --swin-use-fpn --basenet=swin_s \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-deep4-head --cf3-deform-bn --cntrnet-deform-bn --cntrnet-deform-deep4-head --adamw
To train a ResNet-50 model with the transformer modules and GT detection tokens as input (PredCls):
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --basenet=resnet50 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head --cntrnet-deform-prior-token predcls --cf3-deform-prior-token prior_vect_detcls \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To train a ResNet-50 model with the transformer modules and GT detection tokens as input (SGCls):
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --basenet=resnet50 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head --cntrnet-deform-prior-token sgcls --cf3-deform-prior-token prior_vect_det \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To train a Swin-S model with the transformer modules and GT detection tokens as input (PredCls):
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --swin-use-fpn --basenet=swin_s \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head --cntrnet-deform-prior-token predcls --cf3-deform-prior-token prior_vect_detcls \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To train a Swin-S model with the transformer modules and GT detection tokens as input (SGCls):
python -m openpifpaf.train --lr=1e-4 --lr-basenet=1e-5 --b-scale=10.0 \
--epochs=60 --lr-decay 40 50 \
--batch-size=40 --weight-decay=1e-5 --swin-use-fpn --basenet=swin_s \
--vg-cn-upsample 1 --dataset vg --vg-cn-square-edge 512 --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision \
--cf3-deform-use-transformer --adamw --vg-cn-single-head --cntrnet-deform-prior-token sgcls --cf3-deform-prior-token prior_vect_det \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
Note: to perform distributed training on multiple GPUs, as mentioned in the paper, add the argument --ddp after openpifpaf.train in the commands above.
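The training commands above differ only in a few groups of flags: backbone, with or without transformer modules, GT-token mode, and `--ddp`. The sketch below shows one way to compose them programmatically. It is a convenience wrapper written for this README, not part of the codebase; the function name `train_command` and the grouping into constants are assumptions, while every flag is copied from the commands above.

```python
COMMON = [
    "--lr=1e-4", "--lr-basenet=1e-5", "--b-scale=10.0",
    "--epochs=60", "--lr-decay", "40", "50",
    "--batch-size=40", "--weight-decay=1e-5",
]
BACKBONE = {
    "resnet50": ["--basenet=resnet50",
                 "--resnet-pool0-stride=2", "--resnet-block5-dilation=2"],
    "swin_s": ["--swin-use-fpn", "--basenet=swin_s"],
}
DATASET = [
    "--vg-cn-upsample", "1", "--dataset", "vg",
    "--vg-cn-square-edge", "512", "--vg-cn-use-512",
    "--vg-cn-group-deform", "--vg-cn-single-supervision",
]
WITH_TRANSFORMER = ["--cf3-deform-use-transformer", "--adamw",
                    "--vg-cn-single-head"]
WITHOUT_TRANSFORMER = ["--cf3-deform-deep4-head", "--cf3-deform-bn",
                       "--cntrnet-deform-bn", "--cntrnet-deform-deep4-head",
                       "--adamw"]
PRIOR_TOKEN = {
    "predcls": ["--cntrnet-deform-prior-token", "predcls",
                "--cf3-deform-prior-token", "prior_vect_detcls"],
    "sgcls": ["--cntrnet-deform-prior-token", "sgcls",
              "--cf3-deform-prior-token", "prior_vect_det"],
}
PRIOR_COMMON = ["--cntrnet-deform-prior-offset", "rel_offset",
                "--cf3-deform-prior-offset", "rel_offset", "--vg-cn-pairwise"]


def train_command(backbone, transformer=True, prior=None, ddp=False):
    """Build the argument list for one of the training variants listed above."""
    if prior is not None and not transformer:
        # The README lists GT-token variants only with transformer modules.
        raise ValueError("prior tokens require transformer=True")
    cmd = ["python", "-m", "openpifpaf.train"]
    if ddp:
        cmd.append("--ddp")  # placed right after openpifpaf.train
    cmd += COMMON + BACKBONE[backbone] + DATASET
    cmd += WITH_TRANSFORMER if transformer else WITHOUT_TRANSFORMER
    if prior is not None:
        cmd += PRIOR_TOKEN[prior] + PRIOR_COMMON
    return cmd
```

For example, `" ".join(train_command("swin_s", prior="sgcls", ddp=True))` yields the Swin-S SGCls command with `--ddp` inserted, and the list form can be passed directly to `subprocess.run`.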
Pre-trained Models
You can download the pretrained models from here:
- Pretrained ResNet-50 Model with transformers trained on Visual Genome.
- Pretrained ResNet-50 Model without transformers trained on Visual Genome.
- Pretrained Swin-S Model with transformers trained on Visual Genome.
- Pretrained Swin-S Model without transformers trained on Visual Genome.
Models with GT detection tokens as input:
- Pretrained ResNet-50 Model with transformers (for PredCls) trained on Visual Genome.
- Pretrained ResNet-50 Model with transformers (for SGCls) trained on Visual Genome.
- Pretrained Swin-S Model with transformers (for PredCls) trained on Visual Genome.
- Pretrained Swin-S Model with transformers (for SGCls) trained on Visual Genome.
Then follow these steps:
- Create the folder
inside the main openpifpaf directory (if necessary)
- Place the downloaded models inside the
These models will produce the results reported in the paper.
To evaluate the model on Visual Genome, go to the main openpifpaf directory.
To evaluate the ResNet-50 model with transformers:
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/resnet50-211112-230017-478190-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--vg-cn-single-head --cf3-deform-use-transformer
To evaluate the ResNet-50 model without transformers:
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/resnet50-211112-222522-724409-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--cf3-deform-deep4-head --cf3-deform-bn --cntrnet-deform-bn --cntrnet-deform-deep4-head
To evaluate the Swin-S model with transformers:
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/swin_s-211113-181213-431757-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 --swin-use-fpn \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--vg-cn-single-head --cf3-deform-use-transformer
To evaluate the Swin-S model without transformers:
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/swin_s-211113-040932-535687-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 --swin-use-fpn \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--cf3-deform-deep4-head --cf3-deform-bn --cntrnet-deform-bn --cntrnet-deform-deep4-head
To evaluate the ResNet-50 model with transformers for PredCls (GT detection tokens):
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/resnet50-220130-123437-510938-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--cf3-deform-use-transformer --vg-cn-single-head --cntrnet-deform-prior-token predcls --cf3-deform-prior-token prior_vect_detcls --vg-cn-prior-token predcls_raf \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To evaluate the ResNet-50 model with transformers for SGCls (GT detection tokens):
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/resnet50-220130-123433-862994-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 \
--resnet-pool0-stride=2 --resnet-block5-dilation=2 \
--dataset vg --decoder cifdetraf_cn --vg-cn-use-512 --vg-cn-group-deform --vg-cn-single-supervision --run-metric \
--cf3-deform-use-transformer --vg-cn-single-head --cntrnet-deform-prior-token sgcls --cf3-deform-prior-token prior_vect_det --vg-cn-prior-token predcls_raf \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --vg-cn-pairwise
To evaluate the Swin-S model with transformers for PredCls (GT detection tokens):
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/swin_s-220129-225026-232917-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 --swin-use-fpn \
--dataset vg --decoder cifdetraf_cn --vg-cn-group-deform --vg-cn-single-supervision \
--vg-cn-single-head --cf3-deform-use-transformer --cntrnet-deform-prior-token predcls --cf3-deform-prior-token prior_vect_detcls --vg-cn-prior-token predcls_raf \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --run-metric --vg-cn-use-512 --vg-cn-pairwise
To evaluate the Swin-S model with transformers for SGCls (GT detection tokens):
python3 -m openpifpaf.eval_cn \
--checkpoint outputs/izar_outputs/swin_s-220130-054019-006800-vg_cn-rsmooth1.0.pkl.epoch060 \
--loader-workers=2 --swin-use-fpn \
--dataset vg --decoder cifdetraf_cn --vg-cn-group-deform --vg-cn-single-supervision \
--vg-cn-single-head --cf3-deform-use-transformer --cntrnet-deform-prior-token sgcls --cf3-deform-prior-token prior_vect_det --vg-cn-prior-token predcls_raf \
--cntrnet-deform-prior-offset rel_offset --cf3-deform-prior-offset rel_offset --run-metric --vg-cn-use-512 --vg-cn-pairwise
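In the evaluation commands above, the backbone-specific flags follow from the checkpoint name: ResNet-50 checkpoints are paired with the `--resnet-*` flags and Swin-S checkpoints with `--swin-use-fpn`. The sketch below infers the backbone and epoch from a checkpoint path. It assumes the naming pattern seen in the file names above (backbone prefix, `.epochNNN` suffix); the helper name `describe_checkpoint` is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Backbone-specific evaluation flags, copied from the commands above.
BACKBONE_FLAGS = {
    "resnet50": ["--resnet-pool0-stride=2", "--resnet-block5-dilation=2"],
    "swin_s": ["--swin-use-fpn"],
}
_EPOCH_SUFFIX = re.compile(r"\.epoch(\d+)$")


def describe_checkpoint(path):
    """Return (backbone, epoch, backbone_flags) inferred from a checkpoint path."""
    name = PurePosixPath(path).name
    backbone = next(
        (b for b in BACKBONE_FLAGS if name.startswith(b + "-")), None
    )
    if backbone is None:
        raise ValueError(f"unrecognized backbone in checkpoint name: {name}")
    match = _EPOCH_SUFFIX.search(name)
    epoch = int(match.group(1)) if match else None  # None if no epoch suffix
    return backbone, epoch, BACKBONE_FLAGS[backbone]
```

This only covers the backbone flags. The head and prior-token flags still depend on how the model was trained and must match the corresponding training command.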
Our model achieves the following performance on Visual Genome: