Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction
Zacharias Anastasakis, Dimitrios Mallis, Markos Diomataris, George Alexandridis, Stefanos Kollias, Vassilis Pitsikalis
We propose Masked Bounding Box Reconstruction, a variation of Masked Image Modeling where a percentage of the entities/objects within a scene are masked and subsequently reconstructed based on the unmasked objects. Through object-level masked modeling, our proposed network learns context-aware representations that capture the interaction of objects within a scene and are highly predictive of visual object relationships.
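To make the objective concrete, below is a minimal PyTorch sketch of object-level masked reconstruction. All names and hyperparameters (MaskedBoxReconstructor, feat_dim, mask_ratio, the smooth-L1 reconstruction loss) are illustrative assumptions, not the repository's actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedBoxReconstructor(nn.Module):
    # Hypothetical sketch; the actual network and loss may differ.
    def __init__(self, feat_dim=512, d_model=256, nhead=4, num_layers=3, mask_ratio=0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(feat_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, feat_dim)  # reconstruction head

    def forward(self, obj_feats):
        # obj_feats: (B, N, feat_dim) per-object features of a scene
        B, N, _ = obj_feats.shape
        tokens = self.embed(obj_feats)
        # mask a fixed fraction of the objects in each scene
        num_mask = max(1, int(self.mask_ratio * N))
        mask = torch.zeros(B, N, dtype=torch.bool, device=obj_feats.device)
        for b in range(B):
            mask[b, torch.randperm(N, device=obj_feats.device)[:num_mask]] = True
        # swap masked objects for a learnable mask token
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, -1), tokens)
        # reconstruct masked objects from the unmasked context
        recon = self.head(self.encoder(tokens))
        return F.smooth_l1_loss(recon[mask], obj_feats[mask])

The key difference from pixel-level Masked Image Modeling is that masking operates on whole object tokens, so reconstruction must rely on the context provided by the remaining objects in the scene.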
This repository contains the code for reproducing our IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024 paper and builds on the grounding-consistent-vrd repository. You can find our paper here.
Environment Setup
After cloning this repository, you can set up a conda environment using the mbbr.yml config file:
conda env create -f mbbr.yml
conda activate mbbr
Dataset Setup
You can download the VRD and/or VG200 dataset by running the main_prerequisites.py script, passing the dataset name as an argument:
python3 main_prerequisites.py VG200
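For example, to fetch the VRD dataset instead:
python3 main_prerequisites.py VRD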
Train
Training involves 2 steps:
- Pre-train a transformer network in a self-supervised manner through Masked Bounding Box Reconstruction (MBBR)
python3 main_research.py --model=MBBR --net_name=MBBRNetwork --projection_head --dataset=VG200 --pretrain_arch=encoder
- Train an MLP network in a few-shot setting on random samples, using the pre-trained network from the previous step:
python3 main_research.py --model=SSL_finetune --net_name=FinetunedNetwork --dataset=VG200 --pretrain_arch=encoder --random_few_shot=10 --random_seed=4 --pretrained_model=MBBRNetwork --projection_head --normal --pretrain_task=reconstruction
The above command trains a 2-layer MLP network on 10 random samples from the VG200 dataset. In our work, we also manually selected {1,2,5} accurate relationships per predicate category and used them to train our classifier. These relationships are provided in the prerequisites/{VG200/VRD}_few_shot_dict.json files. You can train a classifier on these manually selected samples by running the following command:
python3 main_research.py --model=SSL_finetune --net_name=FinetunedNetwork --dataset=VG200 --pretrain_arch=encoder --few_shot=5 --pretrained_model=MBBRNetwork --projection_head --normal --pretrain_task=reconstruction
Test
After training, testing is performed automatically: micro/macro Recall@[20, 50, 100] is printed for both the constrained and unconstrained scenarios, and zero-shot results are computed as well.
Checkpointing is performed, so re-running step 2 for an already-trained model will simply perform testing.
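For context, micro Recall@K over a relationship-detection test set can be computed along these lines. This is an illustrative sketch, not the repository's evaluation code; triplets are assumed to be hashable (subject, predicate, object) tuples:

def micro_recall_at_k(predictions, ground_truth, k_values=(20, 50, 100)):
    # predictions: per-image list of (triplet, score) pairs
    # ground_truth: per-image set of annotated triplets
    results = {}
    for k in k_values:
        hits = total = 0
        for preds, gt in zip(predictions, ground_truth):
            top_k = {t for t, _ in sorted(preds, key=lambda p: p[1], reverse=True)[:k]}
            hits += len(top_k & gt)
            total += len(gt)
        results[k] = hits / max(total, 1)  # micro: pool hits over all images
    return results

Macro Recall@K would instead average the per-image recall values rather than pooling hits across the whole test set.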
Citation
If you use this code in your work or experiments, please cite our paper:
@INPROCEEDINGS{Anastasakis_WACV_2024,
  author={Anastasakis, Zacharias and Mallis, Dimitrios and Diomataris, Markos and Alexandridis, George and Kollias, Stefanos and Pitsikalis, Vassilis},
  booktitle={2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  title={Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction},
  year={2024},
  pages={1195-1204},
  doi={10.1109/WACV57701.2024.00124}
}