# M-DGT: Multi-Modal Dynamic Graph Transformer for Visual Grounding (CVPR 2022)
This repository contains the source code for our CVPR 2022 paper, which introduces the Multi-Modal Dynamic Graph Transformer (M-DGT), a framework for progressive learning on visual grounding.
## Introduction
The Multi-Modal Dynamic Graph Transformer (M-DGT) framework builds on the idea of modeling visual grounding as a graph transformation. It consists of two crucial components: the multi-modal node transformer and the graph transformer. Treating the anchors in an image as nodes, the multi-modal node transformer first constructs a graph based on the spatial positions of the anchors, and then iteratively produces 2D transformation coefficients that shift these positions toward the ground-truth regions. During this process, the graph transformer optimizes the structure of the graph by removing nodes and unnecessary edges to encourage efficient learning. The whole framework can therefore be regarded as generating a series of dynamic graphs that gradually shrink to the target regions. The performance of M-DGT is measured on two tasks: visual grounding and phrase grounding. The ReferItGame and RefCOCO datasets are used to evaluate M-DGT on visual grounding, while the Flickr30K Entities dataset is used to test M-DGT on phrase grounding.
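To make the idea concrete, below is a minimal PyTorch sketch of a single graph-transformation step. This is not the actual M-DGT code: the function names, the `(cx, cy, w, h)` box parameterization, and the score-based pruning rule are all illustrative assumptions.

```python
import torch

def apply_transform(boxes: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """Shift and rescale anchor boxes (cx, cy, w, h) with predicted
    2D transformation coefficients (dx, dy, dw, dh)."""
    cx = boxes[:, 0] + coeffs[:, 0] * boxes[:, 2]
    cy = boxes[:, 1] + coeffs[:, 1] * boxes[:, 3]
    w = boxes[:, 2] * torch.exp(coeffs[:, 2])
    h = boxes[:, 3] * torch.exp(coeffs[:, 3])
    return torch.stack([cx, cy, w, h], dim=1)

def prune_graph(boxes: torch.Tensor, scores: torch.Tensor, threshold: float = 0.5):
    """Drop graph nodes (and implicitly their edges) whose matching score
    against the text query falls below the threshold."""
    keep = scores > threshold
    return boxes[keep], scores[keep]

# One step of the dynamic graph: regress all anchor boxes toward the
# target region, then shrink the graph by removing low-scoring nodes.
boxes = torch.tensor([[50.0, 50.0, 40.0, 40.0],
                      [120.0, 80.0, 60.0, 30.0]])
coeffs = torch.tensor([[0.1, -0.1, 0.05, 0.05],
                       [0.0, 0.0, -0.2, -0.2]])
scores = torch.tensor([0.9, 0.3])
boxes, scores = prune_graph(apply_transform(boxes, coeffs), scores)
```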
## Getting Started
Please install the necessary Python packages before running the code. The main dependencies are PyTorch, torchvision, mmcv, OpenCV, and albumentations.
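For reference, a typical pip installation of these dependencies could look like the line below (the exact package names and versions are assumptions; OpenCV is published on PyPI as `opencv-python`, and `mmcv` may need a build matching your PyTorch/CUDA setup):

```
cvpr@cvpr22:~$ pip install torch torchvision mmcv opencv-python albumentations
```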
Then, download the three datasets described below and set their paths in `common_flags.py`.
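For illustration, the path settings inside `common_flags.py` might look like the following; the variable names here are hypothetical and the real flags in the file may differ:

```python
# Illustrative only: the real flag names in common_flags.py may differ.
REFERITGAME_ROOT = "/path/to/ReferItGame"        # images + referring expressions
FLICKR30KE_ROOT = "/path/to/Flickr30K_Entities"  # images + phrase annotations
REFCOCO_ROOT = "/path/to/RefCOCO"                # RefCOCO/RefCOCO+/RefCOCOg data
```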
## Datasets
This paper uses the following three standard datasets:
- ReferItGame dataset. See the data provider source code in `datasets/ReferItGame_provider`.
- Flickr30K Entities dataset. See the data provider source code in `datasets/F30KE_provider`.
- RefCOCO dataset. See the data provider source code in `datasets/ReferItGame_provider`.
## Structure
Folder structure and the function of each part of M-DGT:

```
.
├── datasets            # Three datasets used in the experiments
├── experiments         # Directory used to save the models and logging files
├── learning            # Training, evaluation, and loss components
├── models              # Components and main structures of M-DGT
├── preprocess          # Functions used to preprocess the original images
├── visualization       # Visualization part of M-DGT
├── common_flags.py     # Sets common paths for the datasets
├── f30k_eval.py        # Evaluates the model on the val/test sets of the Flickr30K Entities dataset
├── f30k_train.py       # Trains the model on the train set of the Flickr30K Entities dataset
├── refcoco_eval.py     # Evaluates the model on the val/test sets of the ReferItGame/RefCOCO datasets
├── refcoco_train.py    # Trains the model on the train sets of the ReferItGame/RefCOCO datasets
└── README.md
```
## Operation
After setting up the environment and the three datasets, the model can be trained and evaluated directly by running the commands below.
### Operations on the Flickr30K Entities dataset

Train the model:

```
cvpr@cvpr22:~$ python f30k_train.py
```

Evaluate the model (set `phase` to switch between the test and val splits):

```
cvpr@cvpr22:~$ python f30k_eval.py
```
### Operations on the RefCOCO/ReferItGame datasets

Train the model (set `data_name` and `split_type` to switch between the datasets):

```
cvpr@cvpr22:~$ python refcoco_train.py
```

Evaluate the model (set `phase` to switch between the test and val splits):

```
cvpr@cvpr22:~$ python refcoco_eval.py
```
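If `data_name` and `split_type` are exposed as command-line flags (an assumption; they may instead be constants edited inside `common_flags.py` or the scripts themselves), switching datasets could look like:

```
cvpr@cvpr22:~$ python refcoco_train.py --data_name refcoco+ --split_type unc
```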
## Performance
### Flickr30K Entities

| Method | Top-1 accuracy (%) |
|---|---|
| SOTA (Learning Cross-Modal Context Graph) | 76.74 |
| M-DGT | 79.97 |
### RefCOCO

All numbers are accuracy (%).

| Method | RefCOCO (Val / TestA / TestB) | RefCOCO+ (Val / TestA / TestB) | RefCOCOg (Val / Test) |
|---|---|---|---|
| SOTA | 82.00 / 81.20 / 84.00 | 66.60 / 67.60 / 65.50 | 75.73 / 75.31 |
| M-DGT | 85.37 / 84.82 / 87.11 | 70.02 / 72.26 / 68.92 | 79.21 / 79.06 |
## Acknowledgements
Our implementation refers to the source code from the following repositories and users: