TensorFlow Implementation of DBNet

This repository is the TensorFlow implementation of DBNet, a method for localizing and detecting visual entities with natural language queries. DBNet is proposed in the following paper:

Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries, <br> Yuting Zhang, Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, Honglak Lee <br> In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. spotlight

Remarks:

How to clone this repository

This Git repository has submodules; please use the following command to clone it.

git clone --recursive https://github.com/yuanluya/nldet_TensorFlow

If you have cloned the repository without the --recursive flag, you can run git submodule update --init --recursive in your local repository folder.

The evaluation submodule requires additional setup steps. Please refer to [./nlvd_evaluation/README.md](https://github.com/YutingZhang/nlvd_evaluation)

Detection examples

Here are two detection examples:

<img src='examples/2347567.jpg' width=600 height=500> <img src='examples/2405273.jpg' width=600 height=450>

Introduction to DBNet

DBNet is a two-pathway deep neural network framework. It uses two separate pathways to extract visual and linguistic features, and a discriminative network to compute the matching score between an image region and a text phrase. DBNet is trained with a classification objective that makes extensive use of negative samples. The training objective encourages better localization on single images, incorporates a broad range of text phrases, and properly pairs image regions with text phrases into positive and negative examples.
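
Below is a minimal conceptual sketch of the two-pathway scoring idea. It is not the implementation in this repository: the layer types, feature dimensions, and Keras-style API here are illustrative assumptions only.

```python
# Conceptual sketch (assumptions, not the repository's code): two pathways
# produce region and phrase features, and a small discriminative network
# scores each (region, phrase) pair. Dimensions are made up for illustration.
import tensorflow as tf

region_feat = tf.keras.Input(shape=(4096,))   # visual pathway output (e.g. a region feature)
phrase_feat = tf.keras.Input(shape=(1024,))   # text pathway output (e.g. an encoded phrase)

# Discriminative pathway: fuse both modalities and predict a matching score.
joint = tf.keras.layers.Concatenate()([region_feat, phrase_feat])
hidden = tf.keras.layers.Dense(1024, activation='relu')(joint)
score = tf.keras.layers.Dense(1, activation='sigmoid')(hidden)  # probability that the phrase matches the region

matcher = tf.keras.Model(inputs=[region_feat, phrase_feat], outputs=score)
# Training would use binary cross-entropy over positive and (many) negative
# region-phrase pairs, mirroring the classification objective described above.
matcher.compile(optimizer='adam', loss='binary_crossentropy')
```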

For more details about DBNet, please refer to the paper.

Prerequisites

If you have admin/root access to your workstation, you can remove --user and use sudo to install the required packages into the system folders.
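
For illustration, the two installation styles mentioned above look like this (PACKAGE is a placeholder for whichever Python dependency you are installing; the actual package list is not reproduced here):

pip3 install --user PACKAGE

sudo pip3 install PACKAGE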

What is included

Data to download

Pretrained Models

The pretrained models can be obtained via this link. This model was trained from scratch according to the training procedure described later in this README, and it slightly outperforms the model used in the paper. Its evaluation results are summarized as follows.

Localization:

| IoU Threshold | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
|---|---|---|---|---|---|---|---|
| Recall (%) | 56.6 | 47.8 | 40.1 | 32.4 | 25.0 | 17.6 | 10.7 |

| Top Overlap Median | Top Overlap Mean |
|---|---|
| 0.174 | 0.270 |

Detection (gAP / mAP; the three tables below presumably correspond to the three test levels used in the paper):

| Threshold | gAP | mAP |
|---|---|---|
| 0.3 | 25.3 | 49.8 |
| 0.5 | 12.3 | 31.4 |
| 0.7 | 2.6 | 12.4 |

| Threshold | gAP | mAP |
|---|---|---|
| 0.3 | 22.8 | 46.7 |
| 0.5 | 11.2 | 29.7 |
| 0.7 | 2.4 | 12.0 |

| Threshold | gAP | mAP |
|---|---|---|
| 0.3 | 9.6 | 28.4 |
| 0.5 | 5.0 | 19.0 |
| 0.7 | 1.2 | 8.2 |

Code Overview

Usage

You can use python3 main.py to run our code with the default configuration; see config.py for detailed configuration definitions. You can override the defaults in config.py by passing the corresponding arguments to main.py (see the examples later in this section).
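
For instance, the following command overrides several defaults using flags that appear elsewhere in this README (this particular combination is only an illustration):

python3 main.py --MODE test --IMAGE_MODEL resnet101 --LEVEL level_0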

Detecting and Visualizing on Sample Images

Demo on Images from Visual Genome (Quick Demo)

You can run a quick demo on Visual Genome images with a user-specified query.
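
The exact demo options are defined in config.py; as a purely hypothetical invocation (the flag names below are assumptions and should be verified against config.py), it would look similar to the other commands in this README:

# hypothetical flag names -- verify against config.py
python3 main.py --MODE demo --QUERY "a person riding a bike"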

Demo on Other Images

To perform detection on non-Visual-Genome images, an external region proposal method is needed. Our code supports EdgeBox. You can download the EdgeBox Python interface to the repository root and run our code. Please make sure that ENV_PATHS.EDGE_BOX_RPN points to the location of edge_boxes.py. The test procedure is the same as testing on Visual Genome images, except that you will need to list the test images in the JSON file using absolute paths rather than image IDs.
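
For illustration only (the exact JSON schema expected by the test code is an assumption here and should be checked against config.py and the test scripts), an image list using absolute paths could be produced like this:

```python
# Hypothetical image list for non-Visual-Genome testing: absolute file paths
# instead of Visual Genome image ids. The flat-list schema is an assumption.
import json

image_list = [
    "/home/user/images/street_scene.jpg",
    "/home/user/images/living_room.jpg",
]

with open("my_test_images.json", "w") as f:
    json.dump(image_list, f, indent=2)
```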

Training DBNet

  1. Download images from the Visual Genome website and our spell-checked text annotations.
  2. Change config.py according to your data paths.
  3. Either download our trained model for fine-tuning or perform training from scratch.
  4. To fine-tune a pretrained model, please download the two .npy files (one for the image pathway and the other for the text pathway) and make sure config.py has the correct paths to them.

Training from Scratch

To train from scratch, we recommend using a Faster R-CNN model to initialize the image pathway and randomly initializing the text pathway with our default parameters. After that, the DBNet model can be trained in three phases.

python3 main.py --PHASE phase1 --text_lr 1e-4 --image_lr_conv 0 --image_lr_region 0 --IMAGE_FINE_TUNE_MODEL frcnn_Region_Feat_Net.npy --TEXT_FINE_TUNE_MODEL XXX.npy --MAX_ITERS 50000

python3 main.py --PHASE phase2 --text_lr 1e-4 --image_lr_conv 1e-3 --image_lr_region 1e-3 --INIT_SNAPSHOT phase1 --INIT_ITER 50000 --MAX_ITERS 150000

python3 main.py --PHASE phase3 --INIT_SNAPSHOT phase2 --INIT_ITER 200000 --MAX_ITERS 100000

Model snapshots will be saved every --SAVE_ITERS iterations to --SNAPSHOT_DIR. We name the snapshots nldet_[PHASE]_[ITER]; for example, the phase-1 snapshot at iteration 50000 would be named nldet_phase1_50000.

Benchmarking on Visual Genome

To test with the pretrained model, you can place the .npy files in the default directory and run python3 main.py --MODE test. To test TensorFlow models trained from scratch, please set the --INIT_SNAPSHOT and --INIT_ITER flags accordingly.

The detection results will be saved in a subfolder tmp_output under the directory nlvd_evaluation/results/vg_v1/dbnet_[IMAGE MODEL]/ in the nlvd_evaluation submodule. IMAGE MODEL refers to the model used in the image pathway and can be set with the --IMAGE_MODEL flag (see config.py). By default, --IMAGE_MODEL is set to vgg16; our code also supports resnet101. These temporary results will be merged together and saved in a .txt file, which can be used directly by our evaluation code. As long as the results in tmp_output are saved, the testing process can be resumed at any time. Change the --LEVEL flag to perform the three-level tests described in the paper.

python3 main.py --MODE test --LEVEL level_0 --INIT_SNAPSHOT phase3 --INIT_ITER 300000

Evaluation

The evaluation and dataset development code is cloned from the nlvd_evaluation repository as a submodule of this code. You can refer to this page for more detailed instructions on how to compute the performance metrics.

Contributors

The code in this repository was mainly contributed by Luyao Yuan and Binghao Deng. The evaluation code was provided by Yijie Guo.