Contrastive Learning for Weakly Supervised Phrase Grounding

By Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and Derek Hoiem

(ECCV 2020 Spotlight)

<p align="center"> <img src="imgs/info-ground-arch.png"> </p>

Available on arXiv: https://arxiv.org/abs/2006.09920

Project Page: http://tanmaygupta.info/info-ground/

BibTeX:

@inproceedings{gupta2020contrastive,
  title={Contrastive Learning for Weakly Supervised Phrase Grounding},
  author={Gupta, Tanmay and Vahdat, Arash and Chechik, Gal and Yang, Xiaodong and Kautz, Jan and Hoiem, Derek},
  booktitle={ECCV},
  year={2020}
}

Requirements

Create a conda environment with all dependencies provided in the environment.yml file using

conda env create -f environment.yml

Activate the environment with

conda activate info-ground

All commands in the following sections are to be executed in the same directory as this README.md file.

Setup file paths and data

<details><summary>COCO</summary>

Update the following paths in yaml/coco.yml:

In my setup downloads_dir, proc_dir, and exp_dir are directories on a shared NFS storage while image_dir and local_proc_dir point to local storage.
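Before running the scripts, it can help to confirm that every configured directory exists. The following is a minimal sketch, assuming the paths have already been read into a flat dict keyed by the names mentioned above. The helper `check_paths` and the placeholder values are hypothetical and not part of this repo, and the actual layout of yaml/coco.yml may differ.

```python
from pathlib import Path

# Path keys named in this README; the values used below are placeholders only.
REQUIRED_KEYS = ("downloads_dir", "proc_dir", "exp_dir", "image_dir", "local_proc_dir")


def check_paths(config):
    """Return a list of problems found in a dict of configured paths."""
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: not set")
        elif not Path(value).is_dir():
            problems.append(f"{key}: directory does not exist ({value})")
    return problems


if __name__ == "__main__":
    # Hypothetical values for illustration; replace with your own paths.
    example = {key: f"/tmp/info-ground/{key}" for key in REQUIRED_KEYS}
    for problem in check_paths(example):
        print(problem)
```

An empty list means all five directories were found. Some scripts may create missing directories themselves, so treat a reported problem as a prompt to check rather than a hard error.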

Once the paths are set up in yaml/coco.yml, run the following:

# download COCO images and annotations to downloads_dir
python -m data.coco.download
# extract annotations to coco_proc
python -m data.coco.extract_annos
# extract images to image_dir
python -m data.coco.extract_images
</details> <details><summary>Flickr</summary>

Set the download_dir variable in data/flickr/download.sh to the location where you would like to download the Flickr30K Entities annotations and splits from the GitHub repository. Then run the following to download and extract the contents of the downloaded annotations.zip file in the same directory:

# clone Flickr30K Entities github repo and extract annotations and splits
bash data/flickr/download.sh
# process annotations into easy to read json files
bash data/flickr/process_annos.sh

For access to Flickr30K images, please follow the instructions <a href="http://bryanplummer.com/Flickr30kEntities/">here</a>. You might be required to fill out a form. Download the images to a convenient directory; its path will be referred to as image_dir.

Now, update the following paths in yaml/flickr.yml:

In my setup downloads_dir, proc_dir, and exp_dir are directories on a shared NFS storage while image_dir and local_proc_dir point to local storage.

</details>

Get object detections

We provide detections for COCO and Flickr30K images computed using a Faster R-CNN model trained on Visual Genome object and attribute annotations. This model was originally used in the Bottom-Up and Top-Down Attention work and later reused in Align2Ground, a recent weakly supervised phrase grounding work that we compare to.

We use a lightly modified fork of the PyTorch implementation available here to extract bounding boxes, scores, and features from a set of images and save them in HDF5 format.

Download and extract detections to a desired location:

Update det_dir in yaml/coco.yml or yaml/flickr.yml to the location where the detections were extracted.

Construct context-preserving negative captions

Follow the instructions for whichever dataset you want to train on.

<details><summary><b>Step 1:</b> Identify noun tokens to be substituted</summary>
# For COCO
bash exp/gen_noun_negatives/scripts/identify_tokens.sh train
bash exp/gen_noun_negatives/scripts/identify_tokens.sh val

# For Flickr
bash exp/gen_noun_negatives/scripts/identify_tokens_flickr.sh train
bash exp/gen_noun_negatives/scripts/identify_tokens_flickr.sh val

This creates the following files in <proc_dir>/annotations:

</details> <details><summary><b>Step 2:</b> Sample substitute words</summary>
# For COCO
bash exp/gen_noun_negatives/scripts/sample_neg_bert.sh train
bash exp/gen_noun_negatives/scripts/sample_neg_bert.sh val

# For Flickr
bash exp/gen_noun_negatives/scripts/sample_neg_bert_flickr.sh train
bash exp/gen_noun_negatives/scripts/sample_neg_bert_flickr.sh val

This creates the following files in <proc_dir>:

</details> <details><summary><b>Step 3:</b> Cache contextualized representations of the substituted words</summary>
# For COCO
bash exp/gen_noun_negatives/scripts/cache_neg_fetures.sh train
bash exp/gen_noun_negatives/scripts/cache_neg_fetures.sh val

# For Flickr
bash exp/gen_noun_negatives/scripts/cache_neg_fetures_flickr.sh train
bash exp/gen_noun_negatives/scripts/cache_neg_fetures_flickr.sh val

This creates the following files in <proc_dir>:

</details>
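To make the idea behind the three steps above concrete, here is a toy sketch of building negative captions by swapping a single noun while leaving the rest of the caption (its context) intact. It is only an illustration: the candidate words are hand-written here, whereas the script names above suggest the repo samples substitutes with BERT, and the function `substitute_noun` is hypothetical.

```python
def substitute_noun(tokens, noun_index, candidates):
    """Build negative captions by swapping one noun for each candidate word."""
    original = tokens[noun_index]
    negatives = []
    for word in candidates:
        if word == original:
            continue  # a negative caption must differ from the true caption
        swapped = list(tokens)  # copy so the original caption is left untouched
        swapped[noun_index] = word
        negatives.append(" ".join(swapped))
    return negatives


if __name__ == "__main__":
    caption = "a dog chases a ball".split()
    # Index 1 is the noun "dog"; the candidates are made up for this example.
    for negative in substitute_noun(caption, 1, ["cat", "dog", "horse"]):
        print(negative)
```

Every word other than the substituted noun is preserved, so the negative caption differs from the original only in the object it names.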

Learn to ground

Once we have the following, we are ready to train our grounding model:

<details><summary><b>Step 1:</b> Identify noun and adjective tokens to estimate mutual information with the image regions</summary>
# For COCO
bash exp/ground/scripts/identify_noun_adj_tokens.sh train
bash exp/ground/scripts/identify_noun_adj_tokens.sh val

# For Flickr
bash exp/ground/scripts/identify_noun_adj_tokens_flickr.sh train
bash exp/ground/scripts/identify_noun_adj_tokens_flickr.sh val

This creates <proc_dir>/annotations/noun_adj_tokens_<subset>.json

</details> <details><summary><b>Step 2:</b> Copy over detections and cached features from NFS (proc_dir) to local storage (local_proc_dir)</summary>

This may reduce training time if, for instance, <proc_dir> is a slow shared NFS and <local_proc_dir> is a faster local drive. Otherwise you may skip this step and set <local_proc_dir> to the same path as <proc_dir>.

To copy, modify the path variables NFS_DATA and LOCAL_DATA in setup_coco.sh or setup_flickr.sh and execute:

# For COCO
bash setup_coco.sh

# For Flickr
bash setup_flickr.sh
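
If you prefer to do the copy from Python, a rough equivalent could look like the sketch below. This is an assumption about what such a copy involves, not a description of what setup_coco.sh or setup_flickr.sh actually do; the function `copy_to_local` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_to_local(nfs_data, local_data):
    """Recursively copy nfs_data into local_data and return the file count there."""
    src, dst = Path(nfs_data), Path(local_data)
    # dirs_exist_ok (Python 3.8+) lets the copy be re-run over an existing destination.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for path in dst.rglob("*") if path.is_file())
```

Here `nfs_data` and `local_data` play the roles of NFS_DATA and LOCAL_DATA. Re-running the function overwrites files that already exist at the destination.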
</details> <details><summary><b>Step 3:</b> Start training</summary>
# For COCO
bash exp/ground/scripts/train.sh model_trained_on_coco coco

# For Flickr
bash exp/ground/scripts/train.sh model_trained_on_flickr flickr

# General form
bash exp/ground/scripts/train.sh <exp_name> <training_dataset>
</details>
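For readers new to contrastive objectives, the sketch below shows the generic InfoNCE loss, in which the true caption competes against negative captions for the same image. It is not taken from the training script: the scores are hypothetical image-caption compatibility scores, and the repo's actual objective may differ in its details.

```python
import math


def info_nce_loss(scores, positive_index=0):
    """InfoNCE loss: negative log-softmax of the positive score over all candidates."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[positive_index]


if __name__ == "__main__":
    # Made-up scores: the first entry is the true caption, the rest are negatives.
    print(info_nce_loss([2.0, 0.0, 0.0]))
```

With K candidates in total, log K minus this loss is a lower bound on the mutual information between the two inputs, which is why minimizing the loss is described as maximizing a mutual information estimate.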

Evaluate on Flickr

To evaluate on Flickr, follow the instructions above to set up Flickr file paths, download and extract the dataset, and download object detections. If needed, also run setup_flickr.sh to copy files from NFS to local disk after modifying the NFS_DATA and LOCAL_DATA paths in the script.

<details><summary><b>Model Selection</b></summary>

As noted in our paper, we use ground truth annotations in the Flickr validation set for model selection. To perform model selection, run:

# For COCO
bash exp/ground/scripts/eval_flickr_phrase_loc_model_selection.sh model_trained_on_coco coco

# For Flickr
bash exp/ground/scripts/eval_flickr_phrase_loc_model_selection.sh model_trained_on_flickr flickr

# General form
bash exp/ground/scripts/eval_flickr_phrase_loc_model_selection.sh <exp_name> <training_dataset>
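
Conceptually, model selection amounts to keeping the checkpoint with the highest validation accuracy. The sketch below illustrates that idea only; the checkpoint names and accuracy values are invented, and how the script above stores or compares checkpoints is not shown in this README.

```python
def select_best(val_accuracy):
    """Return the checkpoint name with the highest validation accuracy."""
    if not val_accuracy:
        raise ValueError("no checkpoints to choose from")
    return max(val_accuracy, key=val_accuracy.get)


if __name__ == "__main__":
    # Hypothetical checkpoint names and accuracies, for illustration only.
    print(select_best({"step_1000": 61.2, "step_2000": 70.5, "step_3000": 68.9}))
```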
</details> <details><summary><b>Model Evaluation</b></summary>

To evaluate the selected model, run

# For COCO
bash exp/ground/scripts/eval_flickr_phrase_loc.sh model_trained_on_coco coco

# For Flickr
bash exp/ground/scripts/eval_flickr_phrase_loc.sh model_trained_on_flickr flickr

# General form
bash exp/ground/scripts/eval_flickr_phrase_loc.sh <exp_name> <training_dataset>

To give a sense of the variance to expect in pointing accuracy on Flickr30K Entities when training your own models with this repo, here is the performance of one run compared with the provided pretrained models:

| Training Dataset | Flickr Val Accuracy | Flickr Test Accuracy | Flickr Test Accuracy in Paper |
| --- | --- | --- | --- |
| COCO | 75.38 | 76.16 | 76.74 |
| Flickr | 73.57 | 74.79 | 74.94 |

<br> </details> <details><summary><b>Pretrained Models</b></summary>

We provide pretrained models trained on both COCO and Flickr to reproduce the numbers in our paper. See exp/ground/eval_flickr_phrase_loc.py and exp/ground/run/eval_flickr_phrase_loc.py to understand how to load the model.

</details> <details><summary><b>Visualize Results</b></summary>

To visualize grounding on Flickr val set, execute the following:

# For COCO
bash exp/ground/scripts/vis_att.sh model_trained_on_coco coco

# For Flickr
bash exp/ground/scripts/vis_att.sh model_trained_on_flickr flickr

# General form
bash exp/ground/scripts/vis_att.sh <exp_name> <training_dataset>

This creates HTML pages at <exp_dir>/vis/attention_flickr that visualize the top 3 predicted bounding boxes for each word in the caption. Open imgs/example_visualization/index.html in a browser for an example visualization generated by this script.
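
Picking the top 3 boxes for a word can be thought of as ranking the detected regions by that word's score for each region. The sketch below is a minimal illustration under that assumption; the function `top_k_boxes`, the box format, and the scores are hypothetical and not taken from vis_att.sh.

```python
import heapq


def top_k_boxes(scores, boxes, k=3):
    """Return the k boxes with the highest scores, best first."""
    if len(scores) != len(boxes):
        raise ValueError("scores and boxes must have the same length")
    # Pair each score with its box index so the ranking can be mapped back to boxes.
    ranked = heapq.nlargest(k, zip(scores, range(len(boxes))))
    return [boxes[index] for _, index in ranked]


if __name__ == "__main__":
    # Made-up scores and (x1, y1, x2, y2) boxes for a single word.
    word_scores = [0.1, 0.5, 0.2, 0.9]
    region_boxes = [(0, 0, 5, 5), (10, 10, 20, 20), (3, 3, 8, 8), (1, 1, 4, 4)]
    print(top_k_boxes(word_scores, region_boxes))
```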

</details>
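The pointing accuracy reported in the Model Evaluation section can be sketched as follows, assuming the common definition in which a prediction counts as correct when the center of the predicted box falls inside the ground-truth box. The evaluation script's exact rule is not shown in this README, so treat the helper names and the (x1, y1, x2, y2) box format as assumptions.

```python
def box_center(box):
    """Return the (x, y) center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def point_in_box(point, box):
    """Return True if the point lies inside the box, edges included."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def pointing_accuracy(predicted_boxes, gt_boxes):
    """Percentage of predictions whose box center lies inside the ground-truth box."""
    if not gt_boxes or len(predicted_boxes) != len(gt_boxes):
        raise ValueError("need one predicted box per non-empty list of ground-truth boxes")
    hits = sum(
        point_in_box(box_center(pred), gt)
        for pred, gt in zip(predicted_boxes, gt_boxes)
    )
    return 100.0 * hits / len(gt_boxes)
```

Because only the center point is tested, a predicted box can be much larger or smaller than the ground-truth box and still count as correct, which makes this metric more lenient than an overlap threshold.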