A negative case analysis of visual grounding methods for VQA (ACL 2020 short paper)

Recent works in VQA attempt to improve visual grounding by training the model to attend to query-relevant visual regions. Such methods have claimed impressive gains on challenging datasets such as VQA-CP. However, in this work we show that the boosts in performance come from a regularization effect rather than from proper visual grounding.

Visual Grounding

This repo is based on the Self-Critical Reasoning codebase.

Install dependencies

We use Anaconda to manage our dependencies. Execute the following steps to install them:
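
A minimal sketch of the setup, assuming a conda environment (the environment name, Python version, and the presence of a requirements.txt file are assumptions, not confirmed by this repo):

# Create and activate a fresh environment (name and Python version are placeholders)
conda create -n negative_grounding python=3.6
conda activate negative_grounding

# Install the Python dependencies (assumes a requirements.txt is provided)
pip install -r requirements.txt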

Executing scripts

Before executing any of the scripts, make sure the main project directory is on your PYTHONPATH:

cd ${PROJ_DIR} && export PYTHONPATH=.
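
For example, the baseline training script described below would be launched from the project root like this (a sketch; ${PROJ_DIR} is the path to your clone):

cd ${PROJ_DIR}
export PYTHONPATH=.
./scripts/baseline/vqacp2_baseline.sh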

Setting up data

Training baseline model

We provide pre-trained models for both VQAv2 and VQA-CPv2 here.

To train the baselines yourself, execute ./scripts/baseline/vqacp2_baseline.sh.

Training state-of-the-art models

Setting up data

The following scripts train HINT/SCR with: a) relevant cues, b) irrelevant cues, c) fixed random cues, and d) varying random cues:

Training HINT [1]

Execute ./scripts/hint/vqacp2_hint.sh for VQA-CPv2

Execute ./scripts/hint/vqa2_hint.sh for VQAv2

Training SCR [2]

Execute ./scripts/scr/vqacp2_scr.sh for VQA-CPv2

Execute ./scripts/scr/vqa2_scr.sh for VQAv2

Note: By default, HINT and SCR are trained only on the subset of examples with visual cues. To train on the full dataset, specify the --do_not_discard_items_without_hints flag.
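
For example, to train HINT on the full VQA-CPv2 dataset (a sketch that assumes the wrapper script forwards extra arguments to the underlying training command; otherwise, add the flag to the python command inside the script):

# Assumes vqacp2_hint.sh passes extra flags through to the training command
./scripts/hint/vqacp2_hint.sh --do_not_discard_items_without_hints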

Training our 'zero-out' regularizer

Analysis

Computing rank correlation

Please refer to scripts/analysis/compute_rank_correlation.sh for a sample script that can be used to compute rank correlations. The script uses the object sensitivity files generated during training/evaluation.
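
A sketch of a typical invocation from the project root (it is an assumption that the sensitivity-file paths are configured inside the script; edit them to point at your own training/evaluation outputs):

cd ${PROJ_DIR}
export PYTHONPATH=.
# Assumption: sensitivity-file paths are set inside the script and may need editing
./scripts/analysis/compute_rank_correlation.sh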

References

[1] Selvaraju, Ramprasaath R., et al. "Taking a hint: Leveraging explanations to make vision and language models more grounded." Proceedings of the IEEE International Conference on Computer Vision. 2019.

[2] Wu, Jialin, and Raymond Mooney. "Self-Critical Reasoning for Robust Visual Question Answering." Advances in Neural Information Processing Systems. 2019.

Citation

@inproceedings{shrestha-etal-2020-negative,
    title = "A negative case analysis of visual grounding methods for {VQA}",
    author = "Shrestha, Robik  and
      Kafle, Kushal  and
      Kanan, Christopher",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.727",
    pages = "8172--8181"
}