<br /> <p align="center"> <h1 align="center">🔭 VSR: Visual Spatial Reasoning</h1> <h3 align="center">A probing benchmark for spatial understanding of vision-language models.</h3> <p align="center"> <a href="https://arxiv.org/abs/2205.00363">arxiv</a> · <a href="https://github.com/cambridgeltl/visual-spatial-reasoning/tree/master/data">dataset</a> · <a href="https://paperswithcode.com/sota/visual-reasoning-on-vsr">benchmark</a> </p> </p>

Update [Mar 22, 2023]: We updated our arXiv preprint with the camera-ready version and also updated the dataset in this repo to be consistent with the accepted paper. If you used an earlier version of VSR, you can refer to the earlier version of the preprint (v1) and the earlier snapshot of this repo. <br> Update [Feb 10, 2023]: Check out CLIP_visual-spatial-reasoning by @Sohojoe, where you can find CLIP's performance on VSR. <br> Update [Feb 3, 2023]: Visual Spatial Reasoning is accepted to TACL 🥂! Stay tuned for the camera-ready version!<br>


1 Overview

The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation between two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption correctly describes the image (True) or not (False). Below are a few examples.

| caption | label |
| --- | --- |
| The cat is behind the laptop. | True |
| The cow is ahead of the person. | False |
| The cake is at the edge of the dining table. | True |
| The horse is left of the person. | False |
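
As a rough illustration of the task format, each example can be viewed as a caption, an image reference, and a binary label, and a model is scored by accuracy. The sketch below is not the repository's code: the `VSRExample` class, its field names, the `accuracy` helper, and the image file names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class VSRExample:
    """One caption-image pair (field names are illustrative assumptions)."""
    image: str    # image file name (placeholder values below)
    caption: str  # statement about a spatial relation between two objects
    label: bool   # True if the caption correctly describes the image


def accuracy(examples: Iterable[VSRExample],
             predict: Callable[[VSRExample], bool]) -> float:
    """Fraction of examples where predict(example) matches the gold label."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(predict(ex) == ex.label for ex in examples)
    return correct / len(examples)


# The four example captions from this section, with placeholder image names.
toy = [
    VSRExample("img_0.jpg", "The cat is behind the laptop.", True),
    VSRExample("img_1.jpg", "The cow is ahead of the person.", False),
    VSRExample("img_2.jpg", "The cake is at the edge of the dining table.", True),
    VSRExample("img_3.jpg", "The horse is left of the person.", False),
]

# A trivial baseline that always answers True, regardless of the image.
always_true_acc = accuracy(toy, lambda ex: True)
print(always_true_acc)
```

Any real model would replace the lambda with a function that looks at both the image and the caption.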

1.1 Why VSR?

Understanding spatial relations is fundamental to achieving intelligence. Existing vision-language reasoning datasets are great, but they combine multiple types of challenges and can thus conflate different sources of error. The VSR corpus focuses specifically on spatial relations, so that model errors can be diagnosed accurately and interpreted easily.

1.2 What have we found?

Below is the baselines' by-relation performance on VSR (random split). More data != better performance. The relations are sorted by frequency from left to right. The VLMs' by-relation performance has little correlation with relation frequency, meaning that more training data does not necessarily lead to better performance.
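
One way to quantify "little correlation" is a rank correlation between per-relation training frequency and per-relation accuracy. The sketch below implements Spearman's rank correlation with the standard library only. It is an assumption about how such a check could be done, not necessarily the statistic used in the paper, and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

Calling `spearman(frequencies, accuracies)` with one entry per relation gives a value near 0 when frequency and accuracy are unrelated, and a value near 1 when more frequent relations are consistently easier.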

<img align="right" width="320" src="figures/performance_by_meta_cat_random_split_v4.png">

Understanding object orientation is hard. After classifying spatial relations into meta-categories, we can clearly see that all models are at chance level for "orientation"-related relations (such as "facing", "facing away from", "parallel to", etc.).

For more findings and takeaways, including zero-shot split performance, check out our paper!

2 The VSR dataset: Splits, statistics, and meta-data

The VSR corpus, after validation, contains 10,972 data points with high agreement. On top of these, we create two splits: (1) a random split and (2) a zero-shot split. For the random split, we randomly split all data points into train, development, and test sets. The zero-shot split ensures that the train, development, and test sets have no overlapping concepts (i.e., if dog is in the test set, it is not used for training or development). Below are some basic statistics of the two splits.

| split | train | dev | test | total |
| --- | --- | --- | --- | --- |
| random | 7,680 | 1,097 | 2,195 | 10,972 |
| zero-shot | 4,713 | 231 | 616 | 5,560 |

Check out data/ for more details.
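
The zero-shot property (no concept shared between train, development, and test sets) can be checked mechanically. The following is a minimal sketch, assuming each record is a dict that stores its two object concepts under the keys `subj` and `obj`. Those key names and the helper functions are assumptions for illustration, so adjust them to the actual fields in the data files.

```python
from typing import Dict, Iterable, List, Set, Tuple


def concepts(records: Iterable[dict],
             subj_key: str = "subj",
             obj_key: str = "obj") -> Set[str]:
    """Collect every object concept mentioned in a split."""
    found: Set[str] = set()
    for rec in records:
        found.add(rec[subj_key])
        found.add(rec[obj_key])
    return found


def shared_concepts(splits: Dict[str, List[dict]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of split names to the concepts the two splits share."""
    names = sorted(splits)
    per_split = {name: concepts(splits[name]) for name in names}
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlaps[(first, second)] = per_split[first] & per_split[second]
    return overlaps


# Toy records (not taken from the dataset) that satisfy the zero-shot property.
toy_splits = {
    "train": [{"subj": "cat", "obj": "laptop"}],
    "dev": [{"subj": "horse", "obj": "person"}],
    "test": [{"subj": "dog", "obj": "bench"}],
}
toy_overlaps = shared_concepts(toy_splits)
print(all(len(shared) == 0 for shared in toy_overlaps.values()))
```

A non-empty set for any pair of splits would indicate a concept that leaks between them.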

You can also load VSR from huggingface [🤗vsr_random] & [🤗vsr_zeroshot]:

from datasets import load_dataset

data_files = {"train": "train.jsonl", "dev": "dev.jsonl", "test": "test.jsonl"}
dataset = load_dataset("cambridgeltl/vsr_random", data_files=data_files)

Note that the image files still need to be downloaded separately as suggested in data/.
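
Because the annotations and the images are obtained separately, it can help to verify that every annotated image is present locally before training. The sketch below assumes that each line of a split file is a JSON object whose `image` key holds a file name relative to data/images/. The key name and both helper functions are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, List


def read_jsonl(path: str) -> List[dict]:
    """Read a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def missing_images(records: Iterable[dict],
                   image_dir: str = "data/images",
                   image_key: str = "image") -> List[str]:
    """Return the sorted, de-duplicated image names not found in image_dir."""
    root = Path(image_dir)
    return sorted({
        rec[image_key]
        for rec in records
        if not (root / rec[image_key]).is_file()
    })
```

For example, `missing_images(read_jsonl("train.jsonl"))` returns an empty list once all images referenced by that split have been downloaded (the path to the split file here is a placeholder).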

3 Baselines: Performance

We test four baselines, all supported in huggingface. They are VisualBERT (Li et al. 2019), LXMERT (Tan and Bansal, 2019), ViLT (Kim et al. 2021), and CLIP (Radford et al. 2021).

| model | random split | zero-shot |
| --- | --- | --- |
| human | 95.4 | 95.4 |
| CLIP (frozen) | 56.0 | 54.5 |
| CLIP (finetuned)* | 65.1 | - |
| VisualBERT | 55.2 | 51.0 |
| ViLT | 69.3 | 63.0 |
| LXMERT | 70.1 | 61.2 |

*CLIP (finetuned) result is from here.

4 Baselines: How to run?

Download images

See the README in the data/ folder. Images should be saved under data/images/.

Environment

Depending on your system configuration and CUDA version, you might need two separate environments: one for feature extraction (i.e., the "Extract visual embeddings" section below) and one for all other experiments. You can install the feature extraction environment by running feature_extraction/feature_extraction_environment.sh (specifically, feature extraction requires detectron2==0.5, CUDA==11.1 and torch==1.8). The default configuration for everything else can be found in requirements.txt.

Extract visual embeddings

For VisualBERT and LXMERT, we first need to extract visual embeddings using pre-trained object detectors. This can be done by running:

bash feature_extraction/lxmert/extract.sh

VisualBERT feature extraction is done similarly by replacing lxmert with visualbert. The features will be stored under data/features/ and automatically loaded when running the training and evaluation scripts of LXMERT and VisualBERT. The feature extraction code is adapted from the huggingface examples here (for VisualBERT) and here (for LXMERT).

Train

scripts/ contains some example bash scripts for training and evaluation. For example, the following script trains LXMERT on the random split:

bash scripts/lxmert_train.sh 0

where 0 denotes the device index. Configurations such as the path for saving checkpoints can be modified in the script.

Evaluation

Similarly, evaluating the obtained LXMERT model can be done by running:

bash scripts/lxmert_eval.sh 0

Configurations such as the path for loading checkpoints can be modified in the script.

In analysis_scripts/ you can check out how to print by-relation and by-meta-category accuracies.
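
For orientation, the idea behind a by-relation or by-meta-category breakdown can be sketched in a few lines. This is not the code in analysis_scripts/. The `grouped_accuracy` function, the `(relation, gold, predicted)` row format, and the partial relation-to-meta-category mapping are hypothetical, and only the three "orientation" relations named earlier in this README are mapped.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Optional, Tuple


def grouped_accuracy(rows: Iterable[Tuple[str, int, int]],
                     group_of: Optional[Mapping[str, str]] = None) -> Dict[str, float]:
    """Accuracy per relation, or per meta-category if group_of is given.

    rows yields (relation, gold_label, predicted_label) triples.
    Relations missing from group_of are pooled under "other".
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for relation, gold, predicted in rows:
        key = relation if group_of is None else group_of.get(relation, "other")
        total[key] += 1
        correct[key] += int(gold == predicted)
    return {key: correct[key] / total[key] for key in total}


# Toy predictions (made up for illustration, not model outputs).
toy_rows = [
    ("behind", 1, 1),
    ("behind", 0, 1),
    ("facing", 1, 0),
    ("facing away from", 0, 0),
]

# Partial, illustrative mapping from relation to meta-category.
toy_meta = {
    "facing": "orientation",
    "facing away from": "orientation",
    "parallel to": "orientation",
}

by_relation = grouped_accuracy(toy_rows)
by_meta_category = grouped_accuracy(toy_rows, toy_meta)
print(by_relation)
print(by_meta_category)
```

Passing a complete relation-to-meta-category mapping instead of `toy_meta` would give one accuracy per meta-category with no "other" bucket.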

Citation

If you find VSR useful:

@article{Liu2022VisualSR,
  title={Visual Spatial Reasoning},
  author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
  journal={Transactions of the Association for Computational Linguistics},
  year={2023},
}

License

This project is licensed under the Apache-2.0 License.