Belief Revision Score

<img align="right" width="600" height="200" src="overview.png"> In this work, we focus on improving the captions generated by image-caption generation systems. We propose a novel re-ranking approach that leverages visual-semantic measures to identify the ideal caption that maximally captures the visual information in the image. Our re-ranker utilizes the Belief Revision framework (Blok et al., 2003) to calibrate the original likelihood of the top-n captions by explicitly exploiting the semantic relatedness between the depicted caption and the visual context. Our experiments demonstrate the utility of our approach, where we observe that our re-ranker can enhance the performance of a typical image-captioning system without the necessity of any additional training or fine-tuning. <br/> <br/>

arXiv Open In Colab Website huggingface huggingface COLING - slide COLING - poster

This repository contains the implementation of the paper Belief Revision based Caption Re-ranker with Visual Semantic Information

Contents

  1. Overview
  2. Visual Re-ranking with Belief Revision
  3. Dataset
  4. Model
  5. Visual Re-ranking with Negative Evidence
  6. Semantic Diversity Evaluation
  7. Cloze Probability based Belief Revision
  8. Other Task: Sentence Semantic Similarity
  9. Citation

Visual Re-ranking with Belief Revision

Belief Revision is a conditional probability model which assumes that a preliminary probability estimate is revised to the extent warranted by the hypothesis's proof. In this work, the proof is the visual context information from the image, which is used to revise and select the candidate caption that correlates most directly with the image. The Belief Revision Score is written as:


$\text{P}(w \mid c)=\text{P}(w)^{\alpha}$

where the main components of hypothesis revision as a caption visual-semantic re-ranker are:

  1. Hypothesis (the candidate captions from beam search) $\text{P}(w)$, initialized by common observation (i.e. a language model)
  2. Informativeness $1-\text{P}(c)$ of the visual context from the image
  3. Similarity $\alpha=\left[\frac{1 - \text{sim}(w, c)}{1+\text{sim}(w, c)}\right]^{1-\text{P}(c)}$: the relatedness between the two concepts (visual context and hypothesis) with respect to the informativeness of the visual information

Here is a Gradio_Demo / Gradio_Demo_with_hypothesis that demonstrates Visual Re-ranking based on Belief Revision.
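
The following is a minimal Python sketch of the score above. It assumes you already have the caption probability P(w) from a language model, the visual context confidence P(c) from a classifier, and a caption/visual-context relatedness sim(w, c) in [0, 1] (e.g. an SBERT cosine); the helper name and the toy numbers are ours, not part of the released code.

def belief_revision_score(p_w, p_c, sim):
    # p_w : initial hypothesis probability P(w) from a language model
    # p_c : confidence P(c) of the visual context (classifier probability)
    # sim : semantic relatedness sim(w, c) between caption and visual context
    # alpha = [(1 - sim) / (1 + sim)] ^ (1 - P(c))
    alpha = ((1.0 - sim) / (1.0 + sim)) ** (1.0 - p_c)
    # P(w | c) = P(w) ^ alpha -- a caption closely related to the visual
    # context gets a smaller alpha and therefore a higher revised probability
    return p_w ** alpha

# Toy example: two candidate captions, visual context detected with P(c) = 0.6
p_c = 0.6
for caption, p_w, sim in [
    ("a city street filled with traffic at night", 1e-4, 0.72),
    ("a city street covered in snow at night", 2e-4, 0.31),
]:
    print(caption, belief_revision_score(p_w, p_c, sim))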

Dataset

We enrich the COCO-Caption dataset with textual visual context information. We use out-of-the-box visual classifiers to extract object information for each COCO-Caption image.

| VC1 | VC2 | VC3 | human annotated caption |
| --- | --- | --- | --- |
| cheeseburger | plate | hotdog | a plate with a hamburger fries and tomatoes |
| bakery | dining table | website | a table having tea and a cake on it |
| gown | groom | apron | its time to cut the cake at this couples wedding |

More information about the visual context extraction can be found in the paper

Model

Here, we describe in more detail the implementation of belief revision as a visual re-ranker. We show that by integrating visual context information, a more descriptive caption is re-ranked higher. Our model can be used as a drop-in complement for any caption generation algorithm that outputs a list of candidate captions (e.g. beam search, nucleus sampling, etc.).

To run the Belief Revision Score via visual context directly with GPT-2 and SRoBERTa-sts:

conda create --name BRscore  python=3.7
source activate BRscore
# tested with sentence_transformers-2.2.0
pip install sentence_transformers 

Run the demo with GPT-2 + SRoBERTa; the result is written to Belief-revision_re-rank.txt. There is also a Hugging Face Gradio_Demo.

python model.py --c caption_demo.txt --vis visual_context_label_demo.txt --vis_prob visual_context_prob_demo.txt

There is also an interactive demo with Hugging Face Gradio; run the code below or use the Colab here

pip install gradio 
python demo.py 

To run each step separately, which gives you the flexibility to use a different SoTA model (or your own custom model):

First, we need to initialize the hypothesis with common observation, i.e. a language model (GPT-2):

conda create -n LM-GPT python=3.7 anaconda
conda activate LM-GPT
pip install lm-scorer
python model/LM-GPT-2.py 
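
If you want to inspect this step outside LM-GPT-2.py, a minimal sketch with the lm-scorer package installed above could look like the following; the caption file name is the demo file used earlier, and exact lm-scorer arguments may differ slightly between versions.

# Sketch: sentence-level P(w) for each candidate caption with GPT-2.
from lm_scorer.models.auto import AutoLMScorer as LMScorer

scorer = LMScorer.from_pretrained("gpt2", device="cpu", batch_size=1)

with open("caption_demo.txt") as f:
    captions = [line.strip() for line in f if line.strip()]

for caption in captions:
    # product of per-token probabilities
    p_w = scorer.sentence_score(caption, reduce="prod")
    print(caption, p_w)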

Second, we need the visual context from the image, which requires visual classifiers:

python  model/run-visual.py
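
run-visual.py relies on out-of-the-box visual classifiers; as a rough illustrative stand-in (not the repository's exact settings), an off-the-shelf torchvision ResNet152 can provide a top visual concept and its confidence P(c):

# Sketch: top visual concept and its confidence with a pre-trained ResNet152.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet152(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = torch.softmax(model(image)[0], dim=0)

conf, idx = probs.max(dim=0)
print("top class index:", idx.item(), "confidence P(c):", round(conf.item(), 3))
# Map idx to a human-readable label with an ImageNet class-index file.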

Finally, we compute the relatedness between the two concepts (visual context and hypothesis).

Using a fine-tuned BERT

python BERT/train_model_VC.py 

Or general-purpose SBERT with cosine similarity

conda create -n SBERT python=3.7 anaconda
conda activate SBERT
pip install sentence-transformers
python model/SBERT_model_VC.py
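
A minimal version of this similarity step with sentence-transformers could look like the following; the checkpoint name is our guess at a general-purpose SRoBERTa-sts model, and any SBERT checkpoint can be swapped in.

# Sketch: relatedness sim(w, c) between caption and visual context via SBERT.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")

caption = "a city street filled with traffic at night"
visual_context = "traffic light"

embeddings = model.encode([caption, visual_context], convert_to_tensor=True)
sim = util.cos_sim(embeddings[0], embeddings[1]).item()
print("sim(w, c) =", round(sim, 3))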

Then run the demo for Example 1/2 (below):

 python model/Example_1/model.py --lm LM.txt --vis visual_context_lable.txt --vis_prob visual_context_prob.txt --c caption.txt

Note that each score is computed separately here (each score is in a separate file)

Go here for more details

Demo

Here are examples with the SBERT-based model.

Example 1

<img align="center" width="400" height="200" src="example.jpg">

Baseline beam = 5

a city street filled with traffic at night       	 
a city street covered in snow at night	 
a city street covered in traffic at night time	 
a city street filled with traffic surrounded by tall buildings	 
a city street covered in traffic at night	 

Visual re-ranked beam search = 5

a city street filled with traffic surrounded by tall buildings 
a city street filled with traffic and traffic lights 
a city street filled with traffic surrounded by snow 
a city street filled with traffic at night 
a city street at night with cars and street lights 

We re-ranked the best 5 beams from 9 candidate captions, generated by the baseline, using the visual context information.

Example 2

<img align="center" width="400" height="200" src="example-2.jpg">

Baseline beam = 5

a longhorn cow with horns standing in a field 
two bulls standing next to each other	 
two bulls with horns standing next to each other	 
two bulls with horns standing next to each other	 
two bulls with horns standing next to each other

Visual re-ranked beam search = 5

two bulls standing next to each other 
a couple of bulls standing next to each other 
two bulls with horns standing next to each other 
two long horn bulls standing next to each other 
a longhorn cow with horns standing in a field 

We re-ranked the best 5 beams from 20 candidate captions, generated by the baseline, using the visual context information.

Visual Re-ranking with Negative Evidence

Until now, following the same concept, we considered only the cases where the visual context increases the belief in the hypothesis. However, the same work proposes another idea for the case where the absence of evidence leads to a decrease in the hypothesis probability.


$\text{P}(w \mid \neg c)=1-(1-\text{P}(w))^{\alpha}$

In our case, we introduce negative evidence in two ways: (1) objects detected by the object classifier (e.g. ResNet152) with very low confidence, i.e. objects that are not actually present in the image, are used as negative evidence; and (2) objects detected with high confidence in the image are used as a query to pre-trained 840B GloVe to retrieve close concepts, and the retrieved concepts that are not detected/present in the image are employed as negative evidence.
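
As a rough sketch of this negative revision (our own helper; we assume alpha is computed as in the positive case but from the negative concept, which may differ from the paper's exact parameterization):

def negative_evidence_score(p_w, p_neg, sim_neg):
    # p_w     : initial caption probability P(w) from the language model
    # p_neg   : confidence assigned to the negative-evidence concept
    # sim_neg : relatedness sim(w, neg) between caption and negative concept
    alpha = ((1.0 - sim_neg) / (1.0 + sim_neg)) ** (1.0 - p_neg)
    # P(w | not c) = 1 - (1 - P(w))^alpha, which lowers P(w) when alpha < 1
    return 1.0 - (1.0 - p_w) ** alpha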

Example 1 with Negative Evidence

<img align="center" width="400" height="200" src="example.jpg">

In this example, we will use the second method with pre-trained GloVe vectors to extract negative information that is related to the visual context but not detected in the image.

conda create -n GloVe python=3.8 anaconda
conda activate GloVe
pip install gensim==4.1.0
python Negative_Evidence-model/similar_vector.py
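
The snippet below sketches what this retrieval step does with gensim; note that gensim's downloader does not ship the 840B Common Crawl GloVe used in the paper, so a smaller GloVe is used here as a stand-in (alternatively, load a locally converted glove.840B.300d file as KeyedVectors).

# Sketch: concepts close to the detected visual context but absent from the
# image are candidate negative evidence (e.g. "accident" in the example below).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-300")

detected = ["traffic", "street", "car"]  # high-confidence visual context
neighbors = vectors.most_similar(positive=detected, topn=10)

negative_evidence = [word for word, _ in neighbors if word not in detected]
print(negative_evidence)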

The extracted negative visual information here is **accident**, which will be used to decrease the hypothesis probability. Note that the negative visual information also needs to be initialized by common observation (python model/LM-GPT-2.py).

python Example_negtive_1/negative_evidence_SBERT.py --lm LM.txt --visNg negtive_visual.txt  --visNg_init negtive_visual_init.txt --c caption.txt

Baseline beam = 5

a city street filled with traffic at night       	 
a city street covered in snow at night	 
a city street covered in traffic at night time	 
a city street filled with traffic surrounded by tall buildings	 
a city street covered in traffic at night	 

Visual re-ranked beam search = 5 with negative evidence

a city street filled with traffic surrounded by tall buildings
a city street filled with traffic surrounded by snow
a city street filled with traffic and traffic lights
a city street filled with traffic at night
a city street at night with cars and street lights

We re-ranked the best 5 beams from 9 candidate captions, generated by the baseline, using negative visual context information.

For more examples

Semantic Diversity Evaluation

Sentence-to-sentence semantic similarity for semantic diversity evaluation

Inspired by BERTscore, we propose a sentence-to-sentence semantic similarity score to compare candidate captions with human references. We employ pre-trained Sentence-RoBERTa-L tuned for the general STS task. SBERT-sts uses a siamese network to derive meaningful sentence embeddings that can be compared via cosine similarity.

For more detail and other diversity evaluation

Example

| Model | caption | BERTscore | SBERT-sts* | Human subject |
| --- | --- | --- | --- | --- |
| B-best | two bulls with horns standing next to each other | 0.89 | 0.75 | 16.7 |
| B+VR | two long horn bulls standing next to each other | 0.88 | <b>0.81</b> | <b>0.83</b> |
| Human | a closeup of two red-haired bulls with long horns | | | |

(*) max(sim(ref_k, candidate caption)), k = 5 human references

To find the cosine score

python SBERT-caption-eval/SBERT_eval_demo.py --ref ref_demo.txt --hyp hyp-demo_BeamS.txt or hyp-demo_visual_re-ranked.txt
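
If you only need the (*) score itself, here is a short sketch of the max-over-references cosine with sentence-transformers (the checkpoint name is our assumption for the SRoBERTa-L-sts model):

# Sketch: SBERT-sts score = max over the k references of cos(ref_k, candidate).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")

candidate = "two long horn bulls standing next to each other"
references = [
    "a closeup of two red-haired bulls with long horns",
    # ... remaining human references for this image
]

cand_emb = model.encode(candidate, convert_to_tensor=True)
ref_emb = model.encode(references, convert_to_tensor=True)
print("SBERT-sts =", util.cos_sim(cand_emb, ref_emb).max().item())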

Cloze Probability based Belief Revision

Cloze probability is the probability that a given word will be produced to complete a given context in a sentence completion task (here, the last word).

The girl eats the **toast**  --> low probability 
The girl eats the **eggs** --> high probability 
cloze_prob('The girl eats the toast')
0.0004592319609081579

cloze_prob('The girl eats the eggs')
0.00436875504749275
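
cloze_prob comes from the repository's cloze_prob/ code; as an illustrative reading of what it estimates (not the exact implementation), the sketch below scores the final word given its preceding context with GPT-2 from Hugging Face Transformers.

# Sketch: probability of the last word given the preceding context (GPT-2).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def cloze_prob_sketch(sentence):
    context, last_word = sentence.rsplit(" ", 1)
    ids = tokenizer.encode(context, return_tensors="pt")
    prob = 1.0
    # The last word may span several BPE tokens; multiply their probabilities.
    for wid in tokenizer.encode(" " + last_word):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        prob *= torch.softmax(logits, dim=-1)[wid].item()
        ids = torch.cat([ids, torch.tensor([[wid]])], dim=1)
    return prob

print(cloze_prob_sketch("The girl eats the toast"))
print(cloze_prob_sketch("The girl eats the eggs"))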

For caption re-ranking

cloze_prob('a city street at night with cars and street lamps')
0.12100567314071672

cloze_prob('a city street filled with traffic and traffic lights')
0.40925383021394385

Cloze Probability based Belief Revision

a city street at night with cars and street lamps 0.18269163340435274
a city street filled with traffic and traffic lights 3.0824517390664777e-16

The first sentence is more diverse, without any repetition of the word traffic.

To run this,

python cloze_prob/model_coze.py  --c caption.txt --vis visual_context_label.txt --vis_prob visual_context_prob.txt 

Go here for more details and code

Other Task: Sentence Semantic Similarity

Belief Revision can also be applied to sentence semantic similarity tasks:

For a quick start, see Open In Colab; for a comparison with other similarity-based approaches, see Open In Colab.

conda create --name BR-S  python=3.7
source activate BR-S
# tested with sentence_transformers-2.2.0
pip install sentence_transformers 

run

python  sent_model/model_sent.py --sent sent.txt --context_sent context_sent.txt  --output score.txt

Example:

sent =  'Obama speaks to the media in Illinois' 
context_sentence =  'The president greets the press in Chicago'

The two sentences are related but not similar, and the belief revision score captures this relatedness better than plain semantic similarity.

# SBERT cosine 
Cosine = 0.62272817

# belief_revision score 
belief_revision = 0.557584688720967

Citation

The details of this repo are described in the following paper. If you find this repo useful, please kindly cite it:

@article{sabir2022belief,
  title={Belief Revision based Caption Re-ranker with Visual Semantic Information},
  author={Sabir, Ahmed and Moreno-Noguer, Francesc and Madhyastha, Pranava and Padr{\'o}, Llu{\'\i}s},
  journal={arXiv preprint arXiv:2209.08163},
  year={2022}
}

Acknowledgement

The implementation of the Belief Revision Score relies on resources from <a href="https://github.com/simonepri/lm-scorer">lm-scorer</a>, <a href="https://github.com/huggingface/transformers">Huggingface Transformers</a>, and <a href="https://www.sbert.net/">SBERT</a>. We thank the original authors for their well-organized codebases.