Belief Revision Score
<img align="right" width="600" height="200" src="overview.png"> In this work, we focus on improving the captions generated by image-caption generation systems. We propose a novel re-ranking approach that leverages visual-semantic measures to identify the ideal caption that maximally captures the visual information in the image. Our re-ranker utilizes the Belief Revision framework (Blok et al., 2003) to calibrate the original likelihood of the top-n captions by explicitly exploiting the semantic relatedness between the depicted caption and the visual context. Our experiments demonstrate the utility of our approach, where we observe that our re-ranker can enhance the performance of a typical image-captioning system without the necessity of any additional training or fine-tuning. <br/> <br/>This repository contains the implementation of the paper Belief Revision based Caption Re-ranker with Visual Semantic Information
Contents
- Overview
- Visual Re-ranking with Belief Revision
- Dataset
- Model
- Visual Re-ranking with Negative Evidence
- Semantic Diversity Evaluation
- Cloze Probability based Belief Revision
- Other Task: Sentence Semantic Similarity
- Citation
Visual Re-ranking with Belief Revision
Belief Revision is a conditional probability model in which a preliminary probability estimate is revised to the extent warranted by additional evidence (in this work, the evidence is the visual context extracted from the image, which is used to revise the candidate captions and select the one most related to the image). The Belief Revision Score is written as:
$\text{P}(w \mid c)=\text{P}(w)^{\alpha}$
where the main components of hypothesis revision as a caption visual-semantic re-ranker are:
- Hypothesis $\text{P}(w)$ (caption candidates from beam search), initialized by common observation (i.e. a language model)
- Informativeness $1-\text{P}(c)$ of the visual context from the image
- Similarities $\alpha=\left[\frac{1 - \text{sim}(w, c)}{1+\text{sim}(w, c)}\right]^{1-\text{P}(c)}$: the relatedness between the two concepts (visual context and hypothesis) with respect to the informativeness of the visual information (see the sketch below)
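To make the score concrete, here is a minimal sketch of how these three components combine, assuming the caption probability, the visual classifier confidence, and a caption-context similarity are already available; the function name and example numbers are illustrative only.

```python
# Minimal sketch of the Belief Revision Score; names and numbers are illustrative.
def belief_revision_score(p_w, p_c, sim_wc):
    """Revise the caption probability p_w given the visual context.

    p_w    -- hypothesis probability of the caption (e.g. from a language model)
    p_c    -- confidence of the visual classifier for the context label
    sim_wc -- semantic similarity between the caption and the visual context
    """
    alpha = ((1.0 - sim_wc) / (1.0 + sim_wc)) ** (1.0 - p_c)
    return p_w ** alpha

# A caption that is strongly related (sim = 0.8) to a confidently detected
# visual context (P(c) = 0.9) is revised upward: 0.05 -> ~0.09.
print(belief_revision_score(p_w=0.05, p_c=0.9, sim_wc=0.8))
```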
Here is a Gradio_Demo / Gradio_Demo_with_hypothesis showing visual re-ranking with Belief Revision.
Dataset
We enrich COCO-caption with Textual Visual Context information. We use out-of-the-box visual classifiers to extract object information for each COCO-caption image.
VC1 | VC2 | VC3 | human annotated caption |
---|---|---|---|
cheeseburger | plate | hotdog | a plate with a hamburger fries and tomatoes |
bakery | dining table | website | a table having tea and a cake on it |
gown | groom | apron | its time to cut the cake at this couples wedding |
More information about the visual context extraction can be found in the paper.
Model
Here, we describe in more detail the implementation of belief revision as a visual re-ranker. We show that by integrating visual context information, a more descriptive caption is re-ranked higher. Our model can be used as a drop-in complement for any caption generation algorithm that outputs a list of candidate captions (e.g. beam search, nucleus sampling, etc.).
To run the Belief Revision Score via visual context directly with GPT-2 and SRoBERTa-sts:
conda create --name BRscore python=3.7
source activate BRscore
# tested with sentence_transformers-2.2.0
pip install sentence_transformers
Run the demo with GPT-2 + SRoBERTa; the result is written to Belief-revision_re-rank.txt (for the Hugging Face demo, see Gradio_Demo):
python model.py --c caption_demo.txt --vis visual_context_label_demo.txt --vis_prob visual_context_prob_demo.txt
There is also an interactive demo with Hugging Face Gradio; run the code below or use the Colab here:
pip install gradio
python demo.py
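As a rough illustration of what such an interactive demo can look like (the repository's demo.py may differ), a minimal Gradio app could wrap the scoring step as below; the SBERT checkpoint and field names are assumptions.

```python
# Hedged sketch of a minimal Gradio app around the belief revision scoring step.
# The SBERT checkpoint "all-MiniLM-L6-v2" and the field names are assumptions.
import gradio as gr
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")

def rerank(caption, caption_prob, visual_context, visual_prob):
    # Relatedness between the caption and the visual context label.
    sim = util.cos_sim(sbert.encode(caption), sbert.encode(visual_context)).item()
    sim = max(sim, 0.0)  # keep the similarity in [0, 1]
    alpha = ((1.0 - sim) / (1.0 + sim)) ** (1.0 - visual_prob)
    return caption_prob ** alpha

gr.Interface(
    fn=rerank,
    inputs=[gr.Textbox(label="caption"),
            gr.Number(label="caption LM probability"),
            gr.Textbox(label="visual context label"),
            gr.Number(label="visual classifier confidence")],
    outputs=gr.Number(label="belief revision score"),
).launch()
```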
To run each step separately, which gives you the flexibility to use different SoTA models (or your own custom model):
First, we need to initialize the hypothesis with a common observation, i.e. a language model (GPT-2):
conda create -n LM-GPT python=3.7 anaconda
conda activate LM-GPT
pip install lm-scorer
python model/LM-GPT-2.py
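For orientation, here is a minimal sketch of scoring candidate captions with lm-scorer; model/LM-GPT-2.py may use different settings (device, batch size, reduction).

```python
# Hedged sketch: hypothesis probability P(w) of each candidate caption with
# lm-scorer (GPT-2). model/LM-GPT-2.py may use different settings.
from lm_scorer.models.auto import AutoLMScorer as LMScorer

scorer = LMScorer.from_pretrained("gpt2", device="cpu", batch_size=1)

captions = [
    "a city street filled with traffic at night",
    "a city street covered in snow at night",
]
for caption in captions:
    # Sentence probability as the product of token probabilities under GPT-2.
    print(caption, scorer.sentence_score(caption, reduce="prod"))
```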
Second, we need the visual context from the image, and thus we need visual classifiers
python model/run-visual.py
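As a sketch of this step (model/run-visual.py may use different classifiers and preprocessing), an off-the-shelf ResNet-152 from torchvision can provide the visual context labels and their confidences:

```python
# Hedged sketch: top-3 visual context labels and confidences with torchvision's
# ResNet-152. model/run-visual.py may use different models and preprocessing.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet152_Weights.IMAGENET1K_V2
model = models.resnet152(weights=weights).eval()
preprocess = weights.transforms()

image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    probs = model(preprocess(image).unsqueeze(0)).softmax(dim=1)[0]

top = probs.topk(3)  # usable as VC1..VC3 with their probabilities P(c)
for p, idx in zip(top.values, top.indices):
    print(weights.meta["categories"][idx], float(p))
```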
Finally, we compute the relatedness between the two concepts (visual context and hypothesis).
Using fine-tuned BERT:
python BERT/train_model_VC.py
Or general-purpose SBERT with cosine similarity
conda create -n SBERT python=3.7 anaconda
conda activate SBERT
pip install sentence-transformers
python model/SBERT_model_VC.py
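A minimal sketch of this relatedness step is shown below; the checkpoint name is only an assumption, and model/SBERT_model_VC.py may use a different one.

```python
# Hedged sketch: caption-to-visual-context relatedness with a general-purpose
# SBERT model and cosine similarity. The checkpoint name is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

caption = "a city street filled with traffic surrounded by tall buildings"
visual_context = "traffic light"

# sim(w, c) used in the alpha exponent of the Belief Revision Score.
sim = util.cos_sim(model.encode(caption), model.encode(visual_context)).item()
print(sim)
```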
Then run the demo for Example 1/2 (below):
python model/Example_1/model.py --lm LM.txt --vis visual_context_lable.txt --vis_prob visual_context_prob.txt --c caption.txt
Note that each score is computed separately here (each score is in a separate file)
Go here for more details
Demo
Here are examples with the SBERT based model.
Example 1
<img align="center" width="400" height="200" src="example.jpg">Baseline beam = 5
a city street filled with traffic at night
a city street covered in snow at night
a city street covered in traffic at night time
a city street filled with traffic surrounded by tall buildings
a city street covered in traffic at night
Visual re-ranked beam search = 5
a city street filled with traffic surrounded by tall buildings
a city street filled with traffic and traffic lights
a city street filled with traffic surrounded by snow
a city street filled with traffic at night
a city street at night with cars and street lights
We re-ranked the best 5 beams from 9 candidate captions generated by the baseline, using the visual context information.
Example 2
<img align="center" width="400" height="200" src="example-2.jpg">Baseline beam = 5
a longhorn cow with horns standing in a field
two bulls standing next to each other
two bulls with horns standing next to each other
two bulls with horns standing next to each other
two bulls with horns standing next to each other
Visual re-ranked beam search = 5
two bulls standing next to each other
a couple of bulls standing next to each other
two bulls with horns standing next to each other
two long horn bulls standing next to each other
a longhorn cow with horns standing in a field
We re-ranked the best 5 beams from 20 candidate captions generated by the baseline, using the visual context information.
Visual Re-ranking with Negative Evidence
So far we have considered only the case where the visual context increases the belief in the hypothesis. However, the same work also proposes an approach for the case where the absence of evidence decreases the hypothesis probability.
$\text{P}(w \mid \neg c)=1-(1-\text{P}(w))^{\alpha}$
In our case, we introduce negative evidence in two ways: (1) objects that the object classifier (e.g. ResNet152) detects with very low confidence, and that are therefore unlikely to be present in the image, are used as negative evidence; and (2) the objects detected with high confidence are used as queries to pre-trained GloVe (840B) to retrieve closely related concepts, and the retrieved concepts that are not detected/present in the image are used as negative evidence.
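As a minimal sketch of the formula above (assuming alpha is computed as in the positive case, with the absent concept's prior in place of P(c)), the negative-evidence revision might look like this; names and numbers are illustrative only.

```python
# Hedged sketch of the negative-evidence revision; the exact alpha used by the
# repository may differ, this only mirrors the formula above.
def negative_evidence_score(p_w, p_neg, sim_w_neg):
    """Decrease the caption probability p_w given a related but absent concept.

    p_w       -- hypothesis probability of the caption (language model)
    p_neg     -- prior probability of the absent concept (its LM initialization)
    sim_w_neg -- similarity between the caption and the absent concept
    """
    alpha = ((1.0 - sim_w_neg) / (1.0 + sim_w_neg)) ** (1.0 - p_neg)
    return 1.0 - (1.0 - p_w) ** alpha

# The absence of a strongly related concept lowers the caption's probability:
# 0.05 -> ~0.04.
print(negative_evidence_score(p_w=0.05, p_neg=0.9, sim_w_neg=0.8))
```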
Example 1 with Negative Evidence
<img align="center" width="400" height="200" src="example.jpg">In this example, we will use the second method with a Pre-trained GloVe vector to extract the negative information related to the visual context but not detected in the image.
conda create -n GloVe python=3.8 anaconda
conda activate GloVe
pip install gensim==4.1.0
python Negative_Evidence-model/similar_vector.py
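For illustration, the neighbor retrieval could be sketched with gensim as below; Negative_Evidence-model/similar_vector.py uses GloVe 840B, whereas the smaller downloader model here is only a stand-in.

```python
# Hedged sketch: retrieving concepts close to a detected visual context with
# pre-trained GloVe vectors via gensim. The paper uses GloVe 840B; the smaller
# downloader model below is only a stand-in.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")

detected = "traffic"  # high-confidence visual context from the classifier
for concept, score in glove.most_similar(detected, topn=10):
    # Related concepts that are NOT detected in the image (e.g. "accident")
    # can serve as negative evidence.
    print(concept, score)
```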
The retrieved negative visual information is accident, which is used to decrease the hypothesis probability. Note that the negative visual information also needs to be initialized by common observation (i.e. the language model):
python model/LM-GPT-2.py
python Example_negtive_1/negative_evidence_SBERT.py --lm LM.txt --visNg negtive_visual.txt --visNg_init negtive_visual_init.txt --c caption.txt
Baseline beam = 5
a city street filled with traffic at night
a city street covered in snow at night
a city street covered in traffic at night time
a city street filled with traffic surrounded by tall buildings
a city street covered in traffic at night
Visual re-ranked beam search = 5 with negative evidence
a city street filled with traffic surrounded by tall buildings
a city street filled with traffic surrounded by snow
a city street filled with traffic and traffic lights
a city street filled with traffic at night
a city street at night with cars and street lights
We re-ranked the best 5 beams from 9 candidate captions generated by the baseline, using negative visual context information.
For more examples
Semantic Diversity Evaluation
Sentence-to-sentence semantic similarity for semantic diversity evaluation
Inspired by BERTscore, we propose a sentence-to-sentence semantic similarity score to compare candidate captions with human references. We employ a pre-trained Sentence-RoBERTa-L tuned for the general STS task. SBERT-sts uses a siamese network to derive meaningful sentence embeddings that can be compared via cosine similarity.
For more detail and other diversity evaluation
Example
Model | caption | BERTscore | SBERT-sts* | Human subject |
---|---|---|---|---|
B-best | two bulls with horns standing next to each other | 0.89 | 0.75 | 16.7 |
B+VR | two long horn bulls standing next to each other | 0.88 | <b>0.81</b> | <b>0.83</b> |
Human | a closeup of two red-haired bulls with long horns | — | — | — |
(*) max(sim(ref_k, candidate caption)), k = 5 human references
To find the cosine score
python SBERT-caption-eval/SBERT_eval_demo.py --ref ref_demo.txt --hyp hyp-demo_BeamS.txt (or --hyp hyp-demo_visual_re-ranked.txt)
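A minimal sketch of this evaluation (the maximum cosine similarity over the k references) is shown below; the checkpoint name is an assumption, as a stand-in for the Sentence-RoBERTa-L STS model used in the paper.

```python
# Hedged sketch: SBERT-sts as the maximum cosine similarity between a candidate
# caption and k human references. The checkpoint name is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-roberta-large-v1")  # stand-in for SRoBERTa-sts

references = [
    "a closeup of two red-haired bulls with long horns",
    # ... up to k = 5 human references per image
]
candidate = "two long horn bulls standing next to each other"

sims = util.cos_sim(model.encode(candidate), model.encode(references))
print(float(sims.max()))  # SBERT-sts score = max over the k references
```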
Cloze Probability based Belief Revision
Cloze probability is the probability that a given word will be produced to complete a given context in a sentence completion task (here, the last word).
The girl eats the **toast** --> low probability
The girl eats the **eggs** --> high probability
cloze_prob('The girl eats the toast')
0.0004592319609081579
cloze_prob('The girl eats the eggs')
0.00436875504749275
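A minimal sketch of how such a cloze probability can be obtained with GPT-2 is shown below; for simplicity it scores only the final token, whereas the repository's cloze_prob implementation may handle words that span several tokens differently.

```python
# Hedged sketch of a cloze probability with GPT-2: the probability of the last
# token given the preceding context. cloze_prob/ in this repository may differ
# (e.g. handling multi-token final words).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def cloze_prob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # logits[:, i] predicts token i+1; take the distribution over the last token.
    log_probs = logits[0, :-1].log_softmax(dim=-1)
    return float(log_probs[-1, ids[0, -1]].exp())

print(cloze_prob("The girl eats the toast"))
print(cloze_prob("The girl eats the eggs"))
```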
For caption re-ranking
cloze_prob('a city street at night with cars and street lamps')
0.12100567314071672
cloze_prob('a city street filled with traffic and traffic lights')
0.40925383021394385
Cloze Probability based Belief Revision
a city street at night with cars and street lamps 0.18269163340435274
a city street filled with traffic and traffic lights 3.0824517390664777e-16
The first sentence is more diverse and avoids repeating the word traffic.
To run this,
python cloze_prob/model_coze.py --c caption.txt --vis visual_context_label.txt --vis_prob visual_context_prob.txt
Go here for more details and code
Other Task: Sentence Semantic Similarity
There are two advantages of using Belief Revision for sentence semantic similarity tasks:
- Belief_revision_score balances the high similarity score using human-inspired logic understanding. Cosine similarity alone is not a reliable score in some scenarios, as it only measures the angle between vectors in the semantic space.
- The output is a probability, so it can be re-ranked or combined with another score or classifier (e.g. Products of Experts), which is not feasible with a raw cosine distance (see the sketch after this list).
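To illustrate the second point, here is one simple way two probabilistic scores could be combined (a two-expert product with binary normalization); this is only an illustration, not the paper's exact combination scheme.

```python
# Hedged sketch: combining the belief revision probability with another
# probabilistic score as a two-expert product (binary normalization).
def product_of_experts(p1: float, p2: float) -> float:
    joint = p1 * p2
    return joint / (joint + (1.0 - p1) * (1.0 - p2))

belief_revision = 0.557584688720967  # score from the example further below
other_score = 0.70                   # hypothetical second expert
print(product_of_experts(belief_revision, other_score))  # ~0.75
```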
For a quick start, and for comparison with other similarity-based approaches:
conda create --name BR-S python=3.7
# test with sentence_transformers-2.2.0
pip install sentence_transformers
run
python sent_model/model_sent.py --sent sent.txt --context_sent context_sent.txt --output score.txt
Example:
sent = 'Obama speaks to the media in Illinois'
context_sentence = 'The president greets the press in Chicago'
The two sentences are related but not similar, and the belief revision score captures this relatedness better than semantic similarity.
# SBERT cosine
Cosine = 0.62272817
# belief_revision score
belief_revision = 0.557584688720967
Citation
The details of this repo are described in the following paper. If you find this repo useful, please kindly cite it:
@article{sabir2022belief,
title={Belief Revision based Caption Re-ranker with Visual Semantic Information},
author={Sabir, Ahmed and Moreno-Noguer, Francesc and Madhyastha, Pranava and Padr{\'o}, Llu{\'\i}s},
journal={arXiv preprint arXiv:2209.08163},
year={2022}
}
Acknowledgement
The implementation of the Belief Revision Score relies on resources from <a href="https://github.com/simonepri/lm-scorer">lm-scorer</a>, <a href="https://github.com/huggingface/transformers">Huggingface Transformers</a>, and <a href="https://www.sbert.net/">SBERT</a>. We thank the original authors for their well-organized codebases.