
Visual Semantic Relatedness Dataset for Image Captioning

<!-- <img align="right" width="400" height="300" src="overview.png"> Modern image captioning relies heavily on extracting knowledge, from images such as objects, to capture the concept of a static story in the image. In this paper, we propose a textual visual context dataset for image captioning, where the publicly available dataset COCO Captions [(Lin et al., 2014)](https://arxiv.org/pdf/1405.0312.pdf) has been extended with information about the scene (such as objects in the image). Since this information has textual form, it can be used to leverage any NLP task, such as text similarity or semantic relation methods, into captioning systems, either as an end-to-end training strategy or a post-processing based approach. --> <img src="main.png" align="right" width="600"/>

Modern image captioning relies heavily on extracting knowledge from images, such as objects, to capture the concept of a static story in the image. In this paper, we propose a textual visual context dataset for image captioning, where the publicly available dataset COCO Captions (Lin et al., 2014) has been extended with information about the scene (such as objects in the image). Since this information has textual form, it can be used to leverage any NLP task, such as text similarity or semantic relation methods, into captioning systems, either as an end-to-end training strategy or a post-processing based approach.

This repository contains the implementation of the paper Visual Semantic Relatedness Dataset for Image Captioning.


<!--[![YouTube - spotlight talk](https://img.shields.io/badge/YouTube-spotlight_talk-red)](https://youtu.be/-br99Q--bxM)-->

News

Added v2 with the recent SoTA SwinV2 classifier, for both soft- and hard-label visual_caption_cosine_score_v2 with the person label (thresholds 0.2, 0.3 and 0.4). Please refer to the Hugging Face repository.

Contents

  1. Overview
  2. Visual semantic with BERT
  3. Dataset
  4. Visual semantic with pre-trained model
  5. Evaluation
  6. Citation

Overview

<img align="right" width="300" height="280" src="LRCE_figure_1.png">

We enrich COCO-Captions with textual visual context information. We use ResNet152, CLIP and Faster R-CNN to extract object information for each COCO-Captions image. We use three filtering approaches to ensure the quality of the dataset: (1) Threshold: filter out predictions where the object classifier is not confident enough; (2) Semantic alignment: use semantic similarity to remove duplicated objects; (3) Semantic relatedness score as soft label: to guarantee that the visual context and the caption are strongly related, we use Sentence RoBERTa-sts to produce a soft label via cosine similarity, and then apply a threshold to assign the final label (if the cosine score ≥ th, with th ∈ {0.2, 0.3, 0.4}, the pair is labeled as related). Finally, to take advantage of the overlap between the visual context and the caption, and to extract global information from each visual, we use BERT followed by a shallow CNN (Kim, 2014) to estimate the visual relatedness score.
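
As an illustration of the soft-label step above, here is a minimal sketch using the sentence-transformers library; the model name (stsb-roberta-large), the example pair, and the 0.4 threshold are illustrative assumptions rather than the repo's exact configuration.

```python
# Minimal sketch of the soft-label step: SBERT cosine similarity between a
# caption and its visual context, thresholded into a binary relatedness label.
# Model name and the 0.4 threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stsb-roberta-large')  # Sentence RoBERTa tuned on STS

caption = 'a plate with a hamburger fries and tomatoes'
visual_context = 'cheeseburger'

emb_caption = model.encode(caption, convert_to_tensor=True)
emb_visual = model.encode(visual_context, convert_to_tensor=True)

cosine_score = util.cos_sim(emb_caption, emb_visual).item()  # soft label
hard_label = 1 if cosine_score >= 0.4 else 0                 # threshold in {0.2, 0.3, 0.4}

print(cosine_score, hard_label)
```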

Quick Start

For a quick start, please have a look at this project page and the Demo.

<!-- [![Made withJupyter](https://img.shields.io/badge/Made%20with-Jupyter-orange?style=for-the-badge&logo=Jupyter)](https://github.com/ahmedssabir/Textual-Visual-Semantic-Dataset/blob/main/BERT_CNN_Visual_re_ranker_demo.ipynb) -->

Dataset

Sample

| VC1 | VC2 | VC3 | human annotated caption |
| --- | --- | --- | --- |
| cheeseburger | plate | hotdog | a plate with a hamburger fries and tomatoes |
| bakery | dining table | website | a table having tea and a cake on it |
| gown | groom | apron | its time to cut the cake at this couples wedding |

Download

  1. Download Raw data with ID and Visual context -> original dataset with the related caption IDs from train2014
  2. Download Data with cosine score -> soft cosine label with thresholds 0.2, 0.3, 0.4 and 0.5, and the hard label
  3. Download Overlapping visual with caption -> overlap between the visual context and the human annotated caption
  4. Download Dataset (tsv file) 0.0 -> raw data with hard label, without cosine similarity, and with threshold cosine similarity (degree of relation between the visual and the caption) = 0.2, 0.3, 0.4 (a loading sketch follows this list)
  5. Download Dataset GenderBias -> man/woman replaced with the person class label
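
To get a feel for the files above, here is a minimal sketch for loading one of the tsv splits with pandas; the file name and column names are assumptions, so check the actual header of the file you downloaded.

```python
# Minimal sketch: inspect one of the downloaded tsv files with pandas.
# File name and column names are assumptions; adjust to the actual file.
import pandas as pd

df = pd.read_csv('train_0.4.tsv', sep='\t', header=None,
                 names=['visual_context', 'caption', 'label'])
print(df.head())
print(df['label'].value_counts())
```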

Visual semantic with BERT-CNN

Fine-tune BERT on the created dataset.

Requirements

conda create -n BERT_visual python=3.6 anaconda
conda activate BERT_visual
pip install tensorflow==1.15.0
pip install --upgrade tensorflow_hub==0.7.0

Download the BERT checkpoint uncased_L-12_H-768_A-12

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
git clone https://github.com/gaphex/bert_experimental/

Place them like this: BERT-CNN/uncased_L-12_H-768_A-12 and BERT-CNN/bert_experimental

Download dataset

wget https://www.dropbox.com/s/dh38xibtjpohbeg/train_all.zip
unzip train_all.zip

For training:

parser.add_argument('--train', default='train.tsv', help='training data (tsv file)', type=str, required=False)
parser.add_argument('--num_bert_layer', default='12', help='number of tuned BERT layers', type=int, required=False)
parser.add_argument('--batch_size', default='128', help='batch size', type=int, required=False)
parser.add_argument('--epochs', default='5', help='number of training epochs', type=int, required=False)
parser.add_argument('--seq_len', default='64', help='maximum sequence length', type=int, required=False)
parser.add_argument('--CNN_kernel_size', default='3', help='CNN kernel size', type=int, required=False)
parser.add_argument('--CNN_filters', default='32', help='number of CNN filters', type=int, required=False)
python BERT_CNN.py --train /train_0.4.tsv --epochs 5
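
The actual BERT_CNN.py builds on TensorFlow 1.15 and bert_experimental; purely as a conceptual sketch, the shallow CNN head (Kim, 2014) wired to the hyper-parameters above could look roughly like this in tf.keras (BERT token embeddings are assumed as input, and everything else is illustrative, not the repo's implementation).

```python
# Conceptual sketch only (not the repo's BERT_CNN.py): a Kim (2014) style
# 1D-CNN relatedness head on top of BERT token embeddings.
import tensorflow as tf

seq_len, emb_dim = 64, 768          # --seq_len and BERT-base hidden size
kernel_size, filters = 3, 32        # --CNN_kernel_size and --CNN_filters

inputs = tf.keras.Input(shape=(seq_len, emb_dim))             # BERT token embeddings
x = tf.keras.layers.Conv1D(filters, kernel_size, activation='relu')(inputs)
x = tf.keras.layers.GlobalMaxPooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)   # visual relatedness score

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```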

For inference only, download the pre-trained model:

wget https://www.dropbox.com/s/ip7p0wiwkwvph5k/0.4_bert-cnn.zip
unzip 0.4_bert-cnn.zip
python eval.py --testset test_demo.tsv --model 0.4_bert-cnn/frozen_graph.pb

Example

Re-rank the most related caption to the image using the visual context information.

<img align="center" width="400" height="300" src="COCO_val2014_000000000042.jpg">

| visual information | candidate caption (beam search) | score |
| --- | --- | --- |
| standard poodle shopping cart footwear | a close up of shoes and a dog in a basket | 0.99774158 |
| standard poodle shopping cart footwear | a brown teddy bear laying on top of a pair of shoes | 0.0621758029 |

Visual semantic with pre-trained model

<!-- <img align="right" width="300" height="100" src="Pre-trained.png"> --> <img align="right" width="350" height="130" src="Pre-trained.png">

Although this approach is proposed to take advantage of the dataset (e.g. a visual semantic model), we also investigate the use of out-of-the-box tools to estimate the relatedness score between the short text (i.e. the caption) and its environmental visual context (we call it the visual classifier).

For this, we follow a similarity-to-probability based approach, but we use only the cosine similarity from a pre-trained model and the top-3 averaged probability (confidence) from the object classifier:

<!-- <img src="https://render.githubusercontent.com/render/math?math=\text{P}(w \mid c)=\text{}sim(w,c)^{\text{P}(c)}"> -->

$\text{P}(w \mid c)=\text{sim}(w,c)^{\text{P}(c)}$, where the main components of the visual semantic re-ranker are (a minimal sketch follows the list):

  1. Similarity/relatedness between the caption and the object context, $\text{sim}(w,c)$
  2. $\text{P}(c)$: the classifier's confidence for the object in the image, $\text{P}(w \mid \text{object})$
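
As a minimal sketch of the formula above (the SBERT model name and the confidence value are illustrative assumptions):

```python
# Minimal sketch of the visual semantic re-ranker: P(w|c) = sim(w,c)^P(c),
# where sim(w,c) is the SBERT cosine similarity between caption and visual
# context, and P(c) is the top-3 averaged confidence of the object classifier.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stsb-roberta-large')  # illustrative choice

caption = 'a close up of shoes and a dog in a basket'
visual_context = 'standard poodle shopping cart footwear'
classifier_confidence = 0.62  # example P(c) value, not a real classifier output

sim = util.cos_sim(model.encode(caption, convert_to_tensor=True),
                   model.encode(visual_context, convert_to_tensor=True)).item()

relatedness = max(sim, 0.0) ** classifier_confidence  # P(w|c); clamp negative cosines
print(relatedness)
```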

with Pre-trained SBERT

 python model.py --vis visual-context_label.txt --vis_prob visual-context_prob.txt --c caption.txt

Please refer to this repository for more information about the pre-trained visual re-ranker (probability from similarity).

Evaluation

Install pycocoevalcap

pip install pycocoevalcap

Then run

python Evaluation/coco_eval.py --f Result_tune_BERT_0.4.json
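
Evaluation/coco_eval.py handles the Result_*.json format; purely as a standalone sketch, pycocoevalcap's scorers can also be called directly on toy data (the captions below are made up):

```python
# Standalone sketch of pycocoevalcap scorers on toy data (illustrative only).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

gts = {'img1': ['a plate with a hamburger fries and tomatoes']}   # references
res = {'img1': ['a plate of food with fries and a burger']}       # candidate

bleu, _ = Bleu(4).compute_score(gts, res)    # BLEU-1..4
cider, _ = Cider().compute_score(gts, res)
print(bleu, cider)
```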

For more evaluation (Lexical and Semantic Diversity)

<!-- ## Synthetic + Real caption dataset For future work, we plan to extract the visual context from the caption (without using a visual classifier) and estimate the visual relatedness score by employing unsupervised learning (i.e. contrastive learning). (work in progress) Feel free to download the training data 1. [Download CC](https://www.dropbox.com/s/pc1uv2rf6nqdp57/CC_caption_40.txt.zip) -> Caption dataset from [Conceptual Captions](https://github.com/google-research-datasets/conceptual-captions) (CC) 2M (2255927 captions) 2. [Download CC+wiki](https://www.dropbox.com/s/xuov24on8477zg8/All_Caption_ID.csv?dl=0) -> CC+1M-wiki 3M (3255928) 3. [Download CC+wiki+COCO](https://www.dropbox.com/s/k7oqwr9a1a0h8x1/CC_caption_40%2Bwiki%2BCOCO.txt.zip) -> CC+wiki+COCO-Caption 3.5M (366984) 4. [Download COCO-caption+wiki](https://www.dropbox.com/s/wc4k677wp24kzhh/COCO%2Bwiki.txt.zip) -> COCO-caption +wiki 1.4M (1413915) 5. [Download COCO-caption+wiki+CC+8Mwiki](https://www.dropbox.com/s/xhfx32sjy2z5bpa/11M_wiki_7M%2BCC%2BCOCO.txt.zip) -> COCO-caption+wiki+CC+8Mwiki 11M (11541667) -->

Citation

The details of this repo are described in the following paper. If you find this repo useful, please kindly cite it:

@article{sabir2023visual,
  title={Visual Semantic Relatedness Dataset for Image Captioning},
  author={Sabir, Ahmed and Moreno-Noguer, Francesc and Padr{\'o}, Llu{\'\i}s},
  journal={arXiv preprint arXiv:2301.08784},
  year={2023}
}