Visual-semantic-embedding

PyTorch code for the image-sentence ranking methods from Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models (Kiros, Salakhutdinov, Zemel, 2014).

Images and sentences are mapped into a common vector space, where the sentence representation is computed with an LSTM. This project contains training code and pre-trained models for Flickr8K, Flickr30K, and MS COCO.
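As a rough illustration, the core of such a model is an image encoder, a sentence encoder, and a pairwise ranking loss over a batch of matching pairs. The sketch below is a minimal modern-PyTorch version of that idea; layer sizes and names are illustrative assumptions, not the exact code in this repository.

# Minimal sketch of the joint embedding idea (illustrative only; feature and
# embedding dimensions below are assumptions, not this repo's exact settings).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=1000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)   # project VGG features
    def forward(self, feats):
        return F.normalize(self.fc(feats), dim=1)  # unit-norm image embedding

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, word_dim=1000, embed_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, embed_dim, batch_first=True)
    def forward(self, captions):
        _, (h, _) = self.lstm(self.embed(captions))
        return F.normalize(h[-1], dim=1)           # last hidden state as sentence vector

def pairwise_ranking_loss(im, s, margin=0.2):
    # cosine scores between every image and every sentence in the batch;
    # matching pairs sit on the diagonal
    scores = im @ s.t()
    diagonal = scores.diag().view(-1, 1)
    cost_s = (margin + scores - diagonal).clamp(min=0)        # image as query
    cost_im = (margin + scores - diagonal.t()).clamp(min=0)   # sentence as query
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    return cost_s.sum() + cost_im.sum()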

  1. Thanks to ryankiros for the original Theano implementation of the model: https://github.com/ryankiros/visual-semantic-embedding

Results

Below are tables of results obtained with the code from this repository, compared with the numbers reported in the paper. aR@K is Recall@K for image annotation (higher is better), sR@K is Recall@K for image search (higher is better), and Medr is the median rank of the closest ground truth (lower is better).
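For reference, here is a rough sketch of how Recall@K and the median rank can be computed from an image-sentence score matrix, simplified to one ground-truth caption per image (the datasets actually provide five captions per image):

# Recall@K and median rank from a score matrix (simplified sketch).
import numpy as np

def recall_at_k(scores, ks=(1, 5, 10)):
    # scores[i, j] = similarity between image i and sentence j,
    # with the ground-truth pairs on the diagonal.
    n = scores.shape[0]
    ranks = np.zeros(n)
    for i in range(n):
        order = np.argsort(scores[i])[::-1]    # best-scoring sentence first
        ranks[i] = np.where(order == i)[0][0]  # rank of the ground truth
    recalls = {k: 100.0 * np.mean(ranks < k) for k in ks}
    medr = np.median(ranks) + 1
    return recalls, medr

Running this on the rows of the score matrix gives the annotation metrics (aR@K); running it on the transposed matrix gives the search metrics (sR@K).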

All experiments below were run on a single GTX TITAN.

Flickr8K

learning_rate: 0.001, batch_size: 128, validFreq: 100, dim&dim_word: 1000

| Method | aR@1 | aR@5 | aR@10 | aMedr | sR@1 | sR@5 | sR@10 | sMedr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Paper | 18.0 | 40.9 | 55.0 | 8 | 12.5 | 37.0 | 51.5 | 10 |
| This project | 23.9 | 49.1 | 61.3 | 6 | 16.9 | 41.8 | 54.4 | 9 |

Flickr30K

learning_rate: 0.01, batch_size: 200, validFreq: 100, dim&dim_word: 1000

| Method | aR@1 | aR@5 | aR@10 | aMedr | sR@1 | sR@5 | sR@10 | sMedr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Paper | 23.0 | 50.7 | 62.9 | 5 | 16.8 | 42.0 | 56.5 | 8 |
| This project | 29.0 | 57.7 | 67.3 | 4 | 21.5 | 48.0 | 59.0 | 6 |

Training used about 2 GB of GPU memory and took 485 s.

MSCOCO

learning_rate: 0.01, batch_size: 300, validFreq: 100, dim&dim_word: 1000

| Method | aR@1 | aR@5 | aR@10 | aMedr | sR@1 | sR@5 | sR@10 | sMedr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| This project | 38.1 | 72.7 | 84.7 | 2 | 31.7 | 67.9 | 81.2 | 3 |

Training used about 2 GB of GPU memory and took 1180 s.

Dependencies

This code is written in Python. To use it you will need PyTorch and torchvision, for example installed in a virtualenv:

$ virtualenv env
$ source env/bin/activate
$ pip install torch-0.1.11.post5-cp27-none-linux_x86_64.whl
$ pip install torchvision

Getting started

You will first need to download the dataset files and pre-trained models. These can be obtained by running:

wget http://www.cs.toronto.edu/~rkiros/datasets/f8k.zip
wget http://www.cs.toronto.edu/~rkiros/datasets/f30k.zip
wget http://www.cs.toronto.edu/~rkiros/datasets/coco.zip
wget http://www.cs.toronto.edu/~rkiros/models/vse.zip

Each of the dataset files contains the captions as well as VGG features from the 19-layer model. Flickr8K comes with a pre-defined train/dev/test split, while for Flickr30K and MS COCO we use the splits produced by Andrej Karpathy. Note that the original images are not included with the dataset.
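Assuming the zip files follow the layout of the original Theano release (a *_caps.txt file with one caption per line and a matching *_ims.npy array of VGG features for each split), a split can be loaded roughly like this:

# Sketch of loading one split; the file naming below is an assumption based on
# the original Theano release, not something guaranteed by this repository.
import numpy as np

def load_split(name='f8k', split='train'):
    with open('%s_%s_caps.txt' % (name, split)) as f:
        captions = [line.strip() for line in f]        # one caption per line
    images = np.load('%s_%s_ims.npy' % (name, split))  # VGG19 features
    return captions, images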

Training new models

Open test.py and specify the hyperparameters that you would like; the ones used in the experiments above are learning_rate, batch_size, validFreq, and dim/dim_word.

As the model trains, it will periodically evaluate on the development set (validFreq) and re-save the model each time performance on the development set increases. Generally you shouldn't need more than 15-20 epochs of training on any of the datasets. Once the models are saved, you can load and evaluate them in the same way as the pre-trained models.
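The validate-and-save logic described above boils down to something like the following sketch, where train_step and evaluate are hypothetical placeholders standing in for this repository's actual training and evaluation functions:

# Sketch of periodic validation with best-model checkpointing; train_step and
# evaluate are hypothetical callables, not functions from this repo.
import torch

def fit(model, batches, dev_set, train_step, evaluate, validFreq=100,
        path='best_model.pth'):
    best = 0.0
    for step, batch in enumerate(batches):
        train_step(model, batch)                  # one gradient update
        if step % validFreq == 0:
            score = evaluate(model, dev_set)      # e.g. sum of recalls on dev
            if score > best:                      # re-save only on improvement
                best = score
                torch.save(model.state_dict(), path)
    return best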

Others

This project is a simplified version of another project of mine, ImageTextRetrieval, which is used to retrieve architecture images and text.