Multimodal Transformers

This code runs inference with the multimodal transformer models described in "Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers". Our models can be used to score whether an image and text pair match. Please see our paper for more details. This code release consists of a colab which extracts image and language features and feeds them into our transformer models. The transformer models themselves are hosted on TF Hub.

Please see the tables below for details of the models we have released via TF Hub.

| Name | Training Dataset | ITM | MRM | MLM | Heads | Layers | Att. Type | FineTuned | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| data_cc (base) | Conceptual Captions | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_sbu | SBU | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_vg | Visual Genome | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_mscoco | MSCOCO | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_mscoco-narratives | MSCOCO Narratives | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_oi-narratives | OI Narratives | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_combined-instance | All (instance sampling) | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_combined-dataset | All (dataset sampling) | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_uniter-instance | Uniter datasets (instance sampling) | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_uniter-dataset | Uniter datasets (dataset sampling) | Classification | Y | Y | 12 | 6 | Merged | N | |
| data_cc-with-bert | Conceptual Captions | Classification | Y | Y | 12 | 6 | Merged | N | Language initialised with BERT |
| loss_itm_mrm | Conceptual Captions | Classification | Y | N | 12 | 6 | Merged | N | |
| loss_itm_mlm | Conceptual Captions | Classification | N | Y | 12 | 6 | Merged | N | |
| loss_single-modality-contrastive32 | Conceptual Captions | Contrastive | Y | Y | 12 | 6 | Sing. Modality | N | |
| loss_single-modality-contrastive1024 | Conceptual Captions | Contrastive | Y | Y | 12 | 6 | Sing. Modality | N | |
| loss_v1-contrastive32 | Conceptual Captions | Contrastive | Y | Y | 12 | 1 | Merged | N | |
| architecture_heads1-768 | Conceptual Captions | Classification | Y | Y | 1 | 6 | Merged | N | |
| architecture_heads3-256 | Conceptual Captions | Classification | Y | Y | 3 | 6 | Merged | N | |
| architecture_heads6-64 | Conceptual Captions | Classification | Y | Y | 6 | 6 | Merged | N | |
| architecture_heads18-64 | Conceptual Captions | Classification | Y | Y | 18 | 6 | Merged | N | |
| architecture_vilbert-1block | Conceptual Captions | Classification | Y | Y | 12 | 1 | Merged | N | |
| architecture_vilbert-2block | Conceptual Captions | Classification | Y | Y | 12 | 2 | Merged | N | |
| architecture_vilbert-4block | Conceptual Captions | Classification | Y | Y | 12 | 4 | Merged | N | |
| architecture_vilbert-12block | Conceptual Captions | Classification | Y | Y | 12 | 12 | Merged | N | |
| architecture_single-modality | Conceptual Captions | Classification | Y | Y | 12 | 6 | Sing. Modality | N | |
| architecture_mixed-modality | Conceptual Captions | Classification | Y | Y | 12 | 6 | Mix Modality | N | 5 single modality layers and 1 merged layer |
| architecture_single-stream | Conceptual Captions | Classification | Y | Y | 12 | 6 | Single Stream | N | |
| architecture_language-q-12 | Conceptual Captions | Classification | Y | Y | 12 | 6 | Asymmetric (language) | N | |
| architecture_image-q-12 | Conceptual Captions | Classification | Y | Y | 12 | 6 | Asymmetric (image) | N | |
| architecture_language-q-24 | Conceptual Captions | Classification | Y | Y | 24 | 6 | Asymmetric (language) | N | |
| architecture_image-q-24 | Conceptual Captions | Classification | Y | Y | 24 | 6 | Asymmetric (image) | N | |
| architecture_single-modality-hloss | Conceptual Captions | Classification | Y | Y | 12 | 6 | Sing. Modality | N | Includes ITM loss after every layer |
| data-ft_sbu | SBU | Classification | Y | Y | 12 | 6 | Merged | Y | |
| data-ft_vg | Visual Genome | Classification | Y | Y | 12 | 6 | Merged | Y | |
| data-ft_mscoco | MSCOCO | Classification | Y | Y | 12 | 6 | Merged | Y | |
| data-ft_mscoco-narratives | MSCOCO Narratives | Classification | Y | Y | 12 | 6 | Merged | Y | |
| data-ft_oi-narratives | OI Narratives | Classification | Y | Y | 12 | 6 | Merged | Y | |
| data-ft_cc | Conceptual Captions | Classification | Y | Y | 12 | 6 | Merged | Y | |
| data-ft_combined-instance | All (instance sampling) | Classification | Y | Y | 12 | 6 | Merged | Y | |
| data-ft_combined-dataset | All (dataset sampling) | Classification | Y | Y | 12 | 6 | Merged | Y | |
| data-ft_uniter-instance | Uniter datasets (instance sampling) | Classification | Y | Y | 12 | 6 | Merged | Y | |
| data-ft_uniter-dataset | Uniter datasets (dataset sampling) | Classification | Y | Y | 12 | 6 | Merged | Y | |
| architecture-ft_single-modality | Conceptual Captions | Classification | Y | Y | 12 | 6 | Sing. Modality | Y | |
| architecture-ft_single-stream | Conceptual Captions | Classification | Y | Y | 12 | 6 | Single Stream | Y | |
| architecture-ft_language-q-12 | Conceptual Captions | Classification | Y | Y | 12 | 6 | Asymmetric (language) | Y | |
| architecture-ft_image-q-12 | Conceptual Captions | Classification | Y | Y | 12 | 6 | Asymmetric (image) | Y | |
| architecture-ft_language-q-24 | Conceptual Captions | Classification | Y | Y | 24 | 6 | Asymmetric (language) | Y | |
| architecture-ft_image-q-24 | Conceptual Captions | Classification | Y | Y | 24 | 6 | Asymmetric (image) | Y | |

In addition to our transformer models, we also release our baseline models, detailed in the table below:

| Name | ITM | BERT Initialisation | FineTuned |
| --- | --- | --- | --- |
| baseline_baseline | Contrastive | Yes | N |
| baseline_baseline-cls | Classification | No | N |
| baseline_baseline-no-bert-transfer | Contrastive | No | N |
| baseline-ft_baseline | Contrastive | Yes | Y |
| baseline-ft_baseline-cls | Classification | No | Y |
| baseline-ft_baseline-no-bert-transfer | Contrastive | No | Y |
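
All of the models above are hosted on TF Hub. Judging from the usage example below, a model's name appears to map to a handle of the form `https://tfhub.dev/deepmind/mmt/<name>/1`; the snippet below is a minimal sketch of loading a model by its table name under that assumption.

```python
import tensorflow_hub as hub

# Assumed handle pattern, inferred from the architecture-ft_image-q-12
# example in the Usage section; substitute any model name from the tables.
MMT_HANDLE = 'https://tfhub.dev/deepmind/mmt/{name}/1'


def load_mmt(name):
  """Loads one of the released models from TF Hub by its table name."""
  return hub.load(MMT_HANDLE.format(name=name))


model = load_mmt('data-ft_cc')
```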

Installation

You do not need to install anything! You should be able to run all code from our released colab.

Usage

You can run an image and text pair through our module to check whether they match. First, load the model from TF Hub:

```python
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

model = hub.load('https://tfhub.dev/deepmind/mmt/architecture-ft_image-q-12/1')
```

Inference:

```python
output = model.signatures['default'](**inputs)
score = tf.nn.softmax(output['output']).numpy()[0]
```

where `score` indicates whether the image and text match (1 indicates a perfect match). `inputs` is a dictionary of pre-processed image and text features; please see our colab for the exact input keys and pre-processing details. You will need to use the detector released in our colab for good results.
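
As an illustration of how the match score can be used, the sketch below ranks a set of candidate captions against a single image. Here `make_inputs` is a hypothetical placeholder for the pre-processing done in our colab (detector-based image features plus tokenised text); it is not part of this release.

```python
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

model = hub.load('https://tfhub.dev/deepmind/mmt/architecture-ft_image-q-12/1')


def match_score(inputs):
  """Returns the image-text match score for one pre-processed pair."""
  output = model.signatures['default'](**inputs)
  return tf.nn.softmax(output['output']).numpy()[0]


def rank_captions(image, captions, make_inputs):
  """Scores each caption against the image and returns them best first.

  `make_inputs` is a hypothetical stand-in for the feature extraction done in
  our colab; it should return the `inputs` dictionary expected by the model.
  """
  scored = [(match_score(make_inputs(image, caption)), caption)
            for caption in captions]
  return sorted(scored, key=lambda pair: pair[0], reverse=True)
```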

Citing this work

If you use this model in your research, please cite:

[1] Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, TACL 2021.

Disclaimer

This is not an official Google product.