Multimodal Transformer for Unaligned Multimodal Language Sequences

PyTorch implementation of the Multimodal Transformer (MulT) for unaligned multimodal language sequences.

Correspondence to:

Yao-Hung Hubert Tsai (yaohungt@cs.cmu.edu)
Shaojie Bai (shaojieb@andrew.cmu.edu)

Paper

Multimodal Transformer for Unaligned Multimodal Language Sequences
Yao-Hung Hubert Tsai *, Shaojie Bai *, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov
Association for Computational Linguistics (ACL), 2019. (*equal contribution)

Please cite our paper if you find our work useful for your research:

@inproceedings{tsai2019MULT,
  title={Multimodal Transformer for Unaligned Multimodal Language Sequences},
  author={Tsai, Yao-Hung Hubert and Bai, Shaojie and Liang, Paul Pu and Kolter, J. Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month = {7},
  year={2019},
  address = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
}

Overview

Overall Architecture for Multimodal Transformer

Multimodal Transformer (MulT) merges multimodal time series via a feed-forward fusion process over multiple directional pairwise crossmodal transformers. Specifically, each crossmodal transformer repeatedly reinforces a target modality with the low-level features of a source modality by learning attention across the two modalities' features. A MulT architecture hence models all pairs of modalities with such crossmodal transformers, followed by sequence models (e.g., self-attention transformers) that make predictions from the fused features.

Crossmodal Attention for Two Sequences from Distinct Modalities

The core components of our proposed model are the crossmodal transformer and the crossmodal attention module.
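To make the idea concrete, here is a minimal, dependency-free sketch of single-head crossmodal attention in plain Python. It is not the code in this repository (which is written in PyTorch); the function names, the weight matrices `w_q`, `w_k`, `w_v`, and the toy inputs are all assumptions made for illustration. The point it shows is that queries come from the target modality while keys and values come from the source modality, so the two sequences may have different lengths and feature sizes.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target, source, w_q, w_k, w_v):
    """Attend from `target` (T_tgt x d_tgt) to `source` (T_src x d_src).

    Hypothetical helper: queries are projected from the target modality,
    keys and values from the source modality.
    Returns (output of shape T_tgt x d_v, attention weights of shape T_tgt x T_src).
    """
    q = matmul(target, w_q)  # T_tgt x d_k
    k = matmul(source, w_k)  # T_src x d_k
    v = matmul(source, w_v)  # T_src x d_v
    scale = math.sqrt(len(k[0]))
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]  # each row sums to 1 over source steps
    return matmul(weights, v), weights


if __name__ == "__main__":
    # Toy, unaligned inputs: 2 target steps (2 features), 4 source steps (3 features).
    target = [[1.0, 0.0], [0.0, 1.0]]
    source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
    w_q = [[1.0, 0.0], [0.0, 1.0]]
    w_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    w_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    out, weights = crossmodal_attention(target, source, w_q, w_k, w_v)
    print(len(out), "target steps, each attending over", len(weights[0]), "source steps")
```

The output keeps the length of the target sequence, which is why no explicit alignment between the modalities is needed in this sketch. A full crossmodal transformer would add multiple heads, residual connections, layer normalization, and feed-forward sublayers on top of this operation.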

Usage

Prerequisites

Python 3.6/3.7
PyTorch (>=1.0.0) and torchvision
CUDA 10.0 or above
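If you want to check your environment against this list before running anything, a small script along the following lines may help. It is only a sketch and not part of this repository: the function name `check_prerequisites` is hypothetical, and treating any Python version from 3.6 upward as acceptable is an assumption (only 3.6 and 3.7 are listed above).

```python
import importlib.util
import sys


def check_prerequisites():
    """Summarise the local environment (hypothetical helper, not in this repo)."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        # Assumption: versions newer than the listed 3.6/3.7 are accepted here.
        "python_ok": sys.version_info[:2] >= (3, 6),
        "torch": None,
        "torchvision_installed": importlib.util.find_spec("torchvision") is not None,
        "cuda": None,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = torch.__version__
        # Report the CUDA version only when a usable GPU runtime is detected.
        if torch.cuda.is_available():
            report["cuda"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for key, value in check_prerequisites().items():
        print(f"{key}: {value}")
```

A `None` value for `torch` or `cuda` means the package or a usable GPU runtime was not found; compare the printed versions with the list above by eye.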

Datasets

Data files (containing processed MOSI, MOSEI and IEMOCAP datasets) can be downloaded from here.

I personally used the command line to download everything:

wget 'https://www.dropbox.com/sh/hyzpgx1hp9nj37s/AADfY2s7gD_MkR76m03KS0K1a/Archive.zip?dl=1'
mv 'Archive.zip?dl=1' Archive.zip
unzip Archive.zip

To retrieve the meta information and the raw data, please refer to the SDK for these datasets.

Run the Code

Create (empty) folders for data and pre-trained models:

mkdir data pre_trained_models

and put the downloaded data in 'data/'.

Then run the code with:

python main.py [--FLAGS]

Note that the default arguments are for the unaligned version of MOSEI. For other datasets, please refer to the Supplementary.

If Using CTC

The Transformer requires no CTC module. However, as we describe in the paper, a CTC module offers an alternative that allows other kinds of sequence models (e.g., recurrent architectures) to be applied to unaligned multimodal streams.

If you want to use the CTC module, please install warp-ctc from here.

The quick version:

git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc
mkdir build; cd build
cmake ..
make
cd ../pytorch_binding
python setup.py install
export WARP_CTC_PATH=/home/xxx/warp-ctc/build

Acknowledgement

Portions of the code were adapted from the fairseq repo.