X-modaler

X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval). It unifies comprehensive, high-quality modules from state-of-the-art vision-language techniques, organized in a standardized and user-friendly fashion.

The original paper can be found here.

<p align="center"> <img src="images/task.jpg" width="800"/> </p>

Installation

See installation instructions.

Requirements

Getting Started

See Getting Started with X-modaler.

Training & Evaluation in Command Line

We provide a script, "train_net.py", that trains models for all the configs provided in X-modaler. You may want to use it as a reference when writing your own training script (a minimal sketch is shown after the commands below).

To train a model (e.g., Up-Down) with "train_net.py", first set up the corresponding datasets (see datasets), then run:

# Teacher Forcing
python train_net.py --num-gpus 4 \
 	--config-file configs/image_caption/updown.yaml

# Reinforcement Learning
python train_net.py --num-gpus 4 \
 	--config-file configs/image_caption/updown_rl.yaml
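
For a custom training script, "train_net.py" can be reduced to a few lines. The sketch below is a minimal example, not the codebase's definitive entry point; the module paths and helper names (`xmodaler.config.get_cfg`, `xmodaler.engine.DefaultTrainer`, `default_argument_parser`, `default_setup`, `launch`) are assumptions based on the detectron2-style layout, so check "train_net.py" in the repository for the exact API.

```python
# my_train_net.py -- minimal sketch of a custom training script.
# Assumption: X-modaler exposes a detectron2-style engine API; see the
# repository's train_net.py for the exact imports and hooks.
from xmodaler.config import get_cfg
from xmodaler.engine import (
    DefaultTrainer,
    default_argument_parser,
    default_setup,
    launch,
)


def setup(args):
    # Build the config from the YAML file plus any command-line overrides.
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    default_setup(cfg, args)
    return cfg


def main(args):
    cfg = setup(args)
    trainer = DefaultTrainer(cfg)
    # Resume from the last checkpoint if --resume is given.
    trainer.resume_or_load(resume=args.resume)
    return trainer.train()


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    # launch handles both single-GPU and distributed multi-GPU training.
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
```

Such a script is launched the same way as "train_net.py", e.g. `python my_train_net.py --num-gpus 4 --config-file configs/image_caption/updown.yaml`.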

Model Zoo and Baselines

A large set of baseline results and trained models is available here.

<table>
  <tr> <td colspan="4" align="center"><font size=3><b>Image Captioning</b></font></td> </tr>
  <tr> <td>Attention</td> <td>Show, attend and tell: Neural image caption generation with visual attention</td> <td>ICML</td> <td>2015</td> </tr>
  <tr> <td>LSTM-A3</td> <td>Boosting image captioning with attributes</td> <td>ICCV</td> <td>2017</td> </tr>
  <tr> <td>Up-Down</td> <td>Bottom-up and top-down attention for image captioning and visual question answering</td> <td>CVPR</td> <td>2018</td> </tr>
  <tr> <td>GCN-LSTM</td> <td>Exploring visual relationship for image captioning</td> <td>ECCV</td> <td>2018</td> </tr>
  <tr> <td>Transformer</td> <td>Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning</td> <td>ACL</td> <td>2018</td> </tr>
  <tr> <td>Meshed-Memory</td> <td>Meshed-Memory Transformer for Image Captioning</td> <td>CVPR</td> <td>2020</td> </tr>
  <tr> <td>X-LAN</td> <td>X-Linear Attention Networks for Image Captioning</td> <td>CVPR</td> <td>2020</td> </tr>
  <tr> <td colspan="4" align="center"><font size=3><b>Video Captioning</b></font></td> </tr>
  <tr> <td>MP-LSTM</td> <td>Translating Videos to Natural Language Using Deep Recurrent Neural Networks</td> <td>NAACL HLT</td> <td>2015</td> </tr>
  <tr> <td>TA</td> <td>Describing Videos by Exploiting Temporal Structure</td> <td>ICCV</td> <td>2015</td> </tr>
  <tr> <td>Transformer</td> <td>Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning</td> <td>ACL</td> <td>2018</td> </tr>
  <tr> <td>TDConvED</td> <td>Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning</td> <td>AAAI</td> <td>2019</td> </tr>
  <tr> <td colspan="4" align="center"><font size=3><b>Vision-Language Pretraining</b></font></td> </tr>
  <tr> <td>Uniter</td> <td>UNITER: UNiversal Image-TExt Representation Learning</td> <td>ECCV</td> <td>2020</td> </tr>
  <tr> <td>TDEN</td> <td>Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network</td> <td>AAAI</td> <td>2021</td> </tr>
</table>

Image Captioning on MSCOCO (Cross-Entropy Loss)

| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LSTM-A3 | GoogleDrive | 75.3 | 59.0 | 45.4 | 35.0 | 26.7 | 55.6 | 107.7 | 19.7 |
| Attention | GoogleDrive | 76.4 | 60.6 | 46.9 | 36.1 | 27.6 | 56.6 | 113.0 | 20.4 |
| Up-Down | GoogleDrive | 76.3 | 60.3 | 46.6 | 36.0 | 27.6 | 56.6 | 113.1 | 20.7 |
| GCN-LSTM | GoogleDrive | 76.8 | 61.1 | 47.6 | 36.9 | 28.2 | 57.2 | 116.3 | 21.2 |
| Transformer | GoogleDrive | 76.4 | 60.3 | 46.5 | 35.8 | 28.2 | 56.7 | 116.6 | 21.3 |
| Meshed-Memory | GoogleDrive | 76.3 | 60.2 | 46.4 | 35.6 | 28.1 | 56.5 | 116.0 | 21.2 |
| X-LAN | GoogleDrive | 77.5 | 61.9 | 48.3 | 37.5 | 28.6 | 57.6 | 120.7 | 21.9 |
| TDEN | GoogleDrive | 75.5 | 59.4 | 45.7 | 34.9 | 28.7 | 56.7 | 116.3 | 22.0 |

Image Captioning on MSCOCO (CIDEr Score Optimization)

| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LSTM-A3 | GoogleDrive | 77.9 | 61.5 | 46.7 | 35.0 | 27.1 | 56.3 | 117.0 | 20.5 |
| Attention | GoogleDrive | 79.4 | 63.5 | 48.9 | 37.1 | 27.9 | 57.6 | 123.1 | 21.3 |
| Up-Down | GoogleDrive | 80.1 | 64.3 | 49.7 | 37.7 | 28.0 | 58.0 | 124.7 | 21.5 |
| GCN-LSTM | GoogleDrive | 80.2 | 64.7 | 50.3 | 38.5 | 28.5 | 58.4 | 127.2 | 22.1 |
| Transformer | GoogleDrive | 80.5 | 65.4 | 51.1 | 39.2 | 29.1 | 58.7 | 130.0 | 23.0 |
| Meshed-Memory | GoogleDrive | 80.7 | 65.5 | 51.4 | 39.6 | 29.2 | 58.9 | 131.1 | 22.9 |
| X-LAN | GoogleDrive | 80.4 | 65.2 | 51.0 | 39.2 | 29.4 | 59.0 | 131.0 | 23.2 |
| TDEN | GoogleDrive | 81.3 | 66.3 | 52.0 | 40.1 | 29.6 | 59.8 | 132.6 | 23.4 |

Video Captioning on MSVD

| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MP-LSTM | GoogleDrive | 77.0 | 65.6 | 56.9 | 48.1 | 32.4 | 68.1 | 73.1 | 4.8 |
| TA | GoogleDrive | 80.4 | 68.9 | 60.1 | 51.0 | 33.5 | 70.0 | 77.2 | 4.9 |
| Transformer | GoogleDrive | 79.0 | 67.6 | 58.5 | 49.4 | 33.3 | 68.7 | 80.3 | 4.9 |
| TDConvED | GoogleDrive | 81.6 | 70.4 | 61.3 | 51.7 | 34.1 | 70.4 | 77.8 | 5.0 |

Video Captioning on MSR-VTT

| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MP-LSTM | GoogleDrive | 73.6 | 60.8 | 49.0 | 38.6 | 26.0 | 58.3 | 41.1 | 5.6 |
| TA | GoogleDrive | 74.3 | 61.8 | 50.3 | 39.9 | 26.4 | 59.4 | 42.9 | 5.8 |
| Transformer | GoogleDrive | 75.4 | 62.3 | 50.0 | 39.2 | 26.5 | 58.7 | 44.0 | 5.9 |
| TDConvED | GoogleDrive | 76.4 | 62.3 | 49.9 | 38.9 | 26.3 | 59.0 | 40.7 | 5.7 |

Visual Question Answering

| Name | Model | Overall | Yes/No | Number | Other |
| --- | --- | --- | --- | --- | --- |
| Uniter | GoogleDrive | 70.1 | 86.8 | 53.7 | 59.6 |
| TDEN | GoogleDrive | 71.9 | 88.3 | 54.3 | 62.0 |

Caption-based Image Retrieval on Flickr30k

| Name | Model | R1 | R5 | R10 |
| --- | --- | --- | --- | --- |
| Uniter | GoogleDrive | 61.6 | 87.7 | 92.8 |
| TDEN | GoogleDrive | 62.0 | 86.6 | 92.4 |

Visual Commonsense Reasoning

| Name | Model | Q → A | QA → R | Q → AR |
| --- | --- | --- | --- | --- |
| Uniter | GoogleDrive | 73.0 | 75.3 | 55.4 |
| TDEN | GoogleDrive | 75.0 | 76.5 | 57.7 |

License

X-modaler is released under the Apache License, Version 2.0.

Citing X-modaler

If you use X-modaler in your research, please use the following BibTeX entry.

@inproceedings{Xmodaler2021,
  author =       {Yehao Li and Yingwei Pan and Jingwen Chen and Ting Yao and Tao Mei},
  title =        {X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics},
  booktitle =    {Proceedings of the 29th ACM International Conference on Multimedia},
  year =         {2021}
}