Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

<p align="center"> <img src='readme.jpeg' align="center" height="200px"> </p>

PyTorch Code of the ECCV 2022 paper:

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation,
Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan

Introduction

MTVM is a multimodal transformer for vision-and-language navigation that maintains a variable-length memory of the agent's navigation history. This repository provides the PyTorch implementation of the ECCV 2022 paper.

Results

<p align="center"> <img src='results.jpeg' align="center" height="300px"> </p>

Requirements

```bash
pip install -r requirements.txt
sudo apt-get install libjsoncpp-dev libepoxy-dev libglm-dev libosmesa6 libosmesa6-dev libglew-dev
```
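
After installing the requirements, a quick way to confirm the environment is usable is to check that PyTorch imports and sees a GPU. This is only a minimal sanity check, not part of the repo:

```python
# Minimal environment sanity check (illustrative, not part of this repo).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```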

Installation

Build the simulator with the following instructions. The simulator is version v0.1 of the Matterport3D Simulator.

```bash
mkdir build && cd build
cmake -DOSMESA_RENDERING=ON ..
make
```
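
If the build succeeds, the simulator is exposed to Python as the MatterSim module. The sketch below shows roughly how the v0.1 bindings are driven; the scan and viewpoint IDs are placeholders, and camera settings should be adapted to your setup:

```python
# Minimal sketch of the v0.1 MatterSim Python API (IDs below are placeholders).
import math
import MatterSim

sim = MatterSim.Simulator()
sim.setCameraResolution(640, 480)
sim.setCameraVFOV(math.radians(60))
sim.setDiscretizedViewingAngles(True)  # discretized heading/elevation steps
sim.init()

# Start an episode at a given scan and viewpoint, with heading 0 and elevation 0.
sim.newEpisode('SCAN_ID', 'VIEWPOINT_ID', 0, 0)
state = sim.getState()
print(state.scanId, state.location.viewpointId, len(state.navigableLocations))

# Move to the first navigable neighbour (index 0 is the current viewpoint),
# with zero change in heading and elevation.
if len(state.navigableLocations) > 1:
    sim.makeAction(1, 0, 0)
```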

Prepare datasets

Please follow the data preparation instructions from Recurrent VLN-BERT.
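
Once the data is in place, the R2R splits are plain JSON files. Purely for orientation, here is a minimal sketch of reading one split, assuming the Recurrent VLN-BERT layout with the annotations under data/:

```python
# Illustrative only: peek at one R2R split, assuming it lives under data/.
import json

with open('data/R2R_train.json') as f:
    episodes = json.load(f)

ep = episodes[0]
print('scan:', ep['scan'])                    # Matterport scan ID
print('path length:', len(ep['path']))        # viewpoint IDs along the ground-truth path
print('instruction:', ep['instructions'][0])  # each episode carries several instructions
```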

R2R Navigation benchmark evaluation and training

The MTVM models are initialized from PREVALENT (specified by the --vlnbert flag in the train_agent.bash file). Please download the pretrained model and place it under Prevalent/pretrained_model/ before training the MTVM models.
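
The training script takes care of loading this checkpoint. Purely as an illustration, a downloaded checkpoint can be inspected with plain PyTorch before training; the file name below is an assumption about the PREVALENT release, not something this repo guarantees:

```python
# Illustrative only: inspect a downloaded PREVALENT checkpoint.
# 'pytorch_model.bin' is a placeholder name; use whatever file the release provides,
# and note that the checkpoint may be wrapped differently than assumed here.
import torch

ckpt = torch.load('Prevalent/pretrained_model/pytorch_model.bin', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt
print('parameter tensors:', len(state_dict))
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))
```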

To train a model, run

```bash
bash run/train_agent.bash
```

To evaluate with a trained or pretrained model, run

```bash
bash run/test_agent.bash
```

Download the trained network weights here.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{lin2021multimodal,
  title={Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation},
  author={Lin, Chuang and Jiang, Yi and Cai, Jianfei and Qu, Lizhen and Haffari, Gholamreza and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2111.05759},
  year={2021}
}

Acknowledgments

This repo is based on Recurrent VLN-BERT. Thanks for their wonderful work.