Perceive, Transform, and Act
This is the PyTorch implementation of our paper:
Multimodal Attention Networks for Low-Level Vision-and-Language Navigation (paper)
Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, Rita Cucchiara
Computer Vision and Image Understanding (CVIU), 2021
Table of contents
- Installation
- Training and Testing
- Visualizing Navigation Episodes
- Reproducibility Note
- Citing
- License
- Acknowledgments
Installation
Our repository is based on the Matterport3D simulator, which was originally proposed with the Room-to-Room dataset.
As a first step, clone the repository and create the environment with conda:
git clone --recursive https://github.com/aimagelab/perceive-transform-and-act
cd perceive-transform-and-act
conda env create -f environment.yml
source activate pta
If you didn't clone with the --recursive flag, you'll need to manually clone the submodules from the top-level directory:
git submodule update --init --recursive
Building with Docker
Please follow the instructions on the Matterport3DSimulator repository to install the simulator via Docker.
Building without Docker
A C++ compiler with C++11 support is required. Matterport3D Simulator has several dependencies:
- Ubuntu >= 14.04
- Nvidia-driver with CUDA installed
- C++ compiler with C++11 support
- CMake >= 3.10
- OpenCV >= 2.4 including 3.x
- OpenGL
- GLM
- Numpy
Additional optional dependencies may be required depending on the CMake rendering options.
Once all of the dependencies are installed, you can build the simulator from source by typing:
mkdir build
cd build
cmake -DOSMESA_RENDERING=ON -DPYTHON_EXECUTABLE:FILEPATH=`path/to/your/python/bin` ..
make
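To verify the build, you can try importing the simulator bindings from Python (a minimal sketch, assuming the compiled module ends up in the build directory; adjust the path to match your setup):
import sys
sys.path.append('build')  # assumed location of the compiled MatterSim bindings
import MatterSim          # if this import succeeds, the simulator was built correctly
print('MatterSim imported successfully')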
Precomputed ResNet Image Features
Download the precomputed ResNet-152 (ImageNet) features and place the corresponding .tsv file into the img_features folder.
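If you want to sanity-check the downloaded features, the sketch below reads the .tsv assuming the standard R2R feature layout (scan and viewpoint ids plus base64-encoded 36x2048 float32 vectors); the file name used here is an assumption, so adapt it to the archive you downloaded:
import base64
import csv
import sys
import numpy as np

TSV_FIELDNAMES = ['scanId', 'viewpointId', 'image_w', 'image_h', 'vfov', 'features']
VIEWS, FEATURE_SIZE = 36, 2048  # assumed: 36 discretized views, 2048-d ResNet-152 features

csv.field_size_limit(sys.maxsize)  # feature fields exceed the default csv limit

features = {}
with open('img_features/ResNet-152-imagenet.tsv', 'rt') as f:  # file name assumed
    reader = csv.DictReader(f, delimiter='\t', fieldnames=TSV_FIELDNAMES)
    for row in reader:
        key = row['scanId'] + '_' + row['viewpointId']
        feats = np.frombuffer(base64.b64decode(row['features']), dtype=np.float32)
        features[key] = feats.reshape(VIEWS, FEATURE_SIZE)
print(len(features), 'viewpoints loaded')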
Training and Testing
To train PTA from scratch, move to the root directory and run:
python tasks/R2R/main.py --name train_from_scratch \
--plateau_sched \
--lr 1e-4 \
--max_episode_len 30
We also provide the weights obtained with the training procedure described in the paper. To reproduce our low-level results, run:
python tasks/R2R/main.py --name test_ll \
--max_episode_len 30 \
--eval_only \
--pretrained \
--load_from low_level
Our agent can also perform high-level Vision-and-Language Navigation. To reproduce the results obtained with the high-level setup, run:
python tasks/R2R/main.py --name test_hl \
--high_level \
--max_episode_len 10 \
--eval_only \
--pretrained \
--load_from high_level
Visualizing Navigation Episodes
To make our qualitative results easier to visualize, we provide .gif files displaying some of the navigation episodes reported in our paper, together with the relevant evaluation metrics.
Low-level VLN in R2R
<p> <img src="teaser/r2r_3.gif" width="420"> <img src="teaser/r2r_2.gif" width="420"> </p>Low-level VLN in R4R
<p> <img src="teaser/r4r_1.gif" width="420"> <img src="teaser/r4r_2.gif" width="420"> </p>Reproducibility Note
Our experiments were run on an Nvidia 1080Ti GPU with CUDA 10.0 and Python 3.6.8. Using different hardware setups or software versions may affect results.
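For completeness, a common way to reduce run-to-run variance in PyTorch is to fix all random seeds and force deterministic cuDNN kernels (a generic sketch, not necessarily what our training scripts do):
import random
import numpy as np
import torch

def set_seed(seed=0):
    # Fix all sources of randomness; cuDNN determinism trades speed for repeatability.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False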
Citing
If you find our code useful for your research, please cite our paper:
Bibtex:
@article{landi2021multimodal,
title={Multimodal attention networks for low-level vision-and-language navigation},
author={Landi, Federico and Baraldi, Lorenzo and Cornia, Marcella and Corsini, Massimiliano and Cucchiara, Rita},
journal={Computer Vision and Image Understanding},
year={2021},
publisher={Elsevier}
}
License
PTA is MIT licensed. See the LICENSE file for details.
The trained models are considered data derived from the corresponding scene datasets.
Acknowledgments
This work has been supported by "Fondazione di Modena" and by the national project "IDEHA: Innovation for Data Elaboration in Heritage Areas" (PON ARS01_00421), co-funded by the Italian Ministry of University and Research.