Sim2Sim-VLNCE

Official implementation of the ECCV 2022 Oral paper: Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments

Jacob Krantz and Stefan Lee

[Project Page] [Paper]

Setup

This project is modified from the VLN-CE repository starting from this commit.

Initialize the project

git clone --recurse-submodules git@github.com:jacobkrantz/Sim2Sim-VLNCE.git
cd Sim2Sim-VLNCE

conda env create -f environment.yml
conda activate sim2sim

Install the latest version of Matterport3DSimulator

If you do not want to run experiments with known subgoal candidates, you can skip this install and remove code references to MatterSim.

Download the Matterport3D scene meshes

# download_mp.py must be run with Python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
# Extract to: ./data/scene_datasets/mp3d/{scene}/{scene}.glb

download_mp.py must be obtained from the Matterport3D project webpage.

Download the Room-to-Room episodes in VLN-CE format (link)

gdown "https://drive.google.com/uc?id=1T9SjqZWyR2PCLSXYkFckfDeIs6Un0Rjm"
# Extract to: ./data/datasets/R2R_VLNCE_v1-3/{split}/{split}.json.gz
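
A quick way to sanity-check the extracted episode files is to open one with Python's standard library. This is a minimal sketch and not part of the repository: the helper name `load_split` is hypothetical, and it assumes only that each split is a gzip-compressed JSON document at the path pattern shown above.

```python
import gzip
import json
from pathlib import Path

# Root of the extracted dataset, following the path pattern in this README.
DATA_ROOT = Path("data/datasets/R2R_VLNCE_v1-3")


def load_split(split, root=DATA_ROOT):
    """Load {root}/{split}/{split}.json.gz and return the parsed JSON.

    `load_split` is a hypothetical helper, not a function from this repo.
    """
    path = Path(root) / split / f"{split}.json.gz"
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def describe(data):
    """Summarize the top-level structure without assuming specific keys."""
    if isinstance(data, dict):
        return sorted(data.keys())
    return [type(data).__name__, len(data)]
```

For example, `describe(load_split("val_unseen"))` lists the top-level keys of that split, assuming a split folder with that name was extracted.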

Download the ResNet image encoder

./scripts/download_caffe_models.sh
# this populates ./data/caffe_models/

Download the MP3D connectivity graphs

./scripts/download_connectivity.sh
# this populates ./connectivity/
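
Before running experiments, it can help to confirm that the download steps above populated the expected folders. The sketch below is an assumption-laden convenience check and not part of the repository: the function names are hypothetical, and the directory list is taken only from the paths mentioned in this README.

```python
from pathlib import Path

# Directories that the download steps in this README say they populate.
EXPECTED_DIRS = [
    "data/scene_datasets/mp3d",
    "data/datasets/R2R_VLNCE_v1-3",
    "data/caffe_models",
    "connectivity",
]


def missing_paths(project_root="."):
    """Return the expected directories that are absent or empty."""
    root = Path(project_root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


def scenes_without_mesh(project_root="."):
    """Return scene folders that lack the {scene}/{scene}.glb mesh file."""
    mp3d = Path(project_root) / "data/scene_datasets/mp3d"
    if not mp3d.is_dir():
        return []
    return sorted(
        d.name
        for d in mp3d.iterdir()
        if d.is_dir() and not (d / f"{d.name}.glb").is_file()
    )
```

Calling `missing_paths()` and `scenes_without_mesh()` from the project root should both return empty lists once setup is complete.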

Evaluating Recurrent-VLN-BERT Models

We evaluate a discrete VLN agent at various points of transfer to continuous environments. The two model components that enable this are the subgoal generation module and the navigation module, illustrated below:

This repository supports the following evaluations of Recurrent-VLN-BERT. The checkpoint to evaluate can be specified by appending EVAL_CKPT_PATH_DIR path/to/checkpoint.pth to the run command.
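
When scripting several evaluations, the run command with an appended checkpoint can be assembled programmatically. The following is a minimal sketch: `eval_command` is a hypothetical helper, not part of this repository, and the checkpoint path simply follows the extracted model layout described under Model Downloads.

```python
import shlex


def eval_command(exp_config, ckpt_path=None, run_type=None):
    """Build the argument list for a run.py invocation.

    Hypothetical helper; it only mirrors the command shapes shown in this README.
    """
    cmd = ["python", "run.py"]
    if run_type is not None:
        cmd += ["--run-type", run_type]
    cmd += ["--exp-config", exp_config]
    if ckpt_path is not None:
        # The checkpoint override is appended after the flags.
        cmd += ["EVAL_CKPT_PATH_DIR", ckpt_path]
    return cmd


cmd = eval_command(
    "sim2sim_vlnce/config/graph-teleport.yaml",
    ckpt_path="data/models/RecVLNBERT.pth",
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the project root to launch the evaluation.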

Known Subgoals

Known subgoal candidates come from the MP3D-Sim navigation graph, just as in discrete VLN. The following experiments consider different policies for navigating to selected subgoals.

Teleportation: the discrete VLN task in Habitat

python run.py --exp-config sim2sim_vlnce/config/graph-teleport.yaml

Oracle policy: an A$^*$-based navigator

python run.py --exp-config sim2sim_vlnce/config/graph-oracle_policy.yaml

Local policy: a realistic map-and-plan navigator

python run.py --exp-config sim2sim_vlnce/config/graph-local_policy.yaml

Predicted Subgoals

Predicted subgoals from the subgoal generation module (SGM)

python run.py --exp-config sim2sim_vlnce/config/sgm-local_policy.yaml

Inference for leaderboard submissions

python run.py \
  --run-type inference \
  --exp-config sim2sim_vlnce/config/sgm-local_policy-inference.yaml

All experiment configs are set for a GPU with 32GB of memory. On smaller cards, reduce the fields RL.POLICY.OBS_TRANSFORMS.RESNET_CANDIDATE_ENCODER.max_batch_size and IL.batch_size as needed.
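
One rough way to pick smaller values is to scale each batch size in proportion to the available GPU memory. This is a heuristic sketch only: memory use is not exactly linear in batch size, the function names are hypothetical, and the default sizes must be read from the YAML config you are running. Passing the resulting tokens as trailing `KEY VALUE` overrides is an assumption that they behave like `EVAL_CKPT_PATH_DIR`; editing the YAML directly is the safe alternative.

```python
REFERENCE_GPU_GB = 32  # the configs in this README target a 32GB GPU


def scaled_batch_size(default_size, gpu_gb, reference_gb=REFERENCE_GPU_GB):
    """Scale a batch size linearly with GPU memory, never going below 1."""
    if gpu_gb >= reference_gb:
        return default_size
    return max(1, int(default_size * gpu_gb / reference_gb))


def batch_overrides(encoder_default, il_default, gpu_gb):
    """Return KEY VALUE tokens for the two batch-size fields.

    Assumption: run.py accepts these as trailing overrides, as it does
    for EVAL_CKPT_PATH_DIR. The defaults are inputs, not known values.
    """
    return [
        "RL.POLICY.OBS_TRANSFORMS.RESNET_CANDIDATE_ENCODER.max_batch_size",
        str(scaled_batch_size(encoder_default, gpu_gb)),
        "IL.batch_size",
        str(scaled_batch_size(il_default, gpu_gb)),
    ]
```

If out-of-memory errors persist at the scaled values, keep halving until the run fits.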

Training VLN Models

Training Recurrent-VLN-BERT should be done in that repository. Other panorama-based VLN agents could also be transferred with this Sim2Sim method but are not currently supported.

To train with 3D reconstruction image features, either download them from here (habitat-ResNet-152-places365.tsv) or generate them yourself:

# ~4.5 hours on a 32GB Tesla V100 GPU.
python scripts/precompute_features.py
  [-h]
  [--caffe-prototxt CAFFE_PROTOTXT]
  [--caffe-model CAFFE_MODEL]
  [--save-to SAVE_TO]
  [--connectivity CONNECTIVITY]
  [--scenes-dir SCENES_DIR]
  [--batch-size BATCH_SIZE]
  [--gpu-id GPU_ID]

By default, the script uses the same Caffe ResNet as MP3D-Sim. We use these features to train both the VLN agent and the SGM. They are a drop-in replacement for the image features captured in MP3D-Sim under the name ResNet-152-places365.tsv, as described in that README.
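
To inspect a feature file, a reader along the following lines may be useful. It is a sketch under stated assumptions, not code from this repository: it assumes the TSV follows the MP3D-Sim layout (six tab-separated columns, with `features` holding base64-encoded float32 values for 36 views of 2048 dimensions each) and that the file was written on a machine with the same byte order. Verify these against your file before relying on it.

```python
import base64
import csv
from array import array

# Column layout assumed from the MP3D-Sim feature files (an assumption).
TSV_FIELDNAMES = ["scanId", "viewpointId", "image_w", "image_h", "vfov", "features"]

# Feature blobs are far larger than the csv module's default field limit.
csv.field_size_limit(2**31 - 1)


def read_features(tsv_path, views=36, feature_dim=2048):
    """Yield (scanId, viewpointId, rows), where rows has `views` feature lists."""
    with open(tsv_path, "r", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t", fieldnames=TSV_FIELDNAMES)
        for row in reader:
            flat = array("f")  # 4-byte floats in native byte order
            flat.frombytes(base64.b64decode(row["features"]))
            if len(flat) != views * feature_dim:
                raise ValueError(
                    f"expected {views * feature_dim} floats, got {len(flat)}"
                )
            rows = [
                flat[i * feature_dim:(i + 1) * feature_dim].tolist()
                for i in range(views)
            ]
            yield row["scanId"], row["viewpointId"], rows
```

The standard library keeps the sketch dependency-free; with NumPy available, `numpy.frombuffer(..., dtype=numpy.float32).reshape(views, feature_dim)` is the more common way to decode each blob.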

Fine-Tuning in Continuous Environments

Collect trajectories of optimal SGM selections

python run.py \
  --run-type collect \
  --exp-config sim2sim_vlnce/config/collect_ftune_data.yaml

Fine-tune the VLN agent

python run.py \
  --run-type train \
  --exp-config sim2sim_vlnce/config/train_vln_ftune.yaml

Subgoal Generation Module (SGM)

We use the vln-sim2real-envs repository (specifically the /actions/ folder) to train the SGM. We use the 3D reconstruction image features described above and train with 360$^{\circ}$ vision.

Model Downloads

VLN weights [zip]. Extracted format: ./data/models/{Model-Name}

VLN Model	Model Name	Description
1	`RecVLNBERT.pth`	Published weights from Recurrent-VLN-BERT
2	`RecVLNBERT_retrained.pth`	Weights from our own retraining of Recurrent-VLN-BERT
3	`RecVLNBERT-ce_vision.pth`	Trained with 3D reconstruction image features
4	`RecVLNBERT-ce_vision-tuned.pth`	Row 3 fine-tuned in VLN-CE (leaderboard model)

SGM weights [zip]. Extracted format: ./data/sgm_models/{Model-Name}

SGM Model	Model Name	Description
1	`sgm_sim2real.pth`	Published weights from VLN Sim2Real
2	`sgm_sim2sim.pth`	360$^{\circ}$ vision and 3D reconstruction image features

License

Our code is MIT licensed. Trained models are considered data derived from the Matterport3D scene dataset and are distributed according to the Matterport3D Terms of Use.

Related Works

1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition. Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, Jing Shao. arXiv 2022.

Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation. Yicong Hong, Zun Wang, Qi Wu, Stephen Gould. CVPR 2022.

Waypoint Models for Instruction-guided Navigation in Continuous Environments. Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, Oleksandr Maksymets. ICCV 2021.

Sim-to-Real Transfer for Vision-and-Language Navigation. Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, Stefan Lee. CoRL 2021.

Citing

@inproceedings{krantz2022sim2sim,
  title={Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments},
  author={Krantz, Jacob and Lee, Stefan},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022}
}