
EnvEdit: Environment Editing for Vision-and-Language Navigation (CVPR 2022)

In Vision-and-Language Navigation (VLN), an agent needs to navigate through the environment based on natural language instructions. Due to limited available data for agent training and finite diversity in navigation environments, it is challenging for the agent to generalize to new, unseen environments. To address this problem, we propose EnvEdit, a data augmentation method that creates new environments by editing existing environments, which are used to train a more generalizable agent. Our augmented environments can differ from the seen environments in three diverse aspects: style, object appearance, and object classes. Training on these edit-augmented environments prevents the agent from overfitting to existing environments and helps generalize better to new, unseen environments. Empirically, on both the Room-to-Room and the multi-lingual Room-Across-Room datasets, we show that our proposed EnvEdit method achieves significant improvements in all metrics on both pre-trained and non-pre-trained VLN agents, and achieves the new state-of-the-art on the test leaderboard. We further ensemble the VLN agents augmented on different edited environments and show that these editing methods are complementary.

<img src="./figures/intro_new.png" alt="intro image" width="500"/>

Environment Installation

  1. Follow the instructions here to install the Matterport3D simulators.

  2. Install requirements:

pip install -r python_requirements.txt

Data Preparation

  1. Pre-Computed Features:

To use EnvDrop as the base agent, download the pre-extracted visual features:

wget https://nlp.cs.unc.edu/data/envedit/features.zip

The pre-computed features for the original environments are named CLIP-ViT-B-patch_size-views.tsv.

The pre-computed features for the edited environments are named CLIP-ViT-B-patch_size-views-edit_env_name.tsv.
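
These .tsv files follow the image-feature TSV format commonly used by R2R/EnvDrop-style agents. Below is a minimal loading sketch under that assumption; the column names, the 36-views-per-panorama layout, and the example filename are assumptions, so check the repository's feature-reading code if your files differ.

import base64
import csv
import sys

import numpy as np

# Assumed column layout of the R2R-style feature TSV.
TSV_FIELDNAMES = ['scanId', 'viewpointId', 'image_w', 'image_h', 'vfov', 'features']
VIEWS = 36  # discretized views per panorama (assumed)

def load_features(tsv_path):
    """Return {(scanId, viewpointId): (VIEWS, feature_dim) float32 array}."""
    csv.field_size_limit(sys.maxsize)  # the base64 feature strings are long
    features = {}
    with open(tsv_path, 'rt') as f:
        reader = csv.DictReader(f, delimiter='\t', fieldnames=TSV_FIELDNAMES)
        for row in reader:
            feats = np.frombuffer(base64.b64decode(row['features']), dtype=np.float32)
            features[(row['scanId'], row['viewpointId'])] = feats.reshape(VIEWS, -1)
    return features

# Example filename following the pattern above (patch size 32).
feats = load_features('CLIP-ViT-B-32-views.tsv')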

To use HAMT as the base agent, download the pre-extracted visual features:

wget https://nlp.cs.unc.edu/data/envedit/features_HAMT.zip

The main differences between the features for the HAMT agent and the features for the EnvDrop agent are:

(1) On the Room-to-Room dataset, the features for HAMT are extracted with the visual backbone of the pre-trained model released in HAMT (not fine-tuned on the VLN task); this backbone does not include the last representation layer of CLIP. The features for the EnvDrop agent are extracted with the CLIP pre-trained visual backbone including the last representation layer (mapped to 512 dimensions).

(2) On the Room-Across-Room dataset, the features for HAMT are exactly the same as the CLIP-ViT-B-32-edit_env_name.tsv features used for the EnvDrop agent.
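
In other words, the EnvDrop features are the 512-dimensional embeddings after CLIP's final projection, while the HAMT features come from the ViT backbone before that projection. The sketch below only illustrates this with/without-projection distinction using OpenAI's clip package; it is not the actual extraction pipeline, and the released HAMT features are extracted with HAMT's own pre-trained backbone rather than raw CLIP.

import clip
import torch
from PIL import Image

# Load CLIP ViT-B/32; model.visual is the vision transformer backbone.
model, preprocess = clip.load('ViT-B/32', device='cpu')
image = preprocess(Image.open('example_view.jpg')).unsqueeze(0)  # hypothetical view image

with torch.no_grad():
    # With the final representation (projection) layer: 512-d, as used for EnvDrop.
    projected = model.encode_image(image)  # shape (1, 512)

    # Without the final projection layer: backbone width (768 for ViT-B), as in the HAMT-style features.
    proj = model.visual.proj
    model.visual.proj = None  # temporarily drop the projection
    pre_projection = model.visual(image.type(model.dtype))  # shape (1, 768)
    model.visual.proj = proj  # restore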

  2. Access to Edited Environments:

To get access to the edited environments, please first sign the Terms of Use agreement form (download it here) and cc the email to us at jialuli@cs.unc.edu. We will then share a download link.

  3. Download the Room-to-Room (R2R) and Room-Across-Room (RxR) datasets.

To work with the base agent EnvDrop, download the data following the instructions here.

To work with the base agent HAMT, download the data following the instructions here.

Stage 1: Edited Environment Creation

  1. Create Environments with Style Transfer:

First, download the code from Style Augmentation into the ./style_transfer directory. Please set up the environment according to their instructions.

Next, use style transfer to create edited environments by running:

cd style_transfer
CUDA_VISIBLE_DEVICES=0 python style_transfer.py --input views_img --output views_img_style_transfer

Modify the input path and output path as needed.

  2. Create Environments with Image Synthesis:

cd image_synthesis

The Matterport3D dataset provides semantic segmentation for the images in the dataset. First, we map the semantic segmentation RGB values to semantic segmentation classes:

python transfer_semantics.py --input views_sem_image --output views_sem_image_transferred

Modify the input path and output path as needed.
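
A minimal sketch of this kind of RGB-to-class mapping is shown below; the color table and filenames are hypothetical placeholders, and transfer_semantics.py defines the actual mapping.

import numpy as np
from PIL import Image

# Hypothetical RGB -> class-index table; the real colors come from the
# Matterport3D semantic labels handled in transfer_semantics.py.
RGB_TO_CLASS = {
    (0, 0, 0): 0,        # void
    (174, 199, 232): 1,  # wall
    (152, 223, 138): 2,  # floor
}

def rgb_to_class(sem_rgb):
    """Convert an (H, W, 3) semantic RGB image into an (H, W) class-index map."""
    classes = np.zeros(sem_rgb.shape[:2], dtype=np.uint8)
    for color, cls in RGB_TO_CLASS.items():
        mask = np.all(sem_rgb == np.array(color, dtype=sem_rgb.dtype), axis=-1)
        classes[mask] = cls
    return classes

sem = np.array(Image.open('views_sem_image/example_view.png').convert('RGB'))
Image.fromarray(rgb_to_class(sem)).save('views_sem_image_transferred/example_view.png')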

Then, create a file list that contains all images in the seen environments. We provide files_img.list and files_sem.list for convenience. Move them to the original image directory and the transferred semantic segmentation directory respectively, and rename them to files.list:

mv files_img.list ../../views_img/files.list
mv files_sem.list ../views_sem_image_transferred/files.list

Note that views_img is assumed to be located outside the EnvEdit directory, and views_sem_image_transferred under the EnvEdit directory (outside the image_synthesis directory). You need to manually edit the contents of files_img.list and files_sem.list if your views_img and views_sem_image_transferred are located elsewhere.
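
In that case you can also regenerate the file lists directly. The sketch below assumes files.list simply contains one image path per line, relative to the directory it sits in; check the provided files_img.list and files_sem.list for the exact format expected by the data loader.

import os

def write_files_list(root, extensions=('.jpg', '.png')):
    """Write a files.list in `root` with one relative image path per line."""
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    with open(os.path.join(root, 'files.list'), 'w') as f:
        f.write('\n'.join(sorted(paths)) + '\n')

write_files_list('../../views_img')                 # original RGB views
write_files_list('../views_sem_image_transferred')  # transferred semantic maps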

Then, we use the SPADE code (with small modifications) to create edited environments with image synthesis. Please set up the environment according to their instructions.

To train SPADE on the VLN dataset:

bash scripts/train.bash

To generate edited environments with the trained model:

bash scripts/test.bash

Stage 2: VLN Navigation Agent Training

Train the agent on both the original environments and the edited environments (see the feature-mixing sketch after the commands below):

Base Agent EnvDrop:

bash run/agent.bash 0

Base Agent HAMT:

cd hamt_src
bash scripts/run_r2r.bash
bash scripts/run_rxr.bash
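
Conceptually, training on the edited environments means that for each episode the agent's visual features are looked up either in the original feature file or in one of the edited-environment feature files. The sketch below shows one simple way to do this random selection; it is not the authors' exact implementation (the actual mixing logic lives in the training scripts above), the edited-environment names are hypothetical, and load_features refers to the TSV-loading sketch in the Data Preparation section.

import random

# Feature files keyed by environment name, following the naming pattern above
# (patch size 32 chosen as an example; the edited-environment names are placeholders).
FEATURE_FILES = {
    'original': 'CLIP-ViT-B-32-views.tsv',
    'style_transfer': 'CLIP-ViT-B-32-views-style_transfer.tsv',
    'image_synthesis': 'CLIP-ViT-B-32-views-image_synthesis.tsv',
}
feature_stores = {name: load_features(path) for name, path in FEATURE_FILES.items()}

def sample_episode_features(scan_id, viewpoint_id):
    """Randomly pick the original or an edited environment for this episode."""
    env_name = random.choice(list(feature_stores))
    return feature_stores[env_name][(scan_id, viewpoint_id)]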

Stage 3: Back Translation with Style-aware Speaker

Train the style-aware speaker with:

bash run/speaker.bash 0

Run back translation with the style-aware speaker:

bash run/bt.bash 0

Citation

If you find this work useful, please consider citing:

@inproceedings{li2022envedit,
  title     = {EnvEdit: Environment Editing for Vision-and-Language Navigation},
  author    = {Jialu Li and Hao Tan and Mohit Bansal},
  booktitle = {CVPR},
  year      = {2022}
}

Acknowledgement:

We thank the developers of EnvDrop, HAMT, Style Augmentation, and SPADE for their public code releases.