BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps

<img src="teaser/pytorch-logo-dark.png" width="10%"> License: MIT

This is the PyTorch implementation of our paper:

BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps<br> Wang Zhu*, Hexiang Hu*, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, Fei Sha<br> 2020 Annual Conference of the Association for Computational Linguistics (ACL 2020)

[arXiv] [GitHub]

Abstract

Learning to follow instructions is of fundamental importance to autonomous agents for vision-and-language navigation (VLN). In this paper, we study how an agent can navigate long paths when learning from a corpus that consists of shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially. A specially designed memory buffer is used by the agent to turn its past experiences into contexts for future steps. The learning process is composed of two phases. In the first phase, the agent uses imitation learning from demonstrations to accomplish BabySteps. In the second phase, the agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks with increasingly longer instructions. We create two new benchmark datasets (of long navigation tasks) and use them in conjunction with existing ones to examine BabyWalk's generalization ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics and, in particular, is able to follow long instructions better. <br> <br> <img src="teaser/babywalk_curriculum.jpg" width="100%">
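The memory buffer described above turns completed BabySteps into a context for the next one. The sketch below is purely illustrative (the class and method names are our assumptions, not this repository's API): it summarizes each finished BabyStep by mean-pooling its state vectors and exposes the pooled history as a single context vector.

```python
class BabyStepMemory:
    """Illustrative memory buffer: summarizes past BabySteps into a context.

    This is a conceptual sketch, not the repository's actual implementation,
    which operates on learned trajectory/instruction encodings.
    """

    def __init__(self, dim):
        self.dim = dim
        self.summaries = []  # one summary vector per completed BabyStep

    def add(self, step_states):
        # step_states: list of state vectors (lists of floats) for one BabyStep.
        # Summarize the step by mean-pooling over time.
        pooled = [sum(col) / len(step_states) for col in zip(*step_states)]
        self.summaries.append(pooled)

    def context(self):
        # Past experience as context for the next BabyStep; zeros if empty.
        if not self.summaries:
            return [0.0] * self.dim
        return [sum(col) / len(self.summaries) for col in zip(*self.summaries)]
```

In the actual agent, this role is played by learned summaries of past trajectories and instructions rather than raw mean-pooling.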

Installation

  1. Install Python 3.7 (Anaconda recommended: https://www.anaconda.com/distribution/).
  2. Install PyTorch following the instructions on https://pytorch.org/ (we used PyTorch 1.1.0 in our experiments).
  3. Download this repository or clone with Git, and then enter the root directory of the repository:
git clone https://github.com/Sha-Lab/babywalk
cd babywalk
  4. Install the required packages listed in requirement.txt.
  5. Download and preprocess the data:
chmod +x download.sh
./download.sh

After this step, check that the data has been downloaded and preprocessed correctly.

Update: the old link for the ResNet features has expired. Please see here for the new link and the additional landmark alignment code.

Training and evaluation

Here we take training BABYWALK on R2R as an example.

Warmup with IL

CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
    --split_postfix "_landmark" \
    --task_name R2R \
    --n_iters 50000 \
    --model_name "follower_bbw" \
    --il_mode "landmark_split" \
    --one_by_one \
    --one_by_one_mode "landmark" \
    --history \
    --log_every 100
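The warmup phase trains the agent with imitation learning on individual BabySteps, i.e. behavior cloning against demonstrated actions. As a hedged illustration (the function name and the plain-Python softmax are ours, not the repo's), the per-step loss reduces to the negative log-likelihood of the demonstrated action:

```python
import math

def il_loss(action_logits, demo_actions):
    """Toy behavior-cloning loss over one trajectory.

    action_logits: list of per-step logit lists over the action space.
    demo_actions: list of demonstrated action indices, one per step.
    Returns the mean negative log-likelihood of the demonstration.
    """
    total = 0.0
    for logits, a in zip(action_logits, demo_actions):
        z = [math.exp(l) for l in logits]          # unnormalized softmax
        total += -math.log(z[a] / sum(z))          # NLL of the demo action
    return total / len(demo_actions)
```

In the repository this is computed with PyTorch's cross-entropy over the follower's action logits; the sketch only shows the shape of the objective.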

Training with CRL

CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
    --split_postfix "_landmark" \
    --task_name R2R \
    --n_iters 30000 \
    --curriculum_iters 5000 \
    --model_name "follower_bbw_crl" \
    --one_by_one \
    --one_by_one_mode "landmark" \
    --history \
    --log_every 100 \
    --reward \
    --reward_type "cls" \
    --batch_size 64 \
    --curriculum_rl \
    --max_curriculum 4 \
    --no_speaker \
    --follower_prefix "tasks/R2R/follower/snapshots/follower_bbw_sample_train_iter_30000"
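One plausible reading of the curriculum flags above (`--curriculum_iters 5000`, `--max_curriculum 4`) is that the "lecture" grows by one BabyStep every `curriculum_iters` iterations, capped at `max_curriculum`. A toy schedule under that assumption (not taken from the repository's code):

```python
def babysteps_per_lecture(iteration, curriculum_iters=5000, max_curriculum=4):
    """Hypothetical curriculum schedule: start with 1 BabyStep and chain one
    more every `curriculum_iters` iterations, up to `max_curriculum`."""
    return min(1 + iteration // curriculum_iters, max_curriculum)
```

Consult `src/train_follower.py` for how the curriculum is actually advanced.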

Other baselines

Here we take training on R2R as an example, using Speaker-Follower (SF) and Reinforced Cross-modal Matching (RCM). The three commands below train SF with data augmentation, fine-tune SF, and train RCM with the CLS reward, respectively.

CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
    --task_name R2R \
    --n_iters 50000 \
    --model_name "follower_sf_aug" \
    --add_augment
CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
    --task_name R2R \
    --n_iters 20000 \
    --model_name "follower_sf" \
    --follower_prefix "tasks/R2R/follower/snapshots/best_model"
CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
    --task_name R2R \
    --n_iters 20000 \
    --model_name "follower_rcm_cls" \
    --reward \
    --reward_type "cls" \
    --batch_size 64 \
    --no_speaker \
    --follower_prefix "tasks/R2R/follower/snapshots/follower_sf_aug_sample_train-literal_speaker_data_augmentation_iter_50000"

Evaluation

Here we take the BABYWALK model trained on R2R as an example, evaluated on R2T8. <br>

CUDA_VISIBLE_DEVICES=0 python src/val_follower.py \
    --task_name R2T8 \
    --split_postfix "_landmark" \
    --one_by_one \
    --one_by_one_mode "landmark" \
    --model_name "follower_bbw" \
    --history \
    --follower_prefix "tasks/R2R/follower/snapshots/best_model"

Download the models reported in our paper

chmod +x download_model.sh
./download_model.sh

Performance comparison on SDTW

Models trained on R4R

| Model | Eval R2R | Eval R4R | Eval R6R | Eval R8R |
| --- | --- | --- | --- | --- |
| SF | 14.8 | 9.2 | 5.2 | 5.0 |
| RCM(FIDELITY) | 18.3 | 13.7 | 7.9 | 6.1 |
| REGRETFUL | 13.4 | 13.5 | 7.5 | 5.6 |
| FAST | 14.2 | 15.5 | 7.7 | 6.3 |
| BABYWALK | 27.8 | 17.3 | 13.1 | 11.5 |
| BABYWALK(COGROUND) | 31.6 | 20.0 | 15.9 | 13.9 |

Models trained on R2R

| Model | Eval R2R | Eval R4R | Eval R6R | Eval R8R |
| --- | --- | --- | --- | --- |
| SF | 27.2 | 6.7 | 7.2 | 3.8 |
| RCM(FIDELITY) | 34.4 | 7.2 | 8.4 | 4.3 |
| REGRETFUL | 40.6 | 9.8 | 6.8 | 2.4 |
| FAST | 45.4 | 7.2 | 8.5 | 2.4 |
| BABYWALK | 36.9 | 13.8 | 11.2 | 9.8 |
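SDTW (Success weighted by normalized Dynamic Time Warping) rewards an agent both for reaching the goal and for staying close to the reference path along the way. The following self-contained sketch computes it over 2-D coordinates with a Euclidean metric; the benchmarks actually use shortest-path distances on the navigation graph, and the success threshold here is illustrative.

```python
from math import exp, hypot

def dtw(pred, ref):
    """Dynamic-time-warping cost between two 2-D paths (Euclidean metric)."""
    INF = float("inf")
    n, m = len(pred), len(ref)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = hypot(pred[i - 1][0] - ref[j - 1][0],
                         pred[i - 1][1] - ref[j - 1][1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def sdtw(pred, ref, success_threshold=3.0):
    """nDTW = exp(-DTW / (|ref| * threshold)); SDTW gates nDTW by success,
    i.e. whether the agent stops within `success_threshold` of the goal."""
    ndtw = exp(-dtw(pred, ref) / (len(ref) * success_threshold))
    goal_dist = hypot(pred[-1][0] - ref[-1][0], pred[-1][1] - ref[-1][1])
    return ndtw if goal_dist <= success_threshold else 0.0
```

A perfect reproduction of the reference path scores 1.0, while a path that ends far from the goal scores 0.0 regardless of its shape.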

Citation

Please cite the following BibTeX entry if you use any content from this repository:

@inproceedings{zhu2020babywalk,
    title = "{B}aby{W}alk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps",
    author = "Zhu, Wang and Hu, Hexiang and Chen, Jiacheng and Deng, Zhiwei and Jain, Vihan and Ie, Eugene and Sha, Fei",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    pages = "2539--2556",
}