Awesome Vision-and-Language Navigation

This repo keeps track of recent advances in Vision-and-Language Navigation research. Please check out our ACL 2022 VLN survey paper for the categorization approach and detailed discussions of tasks, methods, and future directions: Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions.

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc. Through structured analysis of current progress and challenges, we highlight the limitations of current VLN and opportunities for future work. This paper serves as a thorough reference for the VLN research community.

Datasets and Benchmarks

Initial Instruction

[R2R]: Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
CVPR 2018 paper
[CHAI]: Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction
EMNLP 2018 paper
[LANI]: Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction
EMNLP 2018 paper
Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning
RSS 2018 paper
[RoomNav]: Building Generalizable Agents with a Realistic and Rich 3D Environment
arXiv 2018 paper
[EmbodiedQA]: Embodied Question Answering
CVPR 2018 paper
[IQA]: IQA: Visual Question Answering in Interactive Environments
CVPR 2018 paper
[Room-for-Room]: Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
ACL 2019 paper
[XL-R2R]: Cross-Lingual Vision-Language Navigation
arXiv 2019 paper
[Touchdown]: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
CVPR 2019 paper
The Streetlearn Environment and Dataset
arXiv 2019 paper
Learning To Follow Directions in Street View
arXiv 2019 paper
[Room-Across-Room]: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
EMNLP 2020 paper
[VLNCE]: Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
ECCV 2020 paper
[Retouchdown]: Releasing Touchdown on StreetLearn as a Public Resource for Language Grounding Tasks in Street View
Spatial Language Understanding Workshop 2020 paper
[REVERIE]: Remote Embodied Visual Referring Expression in Real Indoor Environments
CVPR 2020 paper
[ALFRED]: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
CVPR 2020 paper
[Landmark-RxR]: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision
NeurIPS 2021 paper
Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation
ICRA 2021 [Project Page] [arXiv] [GitHub]
[Talk2Nav]: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory
IJCV 2021 paper
[Habitat-Matterport]: 1000 Large-scale 3D Environments for Embodied AI
NeurIPS 2021 paper
[SOON]: Scenario Oriented Object Navigation with Graph-based Exploration
CVPR 2021 paper
[ZInD]: Zillow Indoor Dataset: Annotated Floor Plans With 360° Panoramas and 3D Room Layouts
CVPR 2021 paper

Guidance

[VNLA]: Vision-based Navigation with Language-based Assistance via Imitation Learning with Indirect Intervention
CVPR 2019 paper
[HANNA]: Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning
EMNLP 2019 paper
[CEREALBAR]: Executing Instructions in Situated Collaborative Interactions
ACL 2019 paper
[Just Ask]: An Interactive Learning Framework for Vision and Language Navigation
AAAI 2020 paper

Dialog

[Talk the Walk]: Navigating New York City through Grounded Dialogue
arXiv 2018 paper
[CVDN]: Vision-and-Dialog Navigation
CoRL 2019 paper
Collaborative Dialogue in Minecraft
ACL 2019 paper
[RobotSlang]: The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation
CoRL 2020 paper
[TEACh]: Task-driven Embodied Agents that Chat
AAAI 2022 paper
[DialFRED]: Dialogue-enabled agents for embodied instruction following
RA-L 2022 paper
[Don't Copy the Teacher]: Data and Model Challenges in Embodied Dialogue
EMNLP 2022 paper
[AVDN]: Aerial Vision-and-Dialog Navigation
ACL 2023 paper

Evaluation

Here we list papers that introduce new evaluation metrics.

On Evaluation of Embodied Navigation Agents
arXiv 2018 paper
Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
CVPR 2019 paper
Vision-and-Dialog Navigation
CoRL 2019 paper
Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
ACL 2019 paper
General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping
arXiv 2019 paper

Methods

Representation Learning

Pretraining

Robust Navigation with Language Pretraining and Stochastic Sampling
EMNLP 2019 paper
Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
ECCV 2020 paper
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
ECCV 2020 paper
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
CVPR 2020 paper
Episodic Transformer for Vision-and-Language Navigation
ICCV 2021 paper
The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation
ICCV 2021 paper
A Recurrent Vision-and-Language BERT for Navigation
CVPR 2021 paper
SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
CVPR 2021 paper
Airbert: In-domain Pretraining for Vision-and-Language Navigation
ICCV 2021 paper
NDH-Full: Learning and Evaluating Navigational Agents on Full-Length Dialogue
EMNLP 2021 paper

Semantic Understanding

Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
ACL 2019 paper
Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
ACL 2019 paper
Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters
BMVC 2019 paper
Diagnosing the Environment Bias in Vision-and-Language Navigation
IJCAI 2020 paper
Object-and-Action Aware Model for Visual Language Navigation
ECCV 2020 paper
Diagnosing Vision-and-Language Navigation: What Really Matters
arXiv 2021 paper
Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression
CVPR 2021 paper
Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
IEEE CAS 2021 paper
SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments
ICPR 2022 [Paper] [Website] [Video]
FILM: Following Instructions in Language with Modular Methods
ICLR 2022 [Paper] [Website] [Video] [Code]
Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue
EMNLP 2022 [Paper] [Video]

Graph Representation

Chasing Ghosts: Instruction Following as Bayesian State Tracking
NeurIPS 2019 paper
Language and Visual Entity Relationship Graph for Agent Navigation
NeurIPS 2020 paper
Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
NeurIPS 2020 paper
Topological Planning with Transformers for Vision-and-Language Navigation
CVPR 2021 paper

Memory-augmented Model

Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning
EMNLP 2019 paper
Vision-Dialog Navigation by Exploring Cross-modal Memory
CVPR 2020 paper
A Recurrent Vision-and-Language BERT for Navigation
CVPR 2021 paper
Scene-Intuitive Agent for Remote Embodied Visual Grounding
CVPR 2021 paper
History Aware Multimodal Transformer for Vision-and-Language Navigation
NeurIPS 2021 paper

Auxiliary Tasks

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
ICLR 2019 paper
Transferable Representation Learning in Vision-and-Language Navigation
ICCV 2019 paper
Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks
CVPR 2020 paper

Action Strategy Learning

Reinforcement Learning

Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
ECCV 2018 paper
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
CVPR 2019 paper
Vision-language navigation policy learning and adaptation
TPAMI 2020 paper
Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
ACL 2019 paper
General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping
arXiv 2019 paper
Perceive, transform, and act: Multi-modal attention networks for vision-and-language navigation
arXiv 2019 paper
From language to goals: Inverse reinforcement learning for vision-based instruction following
arXiv 2019 paper
Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision
NeurIPS 2021 paper
Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
IEEE CAS 2021 paper

Exploration during Navigation

Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation
CVPR 2019 paper
Active Visual Information Gathering for Vision-Language Navigation
ECCV 2020 paper
Pathdreamer: A World Model for Indoor Navigation
ICCV 2021 paper

Navigation Planning

Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
ECCV 2018 paper
Chasing Ghosts: Instruction Following as Bayesian State Tracking
NeurIPS 2019 paper
Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule
ICLR 2020 paper
Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation
EMNLP Findings 2020 paper
Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation
ICRA 2021 [Project Page] [arXiv] [GitHub]
Waypoint Models for Instruction-guided Navigation in Continuous Environments
ICCV 2021 paper
Pathdreamer: A World Model for Indoor Navigation
ICCV 2021 paper
Neighbor-view Enhanced Model for Vision and Language Navigation
arXiv 2021 paper
Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments
EMNLP 2021 paper
One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones
arXiv 2022 paper

Asking for Help

CVDN: Vision-and-Dialog Navigation
CoRL 2019 paper
Learning when and what to ask: a hierarchical reinforcement learning framework
EMNLP 2019 paper
Just Ask: An Interactive Learning Framework for Vision and Language Navigation
AAAI 2020 paper
RMM: A Recursive Mental Model for Dialog Navigation
EMNLP Findings 2020 paper
Self-Motivated Communication Agent for Real-World Vision-Dialog Navigation
ICCV 2021 paper
TEACh: Task-driven Embodied Agents that Chat
arXiv 2021 paper
A Framework for Learning to Request Rich and Contextually Useful Information from Humans
arXiv 2021 paper

Data-centric Learning

Data Augmentation

Speaker-Follower Models for Vision-and-Language Navigation
NeurIPS 2018 paper
Multi-modal Discriminative Model for Vision-and-Language Navigation
SpLU&RoboNLP Workshop 2019 paper
Transferable Representation Learning in Vision-and-Language Navigation
ICCV 2019 paper
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
NAACL 2019 paper
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling
ECCV 2020 paper
Counterfactual vision-and-language navigation: Unravelling the unseen
NeurIPS 2020 paper
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
EACL 2021 paper
Vision-Language Navigation with Random Environmental Mixup
ICCV 2021 paper
On the Evaluation of Vision-and-Language Navigation Instructions
EACL 2021 paper
EnvEdit: Environment Editing for Vision-and-Language Navigation
CVPR 2022 paper
AIGeN: An Adversarial Approach for Instruction Generation in VLN
CVPRW 2024 paper

Curriculum Learning

BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps
ACL 2020 paper
Curriculum Learning for Vision-and-Language Navigation
NeurIPS 2021 paper

Multitask Learning

Environment-agnostic Multitask Learning for Natural Language Grounded Navigation
ECCV 2020 paper
Embodied Multimodal Multitask Learning
IJCAI 2020 paper

Instruction Interpretation

Multi-View Learning for Vision-and-Language Navigation
arXiv 2020 paper
Sub-Instruction Aware Vision-and-Language Navigation
EMNLP 2020 paper
Look wide and interpret twice: Improving performance on interactive instruction-following tasks
arXiv 2021 paper

Prior Exploration

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
CVPR 2019 paper
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
NAACL 2019 paper
Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
ACL 2019 paper
Counterfactual Vision-and-Language Navigation: Unravelling the Unseen
NeurIPS 2020 paper
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling
ECCV 2020 paper
Topological Planning with Transformers for Vision-and-Language Navigation
CVPR 2021 paper
Rethinking the Spatial Route Prior in Vision-and-Language Navigation
arXiv 2021 paper

Related Areas

Using 2D map environments

Learning to follow navigational directions
ACL 2010 paper
Learning to interpret natural language navigation instructions from observations
AAAI 2011 paper
Run through the streets: A new dataset and baseline models for realistic urban navigation
EMNLP 2019 paper

Using synthetic environments

Walk the talk: Connecting language, knowledge, and action in route instructions
AAAI 2006 paper
Learning to Interpret Natural Language Navigation Instructions from Observations
AAAI 2011 paper
Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight
PMLR 2020 paper

Visual Navigation

Target-driven visual navigation in indoor scenes using deep reinforcement learning
ICRA 2017 paper
Learning to navigate
MULEA 2019 paper
Learning to navigate in cities without a map
NeurIPS 2018 paper
Deep Learning for Embodied Vision Navigation: A Survey
arXiv 2021 paper
Self-Supervised Object Goal Navigation with In-Situ Finetuning
IROS 2023 paper, video

If you find this repo useful for your research, please cite

@InProceedings{jing2022vln,
      title={Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions}, 
      author={Jing Gu and Eliana Stefani and Qi Wu and Jesse Thomason and Xin Eric Wang},
      booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
      year = {2022}
}