KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation

Xiangyang Li (https://xiangyangli-cn.github.io/), Zihan Wang, Jiahao Yang, Yaowei Wang, and Shuqiang Jiang

This repository is the official implementation of KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation (CVPR 2023).

Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location in real scenes by following natural language instructions. Most previous approaches represent navigable candidates with either entire-view features or object-centric features. However, these representations are not efficient enough for an agent to navigate to the target location. As knowledge provides crucial information that is complementary to the visible content, we propose a Knowledge Enhanced Reasoning Model (KERM) that leverages knowledge to improve the agent's navigation ability. Specifically, we first retrieve facts for the navigation views from a constructed knowledge base. We then build a knowledge enhanced reasoning network, consisting of purification, fact-aware interaction, and instruction-guided aggregation modules, to integrate visual, history, instruction, and fact features for action prediction. Extensive experiments on the REVERIE, R2R, and SOON datasets demonstrate the effectiveness of the proposed method.
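
To give a rough picture of the model before diving into the code, below is a minimal, hypothetical sketch of the reasoning flow described above (purification, fact-aware interaction, and instruction-guided aggregation). Module names, shapes, and wiring are illustrative assumptions and do not mirror the actual implementation in knowledge_nav_src.

# Illustrative sketch only: module names, shapes, and wiring are assumptions,
# not the actual KERM implementation in knowledge_nav_src.
import torch
import torch.nn as nn

class KnowledgeEnhancedReasoning(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        # Purification: condition visual region features on the instruction.
        self.purify = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fact-aware interaction: let purified regions attend to retrieved facts.
        self.fact_interact = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Instruction-guided aggregation: pool region features per candidate view.
        self.aggregate = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.action_head = nn.Linear(dim, 1)

    def forward(self, instr, regions, facts):
        # instr:   (B, L, D) instruction token features
        # regions: (B, R, D) visual region features of a candidate view
        # facts:   (B, F, D) retrieved fact features for those regions
        purified, _ = self.purify(regions, instr, instr)        # purification
        fused, _ = self.fact_interact(purified, facts, facts)   # fact-aware interaction
        view, _ = self.aggregate(instr[:, :1], fused, fused)    # instruction-guided aggregation
        return self.action_head(view).squeeze(-1)               # candidate score

# Toy usage with random tensors.
model = KnowledgeEnhancedReasoning()
score = model(torch.randn(2, 20, 768), torch.randn(2, 5, 768), torch.randn(2, 25, 768))
print(score.shape)  # torch.Size([2, 1])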

Requirements

  1. Install the Matterport3D simulator: follow the instructions here, then add the build directory to PYTHONPATH:
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
  2. Install requirements:
conda create --name KERM python=3.8.0
conda activate KERM
pip install -r requirements.txt
  3. Download the datasets from Dropbox, including processed annotations, features, and pretrained models from VLN-DUET. Put the data in the datasets directory.

  4. Download the pretrained LXMERT model; the files in the bert-base directory can be downloaded from bert-base-uncased.

mkdir -p datasets/pretrained 
wget https://nlp.cs.unc.edu/data/model_LXRT.pth -P datasets/pretrained
  5. Download the preprocessed data and features of KERM from Baidu Netdisk, including the knowledge base (vg.json), annotations of retrieved facts (knowledge.json), cropped image features (clip_crop_image.hdf5), and annotations of the VisualGenome dataset (vg_annotations). Put the kerm_data folder in the datasets directory.

  6. Download the trained KERM models from Baidu Netdisk. (A quick sanity check for the simulator build and the downloaded data layout is sketched after this list.)
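
Before moving on, a quick sanity check like the one below can catch a missing PYTHONPATH entry or a misplaced download. It is only a sketch: the file paths are assumptions inferred from the steps above, not paths enforced by the training code.

# Optional sanity check (a sketch; the expected paths below are assumptions
# based on the download steps above).
import os

try:
    import MatterSim  # requires Matterport3DSimulator/build on PYTHONPATH
    print("Matterport3D simulator import: OK")
except ImportError as e:
    print(f"Matterport3D simulator import failed: {e}")

expected = [
    "datasets/pretrained/model_LXRT.pth",
    "datasets/kerm_data/knowledge.json",
    "datasets/kerm_data/clip_crop_image.hdf5",
]
for path in expected:
    print(f"{path}: {'found' if os.path.exists(path) else 'MISSING'}")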

Build knowledge base

The preprocessed knowledge data is already provided, so you can skip this part. To rebuild it from scratch, run:

cd preprocess
python3 get_knowledge_base.py  # Build knowledge base from VisualGenome dataset (vg.json).
python3 get_fact_feature.py  # Get the features of knowledge base (vg.hdf5).
python3 get_crop_image_feature.py  # Get cropped image features (clip_crop_image.hdf5).
python3 retrieve_facts.py  # Retrieve knowledge facts for all visual regions (knowledge.json). 
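
If you do rebuild the knowledge base, the retrieval step (retrieve_facts.py) is conceptually a nearest-neighbor search between cropped-region features and fact features. The snippet below is a simplified sketch of that idea using cosine similarity and top-k selection; the actual script works on vg.hdf5 and clip_crop_image.hdf5 and may differ in details.

# Simplified sketch of fact retrieval: for each cropped region feature, pick the
# top-k most similar fact features by cosine similarity. Array names and k are
# illustrative; see preprocess/retrieve_facts.py for the actual procedure.
import numpy as np

def retrieve_top_k_facts(region_feats, fact_feats, k=5):
    # region_feats: (num_regions, dim), fact_feats: (num_facts, dim)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    f = fact_feats / np.linalg.norm(fact_feats, axis=1, keepdims=True)
    sim = r @ f.T                                  # cosine similarity matrix
    return np.argsort(-sim, axis=1)[:, :k]         # indices of top-k facts per region

regions = np.random.randn(10, 512).astype(np.float32)
facts = np.random.randn(1000, 512).astype(np.float32)
print(retrieve_top_k_facts(regions, facts).shape)  # (10, 5)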

Pretraining

Combine behavior cloning and auxiliary proxy tasks in pretraining:

cd pretrain_src
bash run_reverie.sh # (run_soon.sh, run_r2r.sh, run_r4r.sh)
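
Conceptually, pretraining optimizes a weighted sum of a behavior cloning (action prediction) loss and auxiliary proxy losses such as masked language modeling. The sketch below only illustrates how such losses might be combined per batch; the task names and weights are placeholders, not the configuration used by the scripts above.

# Illustrative only: combining a behavior-cloning loss with auxiliary proxy
# losses. Task names and weights are placeholders, not the actual pretraining
# configuration in pretrain_src.
import torch

def pretrain_step(batch_losses, weights=None):
    # batch_losses: dict mapping task name -> scalar loss tensor for this batch
    weights = weights or {"action_prediction": 1.0, "masked_lm": 1.0}
    return sum(weights.get(name, 1.0) * loss for name, loss in batch_losses.items())

losses = {
    "action_prediction": torch.tensor(0.8),  # behavior cloning on expert actions
    "masked_lm": torch.tensor(1.2),          # auxiliary proxy task
}
print(pretrain_step(losses))  # tensor(2.)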

Fine-tuning & Evaluation

Use the pseudo interactive demonstrator to fine-tune the model:

cd knowledge_nav_src
bash scripts/run_reverie.sh # (run_soon.sh, run_r2r.sh)
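
The pseudo interactive demonstrator supervises the agent on its own rollouts in a DAgger-like fashion: at each step the agent is trained toward the demonstrator's shortest-path action, while the next action is sampled from a mixture of the expert and the agent's own policy. The function below sketches that sampling idea only; the object interfaces (agent, demonstrator, env) are hypothetical and do not match the actual training loop in knowledge_nav_src.

# Sketch of fine-tuning with a pseudo interactive demonstrator. The agent,
# demonstrator, and env interfaces used here are hypothetical placeholders.
import random

def rollout(agent, demonstrator, env, sample_ratio=0.5, max_steps=15):
    losses = []
    obs = env.reset()
    for _ in range(max_steps):
        predicted = agent.predict(obs)           # agent's own action distribution
        expert = demonstrator.best_action(obs)   # shortest-path action to the goal
        losses.append(agent.loss(predicted, expert))
        # Mix teacher forcing and student forcing when choosing the next action.
        action = expert if random.random() < sample_ratio else agent.sample(predicted)
        obs, done = env.step(action)
        if done:
            break
    return sum(losses) / max(len(losses), 1)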

Citation

@InProceedings{Li2023KERM,
  author    = {Xiangyang Li and Zihan Wang and Jiahao Yang and Yaowei Wang and Shuqiang Jiang},
  title     = {{KERM: K}nowledge Enhanced Reasoning for Vision-and-Language Navigation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  pages     = {2583-2592},
  year      = {2023},
}

Acknowledgements

Our code is based on VLN-DUET, Xmodal-Ctx, and CLIP (ViT-B/16). Thanks for their great work!