
<div align="center"> <h1>🎇NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models</h1> <div> <a href='https://gengzezhou.github.io' target='_blank'>Gengze Zhou<sup>🍕</sup></a>; <a href='http://www.yiconghong.me' target='_blank'>Yicong Hong<sup>🌭</sup></a>; <a href='https://zunwang1.github.io' target='_blank'>Zun Wang<sup>🍔</sup></a>; <a href='https://eric-xw.github.io' target='_blank'>Xin Eric Wang<sup>🌮</sup></a>; <a href='http://www.qi-wu.me' target='_blank'>Qi Wu<sup>🍕</sup></a> </div> <sup>🍕</sup>AIML, University of Adelaide <sup>🌭</sup>Adobe Research <sup>🍔</sup>UNC Chapel Hill <sup>🌮</sup>University of California, Santa Cruz <br> <div> <a href='https://github.com/GengzeZhou/NavGPT-2' target='_blank'><img alt="Static Badge" src="https://img.shields.io/badge/NavGPT-v0.2-blue"></a> <a href='https://arxiv.org/abs/2407.12366' target='_blank'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a> <a href="https://github.com/salesforce/LAVIS"><img alt="Static Badge" src="https://img.shields.io/badge/Salesforce-LAVIS-blue?logo=salesforce"></a> </div> </div>

🍹 Abstract

Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction following robotic navigation. Such a trend underscores the potential of LLMs to generalize navigational reasoning and diverse language understanding. However, a significant discrepancy in agent performance is observed when integrating LLMs in the Vision-and-Language navigation (VLN) tasks compared to previous downstream specialist models. Furthermore, the inherent capacity of language to interpret and facilitate communication in agent interactions is often underutilized in these integrations. In this work, we strive to bridge the divide between VLN-specialized models and LLM-based navigation paradigms, while maintaining the interpretative prowess of LLMs in generating linguistic navigational reasoning. By aligning visual content in a frozen LLM, we encompass visual observation comprehension for LLMs and exploit a way to incorporate LLMs and navigation policy networks for effective action predictions and navigational reasoning. We demonstrate the data efficiency of the proposed methods and eliminate the gap between LM-based agents and state-of-the-art VLN specialists.

🍸 Method

🍻 TODOs

Release 🎇NavGPT-2 policy finetuning code.
Release visual instruction tuning code.
Release navigational reasoning data.
Release pretrained models weights.
Release data preparation scripts.

🧋 Prerequisites

🍭 Installation

There are two ways to set up the environment, Conda and Docker. Choose whichever you prefer.

Conda Environment

Create a conda environment and install all dependencies:

conda create --name NavGPT2 python=3.8
conda activate NavGPT2
pip install -r requirements.txt

Install the Matterport3D simulator by following the instructions here.

The provided Dockerfile contains some hints on how to build the simulator inside a conda environment.

Docker Environment

A Dockerfile is provided to build the environment with all dependencies installed. You can either pull the Docker image directly from Docker Hub or build it yourself using the provided Dockerfile.

OPTION 1: Pull the Docker image from Docker Hub:

docker pull gengzezhou/mattersim-torch2.2.0cu118:v2
docker run -it gengzezhou/mattersim-torch2.2.0cu118:v2 /bin/bash

Start a container and run the following lines to make sure you activate the environment:

source /root/miniconda3/etc/profile.d/conda.sh
conda activate
export PYTHONPATH=/root/Matterport3DSimulator/build
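
Once the environment is active, a quick import check can confirm that the simulator build directory is visible to Python. The sketch below assumes the simulator's Python bindings are exposed as a module named `MatterSim`; the helper `module_available` is hypothetical and only illustrates the check.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be located."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "MatterSim" is assumed to be the module built under
    # /root/Matterport3DSimulator/build; adjust if your build differs.
    for name in ("torch", "MatterSim"):
        status = "found" if module_available(name) else "MISSING"
        print(f"{name}: {status}")
```

If `MatterSim` is reported as missing, the `PYTHONPATH` export above is the first thing to re-check.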

OPTION 2: Build the Docker image from the provided Dockerfile:

docker build -t navgpt2:v1 .
docker run -it navgpt2:v1 /bin/bash

🍬 Data Preparation

Download the required data:

python download.py --data

This script will automatically download the following datasets:

R2R Data and Pre-computed Image Features (EVA-CLIP-g):

Source: Huggingface Datasets: ZGZzz/NavGPT-R2R

Destination: datasets

Instruction Tuning Data for NavGPT-2:

Source: Huggingface Datasets: ZGZzz/NavGPT-Instruct

Destination: datasets/NavGPT-Instruct

Unzip the downloaded R2R data:

cd datasets
cat R2R.zip.* > R2R.zip
unzip R2R.zip
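
On systems without `cat` or `unzip`, the same two steps can be done from Python. This is a minimal sketch, assuming the parts are named `R2R.zip.*` and that their name order is the correct concatenation order (as the shell glob above also assumes); `join_and_extract` is a hypothetical helper, not part of the repository.

```python
import shutil
import zipfile
from pathlib import Path


def join_and_extract(directory: str, archive_name: str = "R2R.zip") -> Path:
    """Concatenate `<archive_name>.*` parts in name order, then unzip."""
    root = Path(directory)
    # Sorted order stands in for the shell's expansion of R2R.zip.*
    parts = sorted(root.glob(archive_name + ".*"))
    if not parts:
        raise FileNotFoundError(f"no parts matching {archive_name}.* in {root}")

    archive = root / archive_name
    with archive.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)  # stream, parts may be large

    with zipfile.ZipFile(archive) as zf:
        zf.extractall(root)
    return archive
```

Calling `join_and_extract("datasets")` would then leave the extracted `R2R` folder next to the reassembled archive.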

The data directory is structured as follows:

datasets
├── NavGPT-Instruct
│   ├── NavGPT_train_v1.json
│   └── NavGPT_val_v1.json
└── R2R
    ├── annotations
    ├── connectivities
    └── features
        └── MP3D_eva_clip_g_can.lmdb
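
Before training, it can help to confirm that the layout matches the tree above. The following is a small sketch of such a check; the list of paths is copied from the tree, while `missing_paths` is a hypothetical helper written for this example.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_PATHS = [
    "NavGPT-Instruct/NavGPT_train_v1.json",
    "NavGPT-Instruct/NavGPT_val_v1.json",
    "R2R/annotations",
    "R2R/connectivities",
    "R2R/features/MP3D_eva_clip_g_can.lmdb",
]


def missing_paths(datasets_dir: str) -> list:
    """Return the expected entries that do not exist under `datasets_dir`."""
    root = Path(datasets_dir)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths("datasets")
    print("layout OK" if not missing else f"missing: {missing}")
```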

Alternatively, you can specify the datasets to download by providing the --dataset argument to the script. For example, to download only the R2R data:

python download.py \
    --data \
    --dataset 'r2r'
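
To illustrate how a selection flag like `--dataset` can map onto the two Hugging Face repositories listed earlier, here is a hedged sketch. It is not the actual `download.py`: the `DATASETS` table, the `plan_downloads` function, and the `'instruct'` key are assumptions made for this example (only `'r2r'` appears in this README).

```python
import argparse

# Repository IDs and destinations as listed in this README.
# The "instruct" key is a made-up label for this sketch.
DATASETS = {
    "r2r": ("ZGZzz/NavGPT-R2R", "datasets"),
    "instruct": ("ZGZzz/NavGPT-Instruct", "datasets/NavGPT-Instruct"),
}


def plan_downloads(argv=None):
    """Return (repo_id, destination) pairs selected by the CLI flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", action="store_true")
    parser.add_argument("--dataset", choices=sorted(DATASETS), default=None)
    args = parser.parse_args(argv)

    if not args.data:
        return []
    # No --dataset given: select everything.
    keys = [args.dataset] if args.dataset else list(DATASETS)
    return [DATASETS[k] for k in keys]
```

Each resulting pair could then be passed to something like `huggingface_hub.snapshot_download(repo_id=..., repo_type="dataset", local_dir=...)`, though whether the real script does this is not stated here.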

🍫 Pretrained Models

Download the pretrained models:

python download.py --checkpoints

This script will automatically download the following pretrained models:

<table border="1" width="100%"> <tr align="center"> <th>Model</th><th>Log</th><th colspan="5">R2R unseen</th><th colspan="5">R2R test</th> </tr> <tr align="center"> <td></td><td></td><td>TL</td><td>NE</td><td>OSR</td><td>SR</td><td>SPL</td><td>TL</td><td>NE</td><td>OSR</td><td>SR</td><td>SPL</td> </tr> <tr align="center"> <td><a href="https://huggingface.co/ZGZzz/NavGPT2-FlanT5-XL/tree/main">NavGPT2-FlanT5-XL</a></td><td><a href="assets/NavGPT2-FlanT5-XL.log">here</a></td><td>12.81</td><td>3.33</td><td>78.50</td><td>69.89</td><td>58.86</td><td>13.51</td><td>3.39</td><td>77.38</td><td>70.76</td><td>59.60</td> </tr> <tr align="center"> <td><a href="https://huggingface.co/ZGZzz/NavGPT2-FlanT5-XXL/tree/main">NavGPT2-FlanT5-XXL</a></td><td><a href="assets/NavGPT2-FlanT5-XXL.log">here</a></td><td>14.04</td><td>2.98</td><td>83.91</td><td>73.82</td><td>61.06</td><td>14.74</td><td>3.33</td><td>80.30</td><td>71.84</td><td>60.28</td> </tr> </table>
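
The column abbreviations in the table are not defined in this README. Assuming the standard R2R conventions (TL: trajectory length in metres, NE: navigation error in metres, OSR: oracle success rate, SR: success rate, SPL: success weighted by path length, with a 3 m success radius), the metrics could be computed roughly as follows. The `Episode` fields and `summarize` function are hypothetical names for this sketch and are not taken from the repository's evaluation code.

```python
from dataclasses import dataclass
from typing import List

SUCCESS_RADIUS_M = 3.0  # standard R2R success threshold (assumed here)


@dataclass
class Episode:
    path_length: float      # metres travelled by the agent (TL)
    shortest_length: float  # shortest-path distance from start to goal
    final_distance: float   # distance to goal at the stop position (NE)
    min_distance: float     # closest distance to goal along the path


def summarize(episodes: List[Episode]) -> dict:
    """Average TL/NE in metres and OSR/SR/SPL as percentages."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    sr = [float(e.final_distance < SUCCESS_RADIUS_M) for e in episodes]
    osr = [float(e.min_distance < SUCCESS_RADIUS_M) for e in episodes]
    # SPL: success scaled by shortest / max(actual, shortest) path length.
    spl = [
        s * e.shortest_length / max(e.path_length, e.shortest_length, 0.01)
        for s, e in zip(sr, episodes)
    ]
    return {
        "TL": sum(e.path_length for e in episodes) / n,
        "NE": sum(e.final_distance for e in episodes) / n,
        "OSR": 100 * sum(osr) / n,
        "SR": 100 * sum(sr) / n,
        "SPL": 100 * sum(spl) / n,
    }
```

Under these definitions, SPL can never exceed SR, which matches the ordering of the two columns in the table.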

The checkpoints include the following files:

Pretrained NavGPT-2 Q-former weights, which will be placed in the map_nav_src/models/lavis/output directory.
Finetuned NavGPT-2 policy weights, which will be placed in the datasets/R2R/trained_models directory.

Alternatively, you can specify the models to download by providing the --model argument to the script. For example, to download only the NavGPT2-FlanT5-XL weights:

python download.py \
    --checkpoints \
    --model 'xl'

🧃 Stage 1: Visual Instruction Tuning of NavGPT-2

You can skip this stage and use the provided pretrained NavGPT-2 Q-former directly for policy finetuning.

Set the cache directory in defaults.yaml as the absolute path to NavGPT-2.

Perform visual instruction tuning of NavGPT-2 Q-former with FlanT5-xl:

cd map_nav_src/models
bash run_script/train_NavGPT_xl.sh

Alternatively, you can switch the LLM to FlanT5-xxl, Vicuna-7B, or Vicuna-13B by running the following scripts:

bash run_script/train_NavGPT_xxl.sh
bash run_script/train_NavGPT_7B.sh
bash run_script/train_NavGPT_13B.sh

The training logs and checkpoints will be saved in the map_nav_src/models/lavis/output directory.

🍹 Stage 2: Policy Finetuning of NavGPT-2

Evaluate the trained NavGPT-2 policy with FlanT5-xl on the R2R dataset:

cd map_nav_src
bash scripts/val_r2r_xl.sh

Finetune and evaluate the NavGPT-2 policy with FlanT5-xl on the R2R dataset:

cd map_nav_src
bash scripts/run_r2r_xl.sh

This script will use the released instruction-tuned NavGPT-2 Q-former as initialization. The results will be saved in the map_nav_src/datasets/R2R/exprs_map/finetune directory.

Replace the --qformer_ckpt_path argument in the run_r2r_xl.sh script with the path to the desired NavGPT-2 Q-former checkpoint to finetune the policy with a different model.

Alternatively, you can switch the LLM to FlanT5-xxl, Vicuna-7B, or Vicuna-13B by running the following scripts:

bash scripts/run_r2r_xxl.sh
bash scripts/run_r2r_vicuna7b.sh
bash scripts/run_r2r_vicuna13b.sh

🥂 Acknowledgements

We extend our gratitude to Matterport3D for their valuable contributions to the open-source platform and community.

We also acknowledge the significant benefits of using DUET and InstructBLIP in this work. Our thanks go out to the creators of these outstanding projects.

🍺 Citation

If you find this work helpful, please consider citing:

@article{zhou2024navgpt2,
  title={NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models},
  author={Zhou, Gengze and Hong, Yicong and Wang, Zun and Wang, Xin Eric and Wu, Qi},
  journal={arXiv preprint arXiv:2407.12366},
  year={2024}
}