<p align="center">
  <h1 align="center">OpenFusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation</h1>
  <p align="center">
    <a href="https://kashu7100.github.io/"><strong>Kashu Yamazaki</strong></a> ·
    <a href=""><strong>Taisei Hanyu</strong></a> ·
    <a href="https://vhvkhoa.github.io/"><strong>Khoa Vo</strong></a> ·
    <a href="https://phamtrongthang123.github.io/"><strong>Thang Pham</strong></a> ·
    <a href=""><strong>Minh Tran</strong></a>
    <br>
    <a href=""><strong>Gianfranco Doretto</strong></a> ·
    <a href=""><strong>Anh Nguyen</strong></a> ·
    <a href=""><strong>Ngan Le</strong></a>
  </p>
  <h4 align="center"><a href="https://arxiv.org/pdf/2310.03923.pdf">Paper</a> | <a href="https://arxiv.org/abs/2310.03923">arXiv</a> | <a href="https://uark-aicv.github.io/OpenFusion/">Project Page</a></h4>
</p>

<p align="center">
  <img src="assets/pipeline.png" width="80%">
</p>

**TL;DR**: Open-Fusion builds an open-vocabulary, queryable 3D scene representation from a sequence of posed RGB-D images in real time.
## Getting Started 🏁
### System Requirements
- Ubuntu 20.04
- 10GB+ VRAM (~5 GB for SEEM and ~2.5 GB for TSDF); large scenes may require more memory
- Azure Kinect, Intel T265 (for real-world data)
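
Before running, you can check that enough VRAM is free. The query below is a standard `nvidia-smi` invocation (assuming the NVIDIA driver is installed):

```bash
# Report total and free GPU memory per device
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
```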
### Environment Setup
Please build a Docker image from the Dockerfile. Do not forget to export the following environment variables (`REGISTRY_NAME` and `IMAGE_NAME`), as they are used in the `tools/*.sh` scripts:
```bash
export REGISTRY_NAME=<your-registry-name>
export IMAGE_NAME=<your-image-name>
docker build -t $REGISTRY_NAME/$IMAGE_NAME -f docker/Dockerfile .
```
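
Once the image is built, you can start a container from it. This is a minimal sketch using standard Docker flags; the `/workspace` mount point is an assumption, so adjust it to wherever the Dockerfile expects the repository:

```bash
# Start an interactive container with GPU access and the repo mounted.
# NOTE: /workspace is a hypothetical mount point; adjust as needed.
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace \
    $REGISTRY_NAME/$IMAGE_NAME
```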
## Data Preparation
### ICL and Replica
You can run the following script to download the ICL and Replica datasets:
```bash
bash tools/download.sh --data icl replica
```
This script will create a `./sample` folder and download the datasets into it.
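
To fetch only one of the datasets, pass a single name to `--data` (assuming the script accepts a subset of the dataset names shown above):

```bash
# Download only the ICL dataset into ./sample
bash tools/download.sh --data icl
```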
### ScanNet
For ScanNet, please follow the instructions in [ScanNet](https://github.com/ScanNet/ScanNet). Once you have the dataset downloaded, you can run the following script to prepare the data (example for scene `scene0001_00`):
```bash
python tools/prepare_scene.py --filename scene0001_00.sens --output_path sample/scannet/scene0001_00
```
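
To prepare several scenes at once, you can loop over the documented command; the scene IDs below are only placeholders:

```bash
# Hypothetical batch preparation over multiple ScanNet scenes
for scene in scene0001_00 scene0002_00; do
    python tools/prepare_scene.py \
        --filename ${scene}.sens \
        --output_path sample/scannet/${scene}
done
```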
## Model Preparation
Please download the pretrained weights for SEEM from here and place them at `openfusion/zoo/xdecoder_seem/checkpoints/seem_focall_v1.pt`.
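
After downloading, a quick sanity check that the checkpoint sits at the expected path (the directory may need to be created first):

```bash
mkdir -p openfusion/zoo/xdecoder_seem/checkpoints
# After placing the downloaded file, confirm it is where the code expects it:
ls -lh openfusion/zoo/xdecoder_seem/checkpoints/seem_focall_v1.pt
```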
## Run OpenFusion
You can run OpenFusion using `tools/run.sh` as follows:
```bash
bash tools/run.sh --data $DATASET --scene $SCENE
```
Options:
- `--data`: dataset to use (e.g., `icl`)
- `--scene`: scene to use (e.g., `kt0`)
- `--frames`: number of frames to use (default: -1)
- `--live`: run with live monitor (default: False)
- `--stream`: run with data stream from camera server (default: False)

A concrete invocation combining these options is sketched below.
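
For example, to map the ICL `kt0` scene with the live monitor enabled (option values taken from the examples above):

```bash
# Run OpenFusion on the ICL kt0 scene with the live viewer
bash tools/run.sh --data icl --scene kt0 --live
```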
If you want to run OpenFusion with a camera stream, first run the following command on the machine with the Azure Kinect and Intel T265 connected:
```bash
python deploy/server.py
```
Please refer to this for more details.
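
Then, on the machine running OpenFusion, add `--stream` to the usual command; this is a sketch assuming the same `--data`/`--scene` arguments as above:

```bash
# Consume frames from the camera server instead of a recorded dataset
bash tools/run.sh --data $DATASET --scene $SCENE --stream
```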
<p align="center">
  <img src="assets/output.gif">
</p>

## Acknowledgement 🙇
- [SEEM](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once): the VLFM we use to extract region-based features
- [Open3D](https://github.com/isl-org/Open3D): GPU-accelerated 3D library providing the base TSDF implementation
## Citation 🙏
If you find this work helpful, please consider citing it as:
```bibtex
@inproceedings{yamazaki2024open,
  title={Open-fusion: Real-time open-vocabulary 3d mapping and queryable scene representation},
  author={Yamazaki, Kashu and Hanyu, Taisei and Vo, Khoa and Pham, Thang and Tran, Minh and Doretto, Gianfranco and Nguyen, Anh and Le, Ngan},
  booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={9411--9417},
  year={2024},
  organization={IEEE}
}
```
## Contact 📧
Please create an issue on this repository for questions, comments, and bug reports. For other inquiries, please send an email to Kashu Yamazaki.