<!-- PROJECT LOGO --> <p align="center"> <h1 align="center"><img src="https://pengsongyou.github.io/media/openscene/logo.png" width="40">OpenScene: 3D Scene Understanding with Open Vocabularies</h1> <p align="center"> <a href="https://pengsongyou.github.io"><strong>Songyou Peng</strong></a> · <a href="https://www.kylegenova.com/"><strong>Kyle Genova</strong></a> · <a href="https://www.maxjiang.ml/"><strong>Chiyu "Max" Jiang</strong></a> · <a href="https://taiya.github.io/"><strong>Andrea Tagliasacchi</strong></a> <br> <a href="https://people.inf.ethz.ch/pomarc/"><strong>Marc Pollefeys</strong></a> · <a href="https://www.cs.princeton.edu/~funk/"><strong>Thomas Funkhouser</strong></a> </p> <h2 align="center">CVPR 2023</h2> <h3 align="center"><a href="https://arxiv.org/abs/2211.15654">Paper</a> | <a href="https://youtu.be/jZxCLHyDJf8">Video</a> | <a href="https://pengsongyou.github.io/openscene">Project Page</a></h3> <div align="center"></div> </p> <p align="center"> <a href=""> <img src="https://pengsongyou.github.io/media/openscene/teaser.jpg" alt="Logo" width="100%"> </a> </p> <p align="center"> <strong>OpenScene</strong> is a zero-shot approach to perform a series of novel 3D scene understanding tasks using open-vocabulary queries. </p> <br> <!-- TABLE OF CONTENTS --> <details open="open" style='padding: 10px; border-radius:5px 30px 30px 5px; border-style: solid; border-width: 1px;'> <summary>Table of Contents</summary> <ol> <li> <a href="#interactive-demo">Interactive Demo</a> </li> <li> <a href="#installation">Installation</a> </li> <li> <a href="#data-preparation">Data Preparation</a> </li> <li> <a href="#run">Run</a> </li> <li> <a href="#applications">Applications</a> </li> <li> <a href="#todo">TODO</a> </li> <li> <a href="#acknowledgement">Acknowledgement</a> </li> <li> <a href="#citation">Citation</a> </li> </ol> </details>

Interactive Demo

No GPU is needed! Follow these instructions to set up and play with the real-time demo yourself.

<p align="center"> <img src="./media/demo.gif" width="75%" /> </p>

Here we present a real-time, interactive, open-vocabulary scene understanding tool. A user can type in an arbitrary query phrase like snoopy (rare object), somewhere soft (property), made of metal (material), where can I cook? (activity), festive (abstract concept), etc., and the corresponding regions are highlighted.
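Under the hood, such a query can be answered by comparing a CLIP-style text embedding of the phrase against the per-point fused features. Below is a minimal illustrative sketch, not the demo's actual code; the `highlight` helper and the random stand-in features are made up for illustration:

```python
import numpy as np

def highlight(point_feats, query_feat, thresh=0.5):
    """Return a boolean mask of points whose (unit-normalized) fused
    features are similar enough to the query embedding."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    sim = p @ q                       # cosine similarity per point
    return sim >= thresh

# toy example: random 4-D "features" standing in for CLIP embeddings
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 4))
query = rng.normal(size=4)
mask = highlight(feats, query)
print(mask.sum(), "points highlighted")
```

In the real demo, `point_feats` would be the fused OpenScene features of the scene and `query_feat` the CLIP text embedding of the typed phrase.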

Installation

Follow installation.md to install all required packages, so that you can run the evaluation and distillation afterwards.

Data Preparation

We provide the pre-processed 3D & 2D data and multi-view fused features for ScanNet, Matterport3D, nuScenes, and Replica.

Pre-processed 3D & 2D Data

You can pre-process the datasets yourself; see the data pre-processing instructions.

Alternatively, we provide the pre-processed datasets. You can download them by running the script below and following the command-line instructions to select the corresponding datasets:

bash scripts/download_dataset.sh

The script downloads and unpacks the data into the folder data/. You can also download the datasets somewhere else and point to that folder with a symbolic link:

ln -s /PATH/TO/DOWNLOADED/FOLDER data
<details> <summary><strong>List of provided processed data</strong> (click to expand):</summary> </details>

Note: the 2D processed datasets (e.g. scannet_2d) are only needed if you want to run multi-view feature fusion yourself. If so, please follow the instructions for multi-view fusion.

Multi-view Fused Features

To evaluate our OpenScene model or distill a 3D model, one needs the multi-view fused image feature for each 3D point (see Sec. 3.1 of the paper).
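Conceptually, the fusion averages each point's image features over the views in which that point is visible. The sketch below illustrates the idea with a hypothetical `fuse_multiview` helper and toy tensors; it is not the repository's implementation:

```python
import numpy as np

def fuse_multiview(view_feats, visibility):
    """Average per-view image features over the views where each 3D
    point is visible.

    view_feats: (V, N, D) features for N points in each of V views
    visibility: (V, N) boolean, True where the point projects into the view
    """
    w = visibility[..., None].astype(view_feats.dtype)   # (V, N, 1)
    n_views = np.maximum(w.sum(axis=0), 1)               # avoid divide-by-zero
    return (view_feats * w).sum(axis=0) / n_views        # (N, D)

# toy check: 3 views, 5 points, 8-D features, all points visible everywhere
V, N, D = 3, 5, 8
feats = np.ones((V, N, D))
vis = np.ones((V, N), dtype=bool)
fused = fuse_multiview(feats, vis)
print(fused.shape)  # (5, 8)
```

Points that are invisible in every view simply keep a zero feature here; the actual pipeline also handles projection and occlusion checks to build the visibility mask.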

You can run the following to directly download the provided fused features:

bash scripts/download_fused_features.sh
<details> <summary><strong>List of provided fused features</strong> (click to expand):</summary> </details>

Alternatively, you can generate the multi-view features yourself by following the instructions.

Run

Once you have installed the environment and obtained the processed 3D data and multi-view fused features, you are ready to run our OpenScene distilled/ensemble model for 3D semantic segmentation, or to distill your own model from scratch.

Evaluation for 3D Semantic Segmentation with Pre-defined Labelsets

<p align="center"> <img src="./media/benchmark_screenshot.jpg" width="80%" /> </p>

Here you can evaluate OpenScene features on different datasets (ScanNet/Matterport3D/nuScenes/Replica) that have pre-defined labelsets. The supported labelsets are defined in label_constants.py.
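Zero-shot evaluation against a labelset boils down to embedding each label name with the text encoder and assigning every point the best-matching label by cosine similarity. A toy sketch, with an illustrative `classify_points` function and random stand-in embeddings:

```python
import numpy as np

def classify_points(point_feats, label_feats):
    """Assign each 3D point the label whose text embedding has the
    highest cosine similarity with the point's feature."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = label_feats / np.linalg.norm(label_feats, axis=1, keepdims=True)
    return (p @ t.T).argmax(axis=1)          # (N,) label index per point

# toy stand-ins: 4 points, 3 labels, 6-D embeddings
rng = np.random.default_rng(1)
points = rng.normal(size=(4, 6))
labels = rng.normal(size=(3, 6))
pred = classify_points(points, labels)
print(pred.shape)
```

Because only the label *embeddings* enter the computation, swapping in a different labelset requires no retraining.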

The general command to run evaluation:

sh run/eval.sh EXP_DIR CONFIG.yaml feature_type

where you specify your experiment directory EXP_DIR and replace CONFIG.yaml with the corresponding config file under config/. feature_type selects which per-point OpenScene features to evaluate: fusion (multi-view fused 2D features), distill (3D distilled features), or ensemble (2D-3D ensemble).

To evaluate with distill and ensemble, the easiest way is to use a pre-trained 3D distilled model. You can do this by using one of the config files with the postfix _pretrained.

For example, to evaluate the semantic segmentation on Replica, you can simply run:

# 2D-3D ensemble
sh run/eval.sh out/replica_openseg config/replica/ours_openseg_pretrained.yaml ensemble

# Run 3D distilled model
sh run/eval.sh out/replica_openseg config/replica/ours_openseg_pretrained.yaml distill

# Evaluate with 2D fused features
sh run/eval.sh out/replica_openseg config/replica/ours_openseg_pretrained.yaml fusion

The script automatically downloads the pre-trained 3D model and runs the evaluation with the Matterport 21-class labelset. You can find all outputs in out/replica_openseg.
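For intuition, the 2D-3D ensemble can be thought of as scoring both feature sets against the label embeddings and keeping, per point, whichever feature scores higher. The sketch below is an illustrative approximation with made-up names, not the repository's code:

```python
import numpy as np

def ensemble_predict(feat_2d, feat_3d, label_feats):
    """Score the 2D fused and 3D distilled features of each point
    against the label embeddings; keep the higher-scoring one."""
    t = label_feats / np.linalg.norm(label_feats, axis=1, keepdims=True)

    def scores(f):
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        return f @ t.T                               # (N, L) cosine scores

    s2d, s3d = scores(feat_2d), scores(feat_3d)
    use_2d = s2d.max(axis=1) >= s3d.max(axis=1)      # per-point choice
    return np.where(use_2d, s2d.argmax(axis=1), s3d.argmax(axis=1))

# toy stand-ins: 10 points, 8-D features, 5 labels
rng = np.random.default_rng(2)
f2d, f3d = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
lbl = rng.normal(size=(5, 8))
pred = ensemble_predict(f2d, f3d, lbl)
print(pred.shape)
```

This is why the ensemble tends to combine the complementary strengths of the 2D and 3D features.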

For evaluation options, see under TEST inside config/replica/ours_openseg_pretrained.yaml. Below are important evaluation options that you might want to modify:

If you want to use a 3D model distilled from scratch, set model_path to the corresponding checkpoint EXP/model/model_best.pth.tar.

Distillation

Finally, if you want to distill a new 3D model from scratch, run the distillation script under run/ with your experiment directory and config file, analogous to the evaluation command above.

For available distillation options, please take a look at DISTILL inside config/matterport/ours_openseg.yaml.
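For intuition, distillation drives the 3D network's per-point predictions toward the multi-view fused 2D features; a cosine-similarity loss of roughly the following shape is one natural choice. This is an illustrative sketch, not the repository's training code:

```python
import numpy as np

def cosine_distill_loss(pred_feats, target_feats):
    """Mean (1 - cosine similarity) between the 3D network's predicted
    per-point features and the multi-view fused 2D target features."""
    p = pred_feats / np.linalg.norm(pred_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=1)))

# identical predictions give (numerically) zero loss
x = np.random.default_rng(3).normal(size=(6, 4))
loss = cosine_distill_loss(x, x)
print(loss)  # ~0.0 up to floating point
```

Minimizing such a loss makes the 3D backbone emit features that live in the same embedding space as the text encoder, which is what enables open-vocabulary queries without 2D images at test time.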

Using Your Own Datasets

  1. Follow the data pre-processing instructions and modify the code accordingly to obtain the processed 2D & 3D data.
  2. Follow the feature fusion instructions and modify the code to obtain multi-view fused features.
  3. Distill a model on your own, or take our provided 3D distilled model weights (e.g. our 3D model for ScanNet or Matterport3D), and modify model_path accordingly.
  4. If you want to evaluate on a specific labelset, change the labelset in the config.

Applications

Besides zero-shot 3D semantic segmentation, we can also perform the following tasks:

Acknowledgement

We sincerely thank Golnaz Ghiasi for providing guidance on using OpenSeg model. Our appreciation extends to Huizhong Chen, Yin Cui, Tom Duerig, Dan Gnanapragasam, Xiuye Gu, Leonidas Guibas, Nilesh Kulkarni, Abhijit Kundu, Hao-Ning Wu, Louis Yang, Guandao Yang, Xiaoshuai Zhang, Howard Zhou, and Zihan Zhu for helpful discussion. We are also grateful to Charles R. Qi and Paul-Edouard Sarlin for their proofreading.

We build some parts of our code on top of the BPNet repository.

TODO

We very much welcome all kinds of contributions to the project.

Citation

If you find our code or paper useful, please cite

@inproceedings{Peng2023OpenScene,
  title     = {OpenScene: 3D Scene Understanding with Open Vocabularies},
  author    = {Peng, Songyou and Genova, Kyle and Jiang, Chiyu "Max" and Tagliasacchi, Andrea and Pollefeys, Marc and Funkhouser, Thomas},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023}
}