Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

<a href="https://yunzeman.github.io/" style="color:blue;">Yunze Man</a> · <a href="https://zsh2000.github.io/" style="color:blue;">Shuhong Zheng</a> · <a href="https://zpbao.github.io/" style="color:blue;">Zhipeng Bao</a> · <a href="http://www.cs.cmu.edu/~hebert" style="color:blue;">Martial Hebert</a> · <a href="https://cs.illinois.edu/about/people/department-faculty/lgui" style="color:blue;">Liang-Yan Gui</a> · <a href="https://yxw.web.illinois.edu/" style="color:blue;">Yu-Xiong Wang</a>

[NeurIPS 2024] [Project Page] [arXiv] [pdf] [BibTeX]

This repository contains the official PyTorch implementation of the paper "Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding". The paper is available on arXiv, and the project page is online. This work has been accepted to NeurIPS 2024.

About

<img src="assets/visualization.png" width="100%"/>
<img src="assets/pipeline.png" width="100%"/>

We design a unified framework, shown in the figure above, that extracts features from different foundation models, constructs 3D scene embeddings from them, and evaluates those embeddings on multiple downstream tasks. Existing work usually represents a complex indoor scene with a combination of 2D and 3D modalities. Given a scene represented as posed images, videos, and 3D point clouds, we extract feature embeddings with a collection of vision foundation models. For image- and video-based models, a multi-view 3D projection module projects their features into 3D space for the subsequent 3D scene evaluation tasks.
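To make the multi-view projection step more concrete, here is a minimal NumPy sketch of one common way to lift per-pixel 2D features onto 3D points: project each point into every posed view, read the feature at the pixel it lands on, and average over the views that see it. This is an illustration under stated assumptions, not the repository's implementation. The function names `project_points` and `fuse_multiview_features` are hypothetical, the intrinsics are assumed to be expressed at the resolution of the feature maps, and the occlusion check against depth maps that real pipelines typically perform is omitted.

```python
import numpy as np


def project_points(points_world, world_to_cam, intrinsics):
    """Project N world-space points into one camera.

    Returns pixel coordinates of shape (N, 2) and camera-space depth of shape (N,).
    """
    n = points_world.shape[0]
    homo = np.concatenate([points_world, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (world_to_cam @ homo.T).T[:, :3]                          # (N, 3)
    depth = cam[:, 2]
    # Points at or behind the camera are masked out later; avoid dividing by <= 0 here.
    safe_depth = np.where(depth > 1e-6, depth, 1.0)
    uv = (intrinsics @ cam.T).T[:, :2] / safe_depth[:, None]
    return uv, depth


def fuse_multiview_features(points_world, feature_maps, world_to_cams, intrinsics):
    """Average per-pixel features over all views in which each point is visible.

    feature_maps: list of (H, W, C) arrays, one per view.
    world_to_cams: list of (4, 4) extrinsic matrices, one per view.
    Visibility here only means "in front of the camera and inside the image";
    no occlusion test is performed (an assumption of this sketch).
    """
    n = points_world.shape[0]
    c = feature_maps[0].shape[-1]
    feat_sum = np.zeros((n, c))
    counts = np.zeros(n)
    for fmap, w2c in zip(feature_maps, world_to_cams):
        h, w = fmap.shape[:2]
        uv, depth = project_points(points_world, w2c, intrinsics)
        cols = np.round(uv[:, 0]).astype(int)
        rows = np.round(uv[:, 1]).astype(int)
        visible = (depth > 1e-6) & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
        feat_sum[visible] += fmap[rows[visible], cols[visible]]
        counts[visible] += 1
    # Points never observed keep a zero feature vector.
    fused = feat_sum / np.maximum(counts, 1)[:, None]
    return fused, counts
```

Averaging is only one possible fusion rule. The returned `counts` array makes it easy to filter out points that no view observed before running a downstream evaluation.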

We also visualize the scene features extracted by the vision foundation models.

BibTeX

If you use our work in your research, please cite our publication:

@inproceedings{man2024lexicon3d,
  title={Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding},
  author={Man, Yunze and Zheng, Shuhong and Bao, Zhipeng and Hebert, Martial and Gui, Liang-Yan and Wang, Yu-Xiong},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}

News

Environment Setup

Please install the required packages and dependencies according to the requirements.txt file.

In addition,

Dataset Preparation. Download the ScanNet dataset from the official repository, then follow the preprocessing instructions to obtain RGB video frames and point clouds for each ScanNet scene.

Feature Extraction

To extract features from a foundation model, run the corresponding script in the lexicon3d folder. For example, to extract CLIP features, run the following command:

python fusion_scannet_clip.py --data_dir dataset/ScanNet/openscene/ --output_dir dataset/lexicon3d/clip/ --split train --prefix clip

This script extracts CLIP features for the ScanNet dataset. The extracted features are saved in the output_dir folder, which contains the feature embeddings, points, and voxel grids.
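When scripting several extraction runs, it can help to assemble the command programmatically. The sketch below builds the argument list for the command shown above using only the flags that appear in it (`--data_dir`, `--output_dir`, `--split`, `--prefix`). The helper name `build_extraction_command` is hypothetical and is not part of the repository.

```python
import shlex


def build_extraction_command(script, data_dir, output_dir, split, prefix):
    """Return the argv list for one feature-extraction run (hypothetical helper)."""
    return [
        "python", script,
        "--data_dir", data_dir,
        "--output_dir", output_dir,
        "--split", split,
        "--prefix", prefix,
    ]


if __name__ == "__main__":
    cmd = build_extraction_command(
        script="fusion_scannet_clip.py",
        data_dir="dataset/ScanNet/openscene/",
        output_dir="dataset/lexicon3d/clip/",
        split="train",
        prefix="clip",
    )
    # Print the command for inspection; to launch it, pass `cmd` to
    # subprocess.run(cmd, check=True) from inside the lexicon3d folder.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if a dataset path contains spaces.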

Evaluation on Downstream Tasks

We provide scripts to evaluate the extracted features on the downstream tasks. For example, to evaluate them on the 3D Question Answering task, change to the evals/3D-LLM/3DLLM_BLIP2-base folder and run the following command:

python -m torch.distributed.run --nproc_per_node=4 train.py --cfg-path lavis/projects/blip2/train/finetune_sqa.yaml

Refer to the evals folder for more details on the evaluation scripts.

Acknowledgements

This repo builds on the fantastic work OpenScene. We also thank the authors of P3DA and of all the relevant visual foundation models for their great work and for open-sourcing their code.