<h2 align="center"> <span><img src="assets/logo025.png" width="4%" style="transform: translate(0,9px)"></span> <b>SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding</b> </h2> <div align="center" margin-bottom="6em"> <a target="_blank" href="https://buzz-beater.github.io/">Baoxiong Jia<sup>✶</sup></a>, <a target="_blank" href="https://yixchen.github.io/">Yixin Chen<sup>✶</sup></a>, <a target="_blank" href="https://scholar.google.com/citations?user=fKRgnIMAAAAJ/">Huangyue Yu</a>, <a target="_blank" href="https://github.com/jetpackfirstme">Yan Wang</a>, <a target="_blank" href="https://nxsedson.github.io/">Xuesong Niu</a>, <a target="_blank" href="https://tengyu.ai/">Tengyu Liu</a>, <a target="_blank" href="https://liqing-ustc.github.io/">Qing Li</a>, <a target="_blank" href="https://siyuanhuang.com/">Siyuan Huang</a> </div> &nbsp; <div align="center"> <a href="https://arxiv.org/abs/2401.09340" target="_blank"> <img src="https://img.shields.io/badge/Paper-arXiv-deepgreen" alt="Paper arXiv"></a> <a href="https://scene-verse.github.io" target="_blank"> <img src="https://img.shields.io/badge/Project-Page-9cf" alt="Project Page"></a> <a href="https://youtu.be/UnujS0EVxKU" target="_blank"> <img src="https://img.shields.io/badge/Video-YouTube-9966ff" alt="Video"></a> <a href="https://scene-verse.github.io" target="_blank"> <img src="https://img.shields.io/badge/Data-SceneVerse-blue" alt="Data"></a> <a href="https://scene-verse.github.io" target="_blank"> <img src="https://img.shields.io/badge/Model-GPS-darkorange" alt="Model"></a> </div> &nbsp; <div align="left"> <img src="assets/overview.png" width="99%" alt="SceneVerse Teaser"> </div>

We propose SceneVerse, the first million-scale 3D vision-language dataset, with 68K 3D indoor scenes and 2.5M vision-language pairs. We demonstrate the scaling effect by (i) achieving state-of-the-art results on all existing 3D visual grounding benchmarks and (ii) showcasing zero-shot transfer capabilities with our GPS (Grounded Pre-training for Scenes) model.

News

Data

See DATA.md for detailed instructions on data download, processing, and visualization. The data inventory is listed below:

Columns: Dataset, Object Caption, Scene Caption, Ref-Annotation, Ref-Pairwise (rel2), Ref-MultiObject (relm), Ref-Star (star), Ref-Chain (Optional) (chain)

ScanNet: ScanRefer, Nr3D
MultiScan
ARKitScenes
HM3D: template
3RScan
Structured3D: template
ProcTHOR: template (in four columns)
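
The short codes in the table header (rel2, relm, star, chain) can be kept in a small lookup when scripting over the referral annotations. The sketch below is only an illustration: `ReferralType`, `REFERRAL_TYPES`, and `referral_codes` are hypothetical names, not part of the SceneVerse codebase, and the actual file layout is described in DATA.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferralType:
    """One referral annotation type from the data inventory header (hypothetical helper)."""
    code: str               # short code shown in the table header
    name: str               # column name shown in the table header
    optional: bool = False  # Ref-Chain is marked "(Optional)" in the inventory


# Codes and names are copied from the inventory header; nothing else is assumed.
REFERRAL_TYPES = (
    ReferralType("rel2", "Ref-Pairwise"),
    ReferralType("relm", "Ref-MultiObject"),
    ReferralType("star", "Ref-Star"),
    ReferralType("chain", "Ref-Chain", optional=True),
)


def referral_codes(include_optional: bool = True) -> list[str]:
    """Return the short codes, optionally skipping the optional ones."""
    return [t.code for t in REFERRAL_TYPES if include_optional or not t.optional]
```

Calling `referral_codes(include_optional=False)` drops `chain`, which may be convenient when the optional annotations have not been downloaded.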

Training and Inference

See TRAIN.md for the inventory of available checkpoints and detailed instructions on training and testing with pre-trained checkpoints. The checkpoint inventory is listed below:

| Setting | Description | Corresponding Experiment | Checkpoint based on experiment setting |
|---|---|---|---|
| pre-trained | GPS model pre-trained on SceneVerse | 3D-VL grounding (Tab.2) | Model |
| scratch | GPS model trained on datasets from scratch | 3D-VL grounding (Tab.2)<br/>SceneVerse-val (Tab.3) | ScanRefer, Sr3D, Nr3D, SceneVerse-val |
| fine-tuned | GPS model fine-tuned on datasets with grounding heads | 3D-VL grounding (Tab.2) | ScanRefer, Sr3D, Nr3D |
| zero-shot | GPS model trained on SceneVerse without data from ScanNet and MultiScan | Zero-shot Transfer (Tab.3) | Model |
| zero-shot text | GPS | Zero-shot Transfer (Tab.3) | ScanNet, SceneVerse-val |
| text-ablation | Ablations on the type of language used during pre-training | Ablation on Text (Tab.7) | Template only, Template+LLM |
| scene-ablation | Ablations on the use of synthetic scenes during pre-training | Ablation on Scene (Tab.8) | Real only, S3D only, ProcTHOR only |
| model-ablation | Ablations on the use of losses during pre-training | Ablation on Model Design (Tab.9) | Refer only, Refer+Obj-lvl, w/o Scene-lvl |
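
When picking a checkpoint for a given experiment, the inventory above can be mirrored as a plain dictionary. This is a minimal sketch under stated assumptions: `CHECKPOINTS` and `variants_for` are hypothetical names introduced here, the entries only restate the inventory, and the actual download links and loading commands are in TRAIN.md.

```python
# Hypothetical mirror of the checkpoint inventory; not part of the SceneVerse codebase.
CHECKPOINTS = {
    "pre-trained": {"experiments": ["3D-VL grounding (Tab.2)"],
                    "variants": ["Model"]},
    "scratch": {"experiments": ["3D-VL grounding (Tab.2)", "SceneVerse-val (Tab.3)"],
                "variants": ["ScanRefer", "Sr3D", "Nr3D", "SceneVerse-val"]},
    "fine-tuned": {"experiments": ["3D-VL grounding (Tab.2)"],
                   "variants": ["ScanRefer", "Sr3D", "Nr3D"]},
    "zero-shot": {"experiments": ["Zero-shot Transfer (Tab.3)"],
                  "variants": ["Model"]},
    "zero-shot text": {"experiments": ["Zero-shot Transfer (Tab.3)"],
                       "variants": ["ScanNet", "SceneVerse-val"]},
    "text-ablation": {"experiments": ["Ablation on Text (Tab.7)"],
                      "variants": ["Template only", "Template+LLM"]},
    "scene-ablation": {"experiments": ["Ablation on Scene (Tab.8)"],
                       "variants": ["Real only", "S3D only", "ProcTHOR only"]},
    "model-ablation": {"experiments": ["Ablation on Model Design (Tab.9)"],
                       "variants": ["Refer only", "Refer+Obj-lvl", "w/o Scene-lvl"]},
}


def variants_for(setting: str) -> list[str]:
    """Return the checkpoint variants listed for a setting, or raise a helpful error."""
    if setting not in CHECKPOINTS:
        known = ", ".join(sorted(CHECKPOINTS))
        raise KeyError(f"unknown setting {setting!r}; expected one of: {known}")
    return CHECKPOINTS[setting]["variants"]
```

For example, `variants_for("fine-tuned")` returns the three grounding benchmarks that have fine-tuned checkpoints listed.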

BibTeX

@inproceedings{jia2024sceneverse,
  title={SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding},
  author={Jia, Baoxiong and Chen, Yixin and Yu, Huangyue and Wang, Yan and Niu, Xuesong and Liu, Tengyu and Li, Qing and Huang, Siyuan},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}

Acknowledgements

We thank the authors of ScanRefer, ScanNet, 3RScan, ReferIt3D, Structured3D, HM3D, ProcTHOR, ARKitScenes, and MultiScan for open-sourcing their awesome datasets. We also adapted code heavily from ScanQA, SQA3D, and 3D-VisTA for training and inference.