

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

<p align="left"> <a href='https://arxiv.org/pdf/2308.04352.pdf'> <img src='https://img.shields.io/badge/Paper-PDF-red?style=plastic&logo=adobeacrobatreader&logoColor=red' alt='Paper PDF'> </a> <a href='https://arxiv.org/abs/2308.04352'> <img src='https://img.shields.io/badge/Paper-arXiv-green?style=plastic&logo=arXiv&logoColor=green' alt='Paper arXiv'> </a> <a href='https://3d-vista.github.io/'> <img src='https://img.shields.io/badge/Project-Page-blue?style=plastic&logo=Google%20chrome&logoColor=blue' alt='Project Page'> </a> <a href='https://huggingface.co/spaces/SceneDiffuser/SceneDiffuserDemo'> <img src='https://img.shields.io/badge/Demo-HuggingFace-yellow?style=plastic&logo=AirPlay%20Video&logoColor=yellow' alt='HuggingFace'> </a> <a href='https://drive.google.com/drive/folders/1UZ5V9VbPCU-ikiyj6NI4LyMssblwr1LC?usp=share_link'> <img src='https://img.shields.io/badge/Model-Checkpoints-orange?style=plastic&logo=Google%20Drive&logoColor=orange' alt='Checkpoints'> </a> </p>

Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng📧, Siyuan Huang📧, Qing Li📧

This repository is the official implementation of the ICCV 2023 paper "3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment".


<div align=center> <img src='https://3d-vista.github.io/file/overall.png' width=60%> </div>


3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning.


Install

  1. Create the conda environment
conda env create --name 3dvista --file=environments.yml
  2. Install pointnet2
cd vision/pointnet2
python3 setup.py install
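
After both install steps, a quick import check can confirm that the environment is usable. The snippet below is a minimal sketch, not part of the repository: `missing_modules` is a hypothetical helper, and `torch` is only an assumed dependency, so substitute the packages actually listed in environments.yml.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names whose parent package is absent,
            # or for already-loaded modules with no usable __spec__.
            found = False
        if not found:
            missing.append(name)
    return missing


# "torch" is an assumption; replace the list with the packages
# that environments.yml actually declares.
print(missing_modules(["torch"]))
```

An empty list means every listed module was found. The check only locates top-level modules and does not import them, so it runs quickly even for large packages.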

Prepare dataset

  1. Follow Vil3dref to download the ScanNet data into data/scanfamily/scan_data. The folder should look like this:
├── instance_id_to_gmm_color
├── instance_id_to_loc
├── instance_id_to_name
└── pcd_with_global_alignment
  2. Download ScanRefer + ReferIt3D, ScanQA, and SQA3D, and put them under /data/scanfamily/annotations. The folder should look like this:
├── meta_data
│   ├── cat2glove42b.json
│   ├── scannetv2-labels.combined.tsv
│   ├── scannetv2_raw_categories.json
│   ├── scanrefer_corpus.pth
│   └── scanrefer_vocab.pth
├── qa
│   ├── ScanQA_v1.0_test_w_obj.json
│   ├── ScanQA_v1.0_test_wo_obj.json
│   ├── ScanQA_v1.0_train.json
│   └── ScanQA_v1.0_val.json
├── refer
│   ├── nr3d.jsonl
│   ├── scanrefer.jsonl
│   ├── sr3d+.jsonl
│   └── sr3d.jsonl
├── splits
│   ├── scannetv2_test.txt
│   ├── scannetv2_train.txt
│   └── scannetv2_val.txt
└── sqa_task
    ├── answer_dict.json
    └── balanced
        ├── v1_balanced_questions_test_scannetv2.json
        ├── v1_balanced_questions_train_scannetv2.json
        ├── v1_balanced_questions_val_scannetv2.json
        ├── v1_balanced_sqa_annotations_test_scannetv2.json
        ├── v1_balanced_sqa_annotations_train_scannetv2.json
        └── v1_balanced_sqa_annotations_val_scannetv2.json
  3. Download all checkpoints and put them under project/pretrain_weights

| Checkpoint | Link | Note |
| --- | --- | --- |
| Pre-trained | link | 3D-VisTA pre-trained checkpoint. |
| ScanRefer | link | Fine-tuned ScanRefer from pre-trained checkpoint. |
| ScanQA | link | Fine-tuned ScanQA from pre-trained checkpoint. |
| Sr3D | link | Fine-tuned Sr3D from pre-trained checkpoint. |
| Nr3D | link | Fine-tuned Nr3D from pre-trained checkpoint. |
| SQA | link | Fine-tuned SQA from pre-trained checkpoint. |
| Scan2Cap | link | Fine-tuned Scan2Cap from pre-trained checkpoint. |
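
Before launching a run, it can help to confirm that the downloaded data matches the directory listings in the steps above. The following is a minimal sketch under stated assumptions: `missing_entries` is a hypothetical helper that is not part of the repository, the annotation list is only a subset of the files shown above, and the root paths passed at the bottom should be adjusted to wherever the data was actually placed.

```python
from pathlib import Path

# Entries copied from the scan_data listing above.
EXPECTED_SCAN_DATA = [
    "instance_id_to_gmm_color",
    "instance_id_to_loc",
    "instance_id_to_name",
    "pcd_with_global_alignment",
]

# A subset of the annotations listing above (top-level folders and a few files).
EXPECTED_ANNOTATIONS = [
    "meta_data",
    "qa/ScanQA_v1.0_train.json",
    "qa/ScanQA_v1.0_val.json",
    "refer/nr3d.jsonl",
    "refer/scanrefer.jsonl",
    "refer/sr3d.jsonl",
    "splits/scannetv2_train.txt",
    "splits/scannetv2_val.txt",
    "sqa_task/answer_dict.json",
    "sqa_task/balanced",
]


def missing_entries(root, expected):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Root paths are assumptions; change them to match your local layout.
print(missing_entries("data/scanfamily/scan_data", EXPECTED_SCAN_DATA))
print(missing_entries("data/scanfamily/annotations", EXPECTED_ANNOTATIONS))
```

Each call prints the entries that are still missing, so two empty lists indicate that the checked part of the layout is in place.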

Run 3D-VisTA

To run 3D-VisTA, use the following command, where {task} is one of scanrefer, scanqa, sr3d, nr3d, sqa, or scan2cap.

python3 run.py --config project/vista/{task}_config.yml
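
To make the {task} substitution concrete, the sketch below expands the command for each task name listed above. `build_command` is a hypothetical helper written for illustration and is not part of the repository; it only assembles the argument list and does not start a run.

```python
import sys

# Task names as listed in the text above.
TASKS = ("scanrefer", "scanqa", "sr3d", "nr3d", "sqa", "scan2cap")


def build_command(task, python=sys.executable):
    """Return the argument list that runs run.py with the config for one task."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return [python, "run.py", "--config", f"project/vista/{task}_config.yml"]


# Print the full command for every supported task.
for task in TASKS:
    print(" ".join(build_command(task, python="python3")))
```

To actually launch a task from Python, the returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.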


We would like to thank the authors of Vil3dref for their open-source release.



  title={3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment},
  author={Zhu, Ziyu and Ma, Xiaojian and Chen, Yixin and Deng, Zhidong and Huang, Siyuan and Li, Qing},