Context-aware Alignment and Mutual Masking for 3D-Language Pre-training
This repository is for the paper "Context-aware Alignment and Mutual Masking for 3D-Language Pre-training" (CVPR 2023)
<p align='center'> <img src='fig/overview.png' width="1000px"> </p>

Abstract
3D visual language reasoning plays an important role in effective human-computer interaction. Current approaches to 3D visual reasoning are task-specific and lack pre-training methods for learning generic representations that transfer across tasks. Despite encouraging progress in vision-language pre-training on image-text data, 3D-language pre-training remains an open problem due to limited 3D-language paired data, the highly sparse and irregular structure of point clouds, and ambiguities in the spatial relations of 3D objects under viewpoint changes. In this paper, we present a generic 3D-language pre-training approach that tackles multiple facets of 3D-language reasoning by learning universal representations. Our learning objective consists of two main parts. 1) Context-aware spatial-semantic alignment to establish fine-grained correspondence between point clouds and texts; it reduces relational ambiguities by aligning 3D spatial relationships with textual semantic context. 2) Mutual 3D-language masked modeling to enable cross-modality information exchange. Instead of reconstructing sparse 3D points, for which language can hardly provide cues, we propose masked proposal reasoning to learn semantic class and mask-invariant representations. Our proposed 3D-language pre-training method achieves promising results when adapted to various downstream tasks, including 3D visual grounding, 3D dense captioning and 3D question answering.
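For intuition only, below is a minimal PyTorch-style sketch of the two objective families described above (cross-modal alignment plus masked proposal reasoning). It is not the implementation in this repository; the tensor shapes, module names and exact loss forms are illustrative assumptions.

```python
# Illustrative sketch only -- not the implementation in this repository.
# Tensor shapes, names and the exact loss forms are assumptions made for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPretrainObjective(nn.Module):
    def __init__(self, dim=256, num_classes=18):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)        # predicts semantic class of masked proposals
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, prop_feats, text_feats, prop_mask, prop_labels):
        # prop_feats:  (B, P, D) features of 3D object proposals
        # text_feats:  (B, D)    pooled text features
        # prop_mask:   (B, P)    bool, True where a proposal was masked out
        # prop_labels: (B, P)    ground-truth semantic class per proposal
        B = prop_feats.size(0)

        # 1) 3D-language alignment: contrast pooled scene features against text features.
        scene = F.normalize(prop_feats.mean(dim=1), dim=-1)
        text = F.normalize(text_feats, dim=-1)
        logits = self.logit_scale.exp() * scene @ text.t()            # (B, B)
        targets = torch.arange(B, device=logits.device)
        align_loss = 0.5 * (F.cross_entropy(logits, targets)
                            + F.cross_entropy(logits.t(), targets))

        # 2) Masked proposal reasoning: recover the semantic class of masked proposals,
        #    so that language context helps fill in the missing 3D information.
        cls_logits = self.cls_head(prop_feats[prop_mask])             # (N_masked, C)
        mask_loss = F.cross_entropy(cls_logits, prop_labels[prop_mask])

        return align_loss + mask_loss
```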
Dataset & Setup
Data preparation
Our code builds on the ScanRefer, 3DJCG and ScanQA codebases. Please refer to them for more detailed data preprocessing instructions.
- Download the ScanRefer dataset and unzip it under data/.
- Download the ScanQA dataset under data/qa/.
- Download the preprocessed GLoVE embeddings (~990MB) and put them under data/.
- Download the ScanNetV2 dataset and put (or link) scans/ under (or to) data/scannet/scans/ (please follow the ScanNet instructions for downloading the ScanNet dataset). After this step, there should be folders containing the ScanNet scene data under data/scannet/scans/ with names like scene0000_00.
- Pre-process ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following command. Roughly 3.8GB free space is needed for this step:
cd data/scannet/
python batch_load_scannet_data.py
After this step, you can check if the processed scene data is valid by running:
python visualize.py --scene_id scene0000_00
- (Optional) Pre-process the multiview features from ENet.
  - Download: Download the ENet multiview features (~36GB, hdf5 database) and put it under data/scannet/scannet_data/.
  - Projection:
    a. Download the ENet pretrained weights (1.4MB) and put it under data/.
    b. Download and decompress the extracted ScanNet frames (~13GB).
    c. Change the data paths in lib/config.py marked with TODO accordingly.
    d. Project ENet features from ScanNet frames to point clouds (~36GB, hdf5 database):
python script/multiview_compute/compute_multiview_features.py
python script/multiview_compute/project_multiview_features.py --maxpool --gpu 1
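Optionally, you can sanity-check the prepared data with a short script like the one below. This is a hedged sketch: the .npy naming pattern and the HDF5 file name and per-scene layout follow the ScanRefer/votenet preprocessing conventions and are assumptions here; adjust them if this repository differs.

```python
# Assumed ScanRefer/votenet-style output names -- adjust if they differ in this repository.
import glob
import os
import h5py

# 1) Processed ScanNet scenes: each scene should have a handful of .npy arrays.
npy_files = sorted(glob.glob("data/scannet/scannet_data/scene0000_00*.npy"))
print("scene0000_00 arrays:", [os.path.basename(f) for f in npy_files])

# 2) Projected multiview features: one per-point feature array per scene.
with h5py.File("data/scannet/scannet_data/enet_feats_maxpool.hdf5", "r") as db:
    scene_ids = list(db.keys())
    feats = db[scene_ids[0]][()]
    print(len(scene_ids), "scenes; first feature array:", feats.shape, feats.dtype)
```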
Setup
The code has been tested on Ubuntu 20.04.1 LTS with PyTorch 1.8.0 and CUDA 11.1.
Create and activate a conda environment, for example:
conda create -n 3D-VLP python=3.6
conda activate 3D-VLP
Install pytorch:
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
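Optionally, a quick check that the installed PyTorch matches the tested setup:

```python
# Verify PyTorch / CUDA versions before continuing.
import torch

print(torch.__version__)          # expected: 1.8.0
print(torch.version.cuda)         # expected: 11.1
print(torch.cuda.is_available())  # should be True on a CUDA-capable machine
```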
Install the required packages listed in requirements.txt:
pip install -r requirements.txt
Run the following commands to compile the CUDA modules for the PointNet++ backbone:
cd lib/pointnet2
python setup.py install
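Optionally, verify that the compiled extension loads and runs on the GPU. This is a hedged sketch: the import path and function name below follow the common votenet-style pointnet2 layout and are assumptions, so adjust them to whatever lib/pointnet2 in this repository actually provides.

```python
# Assumed votenet-style layout of lib/pointnet2; run from the repository root.
import sys
import torch

sys.path.append("lib/pointnet2")                        # make the local module importable
import pointnet2_utils                                   # assumed module name

xyz = torch.rand(2, 1024, 3).cuda()                     # dummy point clouds (B, N, 3)
idx = pointnet2_utils.furthest_point_sample(xyz, 256)   # sample 256 points per cloud
print(idx.shape)                                         # expected: torch.Size([2, 256])
```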
Usage
Pre-training
To pre-train the model, run the following command:
sh scripts/pretrain.sh
The pre-trained models will be saved under outputs/exp_pretrain/.
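To inspect what a saved checkpoint contains before fine-tuning, something like the following works; the file name model.pth is a placeholder (check outputs/exp_pretrain/ for the actual name), and the checkpoint may be either a plain state_dict or a dict wrapping one.

```python
# Placeholder file name -- replace with the actual checkpoint in outputs/exp_pretrain/.
import torch

ckpt = torch.load("outputs/exp_pretrain/model.pth", map_location="cpu")
state = ckpt.get("model_state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(len(state), "entries; first few parameter names:")
for key in list(state)[:5]:
    print(" ", key)
```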
Fine-tuning
Fine-tune the model on the ScanRefer dataset for 3D visual grounding and dense captioning:
sh scripts/finetune_scanrefer.sh
Fine-tune the model on ScanQA for 3D question answering:
sh scripts/finetune_scanqa.sh
Evaluate
Before evaluation, please specify the <folder_name> of the fine-tuned model (the folder under outputs/ named with the timestamp plus <tag_name>) and then run the following commands. For 3D visual grounding:
sh scripts/eval_ground.sh
For 3D dense captioning:
sh scripts/eval_cap.sh
For 3D question answering:
sh scripts/eval_qa.sh
Results
<p align='center'> <img src='fig/results.png' width="1000px"> </p>

3D visual grounding
<p align='center'> <img src='fig/grounding_vis.png' width="1000px"> </p>

3D dense captioning
<p align='center'> <img src='fig/captioning_vis.png' width="1000px"> </p>

3D question answering
<p align='center'> <img src='fig/qa_vis.png' width="1000px"> </p>

The visualization results of the point clouds are obtained with MeshLab.
Citation
@inproceedings{jin2023context,
title={Context-aware Alignment and Mutual Masking for 3D-Language Pre-training},
author={Jin, Zhao and Hayat, Munawar and Yang, Yuwei and Guo, Yulan and Lei, Yinjie},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10984--10994},
year={2023}
}
Acknowledgement
We would like to thank facebookresearch/votenet for the 3D object detection codebase and daveredrum/ScanRefer for the 3D localization codebase.
License
This repository is released under the MIT License.