M3DRef-CLIP

<a href="https://pytorch.org/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white"></a> <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/Lightning-792DE4?style=for-the-badge&logo=pytorch-lightning&logoColor=white"></a> <a href="https://wandb.ai/site"><img alt="WandB" src="https://img.shields.io/badge/Weights_&_Biases-FFBE00?style=for-the-badge&logo=WeightsAndBiases&logoColor=white"></a>

This is the official implementation for Multi3DRefer: Grounding Text Description to Multiple 3D Objects.

Model Architecture

Requirements

This repo contains CUDA implementations; please make sure your GPU has a compute capability of 3.0 or above.
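
A quick way to verify this (an optional sanity check, assuming PyTorch with CUDA support is already installed):

# print the compute capability of the default CUDA device, e.g. (8, 6)
python -c "import torch; print(torch.cuda.get_device_capability())"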

We report the maximum computing resource usage with a batch size of 4:

|               | Training | Inference |
|---------------|----------|-----------|
| GPU mem usage | 15.2 GB  | 11.3 GB   |

Setup

Conda (recommended)

We recommend using Miniconda to manage system dependencies.

# create and activate the conda environment
conda create -n m3drefclip python=3.10
conda activate m3drefclip

# install PyTorch 2.0.1
conda install pytorch torchvision pytorch-cuda=11.7 -c pytorch -c nvidia

# install PyTorch3D with dependencies
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
conda install pytorch3d -c pytorch3d

# install MinkowskiEngine with dependencies
conda install -c anaconda openblas
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
--install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"

# install Python libraries
pip install .

# install CUDA extensions
cd m3drefclip/common_ops
pip install .
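
As an optional sanity check (not part of the official setup), you can verify that the main dependencies import correctly:

# check that PyTorch, PyTorch3D and MinkowskiEngine are importable and CUDA is visible
python -c "import torch, pytorch3d, MinkowskiEngine; print(torch.__version__, torch.cuda.is_available())"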

Pip

Note: Setting up with pip (no conda) requires OpenBLAS to be pre-installed on your system.
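
On Debian/Ubuntu, for example, OpenBLAS can be installed with the system package manager (adapt this to your distribution):

# install OpenBLAS system-wide (Debian/Ubuntu example)
sudo apt-get install libopenblas-dev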

# create and activate the virtual environment
virtualenv env
source env/bin/activate

# install PyTorch 2.0.1
pip install torch torchvision

# install PyTorch3D
pip install pytorch3d

# install MinkowskiEngine
pip install MinkowskiEngine

# install Python libraries
pip install .

# install CUDA extensions
cd m3drefclip/common_ops
pip install .

Data Preparation

Note: Both the ScanRefer and Nr3D datasets require the ScanNet v2 dataset. Please preprocess it first.

ScanNet v2 dataset

  1. Download the ScanNet v2 dataset (train/val/test); refer to ScanNet's instructions for more details. The raw dataset files should be organized as follows:

    m3drefclip # project root
    ├── dataset
    │   ├── scannetv2
    │   │   ├── scans
    │   │   │   ├── [scene_id]
    │   │   │   │   ├── [scene_id]_vh_clean_2.ply
    │   │   │   │   ├── [scene_id]_vh_clean_2.0.010000.segs.json
    │   │   │   │   ├── [scene_id].aggregation.json
    │   │   │   │   ├── [scene_id].txt
    
  2. Pre-process the data; this converts the original meshes and annotations to .pth data:

    python dataset/scannetv2/preprocess_all_data.py data=scannetv2 +workers={cpu_count}
    
  3. Pre-process the multiview features from ENet: please refer to the instructions in ScanRefer's repo, with one modification:

    • comment out lines 51 to 56 in batch_load_scannet_data.py, since we follow D3Net's setting, which does not downsample points here.

    Then put the generated enet_feats_maxpool.hdf5 (116 GB) under m3drefclip/dataset/scannetv2.

ScanRefer dataset

  1. Download the ScanRefer dataset (train/val). Also, download the test set. The raw dataset files should be organized as follows:

    m3drefclip # project root
    ├── dataset
    │   ├── scanrefer
    │   │   ├── metadata
    │   │   │   ├── ScanRefer_filtered_train.json
    │   │   │   ├── ScanRefer_filtered_val.json
    │   │   │   ├── ScanRefer_filtered_test.json
    
  2. Pre-process the data; "unique/multiple" labels will be added to the raw .json files for evaluation purposes:

    python dataset/scanrefer/add_evaluation_labels.py data=scanrefer
    

Nr3D dataset

  1. Download the Nr3D dataset (train/test). The raw dataset files should be organized as follows:

    m3drefclip # project root
    ├── dataset
    │   ├── nr3d
    │   │   ├── metadata
    │   │   │   ├── nr3d_train.csv
    │   │   │   ├── nr3d_test.csv
    
  2. Pre-process the data; "easy/hard/view-dep/view-indep" labels will be added to the raw .csv files for evaluation purposes:

    python dataset/nr3d/add_evaluation_labels.py data=nr3d
    

Multi3DRefer dataset

  1. Download the Multi3DRefer dataset (train/val). The raw dataset files should be organized as follows:
    m3drefclip # project root
    ├── dataset
    │   ├── multi3drefer
    │   │   ├── metadata
    │   │   │   ├── multi3drefer_train.json
    │   │   │   ├── multi3drefer_val.json
    

Pre-trained detector

We pre-trained PointGroup (implemented in MINSU3D) on ScanNet v2 and use it as the detector, with coordinates + colors + multi-view features as inputs.

  1. Download the pre-trained detector. The detector checkpoint file should be organized as follows:
    m3drefclip # project root
    ├── checkpoints
    │   ├── PointGroup_ScanNet.ckpt
    

Training, Inference and Evaluation

Note: Configuration files are managed by Hydra; you can add or override any configuration attribute by passing it as a command-line argument.

# log in to WandB
wandb login

# train a model with the pre-trained detector, using predicted object proposals
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt

# train a model with the pretrained detector, using GT object proposals
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt model.network.detector.use_gt_proposal=True

# resume training from a checkpoint; this restores all hyperparameters stored in the .ckpt file
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={checkpoint_experiment_name} ckpt_path={ckpt_file_path}

# test a model from a checkpoint and save its predictions
python test.py data={scanrefer/nr3d/multi3drefer} data.inference.split={train/val/test} ckpt_path={ckpt_file_path} pred_path={predictions_path}

# evaluate predictions
python evaluate.py data={scanrefer/nr3d/multi3drefer} pred_path={predictions_path} data.evaluation.split={train/val/test}
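
In the commands above, a plain key=value argument overrides an existing configuration attribute, while the + prefix (e.g. +detector_path=...) adds an attribute that is not yet in the config, following Hydra's convention. For example (the override keys below are illustrative; substitute the attributes you actually want to change):

# override an existing attribute and add a new one (illustrative keys)
python train.py data=scanrefer experiment_name=demo data.batch_size=2 +detector_path=checkpoints/PointGroup_ScanNet.ckpt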

Checkpoints

ScanRefer dataset

M3DRef-CLIP_ScanRefer.ckpt

Performance:

| Split | IoU  | Unique | Multiple | Overall |
|-------|------|--------|----------|---------|
| Val   | 0.25 | 85.3   | 43.8     | 51.9    |
| Val   | 0.5  | 77.2   | 36.8     | 44.7    |
| Test  | 0.25 | 79.8   | 46.9     | 54.3    |
| Test  | 0.5  | 70.9   | 38.1     | 45.5    |

Nr3D dataset

M3DRef-CLIP_Nr3d.ckpt

Performance:

| Split | Easy | Hard | View-dep | View-indep | Overall |
|-------|------|------|----------|------------|---------|
| Test  | 55.6 | 43.4 | 42.3     | 52.9       | 49.4    |

Multi3DRefer dataset

M3DRef-CLIP_Multi3DRefer.ckpt

Performance:

| Split | IoU  | ZT w/ D | ZT w/o D | ST w/ D | ST w/o D | MT   | Overall |
|-------|------|---------|----------|---------|----------|------|---------|
| Val   | 0.25 | 39.4    | 81.8     | 34.6    | 53.5     | 43.6 | 42.8    |
| Val   | 0.5  | 39.4    | 81.8     | 30.6    | 47.8     | 37.9 | 38.4    |

Benchmark

ScanRefer

Convert M3DRef-CLIP predictions to ScanRefer benchmark format:

python dataset/scanrefer/convert_output_to_benchmark_format.py data=scanrefer pred_path={predictions_path} +output_path={output_file_path}

Nr3D

Please refer to the ReferIt3D benchmark to report results.