[NeurIPS 2023] Weakly Supervised 3D Open-vocabulary Segmentation

This repository contains a PyTorch implementation of the paper Weakly Supervised 3D Open-vocabulary Segmentation. Our method segments 3D scenes using open-vocabulary text descriptions without requiring any segmentation annotations.

Installation

Tested on Ubuntu 20.04 with PyTorch 1.12.1.

Install environment:

conda create -n 3dovs python=3.9
conda activate 3dovs
pip install torch torchvision
pip install ftfy regex tqdm scikit-image opencv-python configargparse lpips imageio-ffmpeg kornia tensorboard
pip install git+https://github.com/openai/CLIP.git

Datasets

Please download the datasets from this link and put them in ./data. You can put them elsewhere if you modify the corresponding paths in the configs. The datasets are organized as follows:

/data
|  /scene0
|  |--/images
|  |  |--00.png
|  |  |--01.png
|  |  ...
|  |--/segmentations
|  |  |--classes.txt
|  |  |--/test_view0
|  |  |  |--class0.png
|  |  |  ...
|  |  |--/test_view1
|  |  |  |--class0.png
|  |  |  ...
|  |  ...
|  |--poses_bounds.npy
|  /scene1
|  ...

where images contains the RGB images, segmentations contains the segmentation annotations for the test views, segmentations/classes.txt stores the text descriptions of the classes, and poses_bounds.npy contains the camera poses generated by COLMAP.
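Before training, it can help to confirm that a scene folder matches the layout above. The following is a minimal sketch, not part of the repository: the helper name check_scene is hypothetical, and it only checks that the expected files and folders exist, not that their contents are valid.

```python
from pathlib import Path


def check_scene(scene_dir):
    """Return a list of human-readable problems found in one scene folder."""
    scene = Path(scene_dir)
    problems = []

    # RGB images are expected as .png files directly under images/.
    images = scene / "images"
    if not images.is_dir() or not any(images.glob("*.png")):
        problems.append("images/ is missing or contains no .png files")

    # Camera poses file at the scene root.
    if not (scene / "poses_bounds.npy").is_file():
        problems.append("poses_bounds.npy is missing")

    # Class prompts plus one sub-folder of masks per annotated test view.
    seg = scene / "segmentations"
    if not seg.is_dir():
        problems.append("segmentations/ is missing")
    else:
        if not (seg / "classes.txt").is_file():
            problems.append("segmentations/classes.txt is missing")
        for view in sorted(p for p in seg.iterdir() if p.is_dir()):
            if not any(view.glob("*.png")):
                problems.append(f"segmentations/{view.name}/ has no class masks")

    return problems
```

Calling check_scene("data/scene0") returns an empty list when the folder matches the layout, and a list of messages otherwise.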

Quick Start

We provide checkpoints for the scenes at this link. After downloading them, you can test the segmentation with:

bash scripts/test_segmentation.sh [CKPT_PATH] [CONFIG_FILE] [GPU_ID]

The config files are stored in configs, and each file is named configs/$scene_name.txt. The results will be saved in the checkpoint's directory. More details can be found in scripts/test_segmentation.sh.

Data Preparation

Training requires a hierarchy of CLIP features extracted from image patches. You can extract the CLIP features with the following command (replace $scene_name with the name of the scene you want to extract features for):

bash scripts/extract_clip_features.sh data/$scene_name/images clip_features/$scene_name [GPU_ID]

The extracted features will be saved in clip_features/$scene_name.
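To illustrate what a hierarchy of image patches can look like, here is a minimal sketch that enumerates sliding-window boxes at several scales. It is not the code in scripts/extract_clip_features.sh: the function name patch_boxes, the scale fractions, and the stride ratio are all assumptions chosen for illustration.

```python
def patch_boxes(height, width, scales=(0.5, 0.25, 0.125), stride_ratio=0.5):
    """Yield (scale, top, left, size) sliding-window boxes for each scale.

    `scales` are patch sizes as a fraction of the shorter image side and
    `stride_ratio` is the step as a fraction of the patch size. Both are
    illustrative values, not the ones used by the repository's script.
    """
    short_side = min(height, width)
    for scale in scales:
        size = max(1, int(short_side * scale))
        stride = max(1, int(size * stride_ratio))
        # Slide a square window over the image without leaving its bounds.
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                yield scale, top, left, size
```

In a full pipeline, each box would be cropped from the image, resized to the CLIP input resolution, and passed through the CLIP image encoder. That step is omitted here because it requires the model weights.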

Training

1. Train original TensoRF

This step reconstructs a TensoRF for each scene. Please modify datadir and expname in configs/reconstruction.txt to specify the dataset path and the experiment name. By default, datadir is set to data/$scene_name and expname to $scene_name. You can then train the original TensoRF with:

bash scripts/reconstruction.sh [GPU_ID]

The reconstructed TensoRF will be saved in log/$scene_name.

2. Train segmentation

We provide the training configs for our datasets under configs as $scene_name.txt. You can train the segmentation with:

bash scripts/segmentation.sh [CONFIG_FILE] [GPU_ID]

The trained model will be saved in log_seg/$scene_name. Training takes about 1.5 hours and uses about 14 GB of GPU memory.

Troubleshooting

1. Loading CLIP features is very slow

This is because the CLIP features are very large (512 channels each) and consume a lot of memory. You can load the CLIP features of fewer views by setting clip_input to 0.5 or a smaller value in the config file. Normally 0.5 is enough for good performance.
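A back-of-the-envelope estimate shows why loading fewer views helps. The sketch below assumes dense float32 feature maps and treats clip_input as the fraction of views loaded. The helper name, the feature-map resolution, and the view count are hypothetical, and the repository may store features differently (for example, at several scales, which would multiply the total).

```python
def clip_feature_gib(num_views, feat_h, feat_w, channels=512,
                     bytes_per_value=4, clip_input=1.0):
    """Estimate memory (GiB) for dense CLIP feature maps of the loaded views."""
    # Only a fraction of the views is loaded when clip_input < 1.
    loaded_views = max(1, int(num_views * clip_input))
    total_bytes = loaded_views * feat_h * feat_w * channels * bytes_per_value
    return total_bytes / 2**30


# Hypothetical scene: 32 views with 128x128 feature maps.
full = clip_feature_gib(32, 128, 128)                  # all views
half = clip_feature_gib(32, 128, 128, clip_input=0.5)  # half of the views
```

Under these assumptions the memory footprint scales linearly with the number of loaded views, so halving clip_input halves the estimate.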

2. Prompt engineering

To check whether your prompts are good, set test_prompt to a view number in the config file. The relevancy maps of each class for this view will then be written to clip_features/clip_relevancy_maps, with each map named scale_class.png. If the relevancy maps do not look good for some class, modify the prompts in segmentations/classes.txt and test again. In our experiments, we find that specific descriptions that include the object's texture and color work better.
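For intuition about what a relevancy map measures, the sketch below shows one common formulation: the cosine similarity between image features and the text embedding of each class, followed by a softmax over classes. This is an assumption for illustration and may differ from how the repository computes its maps; the function name and the temperature value are hypothetical.

```python
import numpy as np


def relevancy_maps(pixel_feats, text_feats, temperature=0.01):
    """Softmax over classes of the cosine similarity between features.

    pixel_feats: (H, W, C) image features; text_feats: (K, C) class embeddings.
    Returns an (H, W, K) array whose last axis sums to 1.
    """
    # Normalize both sides so the dot product is a cosine similarity.
    pixel = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    text = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    logits = pixel @ text.T / temperature          # (H, W, K)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```

A good prompt should produce a map that is high on the target object and low elsewhere. If two prompts produce nearly identical maps, making the descriptions more specific is a reasonable next step.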

3. Custom data

For custom scenes, you can generate the camera poses with COLMAP by following the "recover camera poses" section from this link. If your custom data does not have annotated segmentation maps, set has_segmentation_maps to 0 in the config file.

4. Bad segmentation results

Bad segmentation results may be due to poor geometry reconstruction, erroneous camera poses, or inaccurate text prompts. If none of these is the main cause, you can try adjusting dino_neg_weight in the config file. Usually, if the segmentation results do not align well with the object boundaries, you can set dino_neg_weight to a value larger than 0.2, such as 0.22. If the segmentation contains errors, you can set dino_neg_weight to a value smaller than 0.2, such as 0.18. Since dino_neg_weight encourages the model to assign different labels when the DINO features are distant, a higher value encourages sharper boundaries but also makes the model more unstable.
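The trade-off can be illustrated with a toy pairwise regularizer in which a weight scales the penalty on label agreement between pixels whose DINO features are dissimilar. This is only a sketch of the idea described above, not the loss used in the paper or the repository; the function name, the similarity threshold, and the exact form of both terms are assumptions.

```python
import numpy as np


def pairwise_label_penalty(dino_feats, label_probs, neg_weight=0.2,
                           sim_threshold=0.5):
    """Toy regularizer over all pixel pairs.

    dino_feats: (N, D) features; label_probs: (N, K) rows summing to 1.
    """
    feats = dino_feats / np.linalg.norm(dino_feats, axis=1, keepdims=True)
    feat_sim = feats @ feats.T                 # (N, N) cosine similarity
    label_agree = label_probs @ label_probs.T  # (N, N), values in [0, 1]

    similar = feat_sim >= sim_threshold
    # Similar features: penalize disagreement between label distributions.
    pos_term = np.where(similar, 1.0 - label_agree, 0.0)
    # Distant features: penalize agreement, scaled by neg_weight.
    neg_term = np.where(~similar, neg_weight * label_agree, 0.0)
    return float((pos_term + neg_term).mean())
```

In this toy version, raising neg_weight pushes pixels with distant features toward different labels more strongly, which mirrors the sharper-boundaries-versus-stability trade-off described above.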

TODO

Currently we only support forward-facing scenes. Support for unbounded 360° scenes could be added with a suitable coordinate transformation.

Acknowledgments

This repo is heavily based on TensoRF. We thank the authors for sharing their amazing work!

Citation

@article{liu2023weakly,
  title={Weakly Supervised 3D Open-vocabulary Segmentation},
  author={Liu, Kunhao and Zhan, Fangneng and Zhang, Jiahui and Xu, Muyu and Yu, Yingchen and Saddik, Abdulmotaleb El and Theobalt, Christian and Xing, Eric and Lu, Shijian},
  journal={arXiv preprint arXiv:2305.14093},
  year={2023}
}