CLIP-goes-3D

Official code for the paper "CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition"

arxiv / website


This repository includes the pre-trained models, evaluation and training codes for pre-training, zero-shot, and fine-tuning experiments of CG3D. It is built on the Point-BERT codebase. Please see the end of this document for a full list of code references.

To-Do:

Environment set-up

The known working environment configuration is:

- Python 3.9
- PyTorch 1.12
- CUDA 11.6
  1. Install the conda virtual environment using the provided .yml file.
    conda env create -f environment.yml 
    

(OR)

  1. Install dependencies manually.

    conda create -n cg3d python=3.9
    conda activate cg3d
    
    pip install -r requirements.txt
    
    
    conda install -c anaconda scikit-image scikit-learn scipy
    
    pip install git+https://github.com/openai/CLIP.git
    
    pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
    
    cd ./extensions/chamfer_dist
    python setup.py develop
    
  2. Build the modified timm from source

    cd ./models/SLIP/pytorch-image-models
    pip install -e .
    
  3. Install PointNet ops

    cd third_party/Pointnet2_PyTorch
    pip install -e .
    pip install pointnet2_ops_lib/.
    
  4. Install PyGeM

    cd third_party/PyGeM
    python setup.py install
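
After the steps above, a quick import check can confirm that the compiled extensions are visible to Python. This is a minimal sketch, not part of the codebase; the module names (`pointnet2_ops`, `knn_cuda`, `clip`, `timm`) are assumptions based on the packages installed above and may differ in your environment.

    # sanity_check.py -- minimal sketch to verify the environment set-up above.
    # Module names are assumptions based on the installed packages and may differ.
    import torch

    print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

    import clip            # installed from openai/CLIP
    import timm            # the modified timm built in step 2
    import pointnet2_ops   # from third_party/Pointnet2_PyTorch (step 3)
    import knn_cuda        # from the KNN_CUDA wheel installed in step 1

    print("All extensions imported successfully.")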

Dataset set-up

  1. Download the point cloud datasets for pre-training and fine-tuning.
  2. Render views of the textured CAD models of ShapeNet using this repository. We use a scale of 0.7 and 5 total views.

  3. The data should be organized as shown below (a quick layout check is sketched after the tree).

├── data (this may be wherever you choose)
│   ├── modelnet40_normal_resampled
│   │   │── modelnet10/40_shape_names.txt
│   │   │── modelnet10/40_train/test.txt 
│   │   │── airplane
│   │   │── ...
│   │   │── laptop 
│   ├── ShapeNet55
│   │   │── train.txt
│   │   │── test.txt
│   │   │── shapenet_pc
│   │   │   |── 03211117-62ac1e4559205e24f9702e673573a443.npy
│   │   │   |── ...
│   ├── shapenet_render
│   │   │── train_img.txt
│   │   │── val_img.txt
│   │   │── shape_names.txt
│   │   │── taxonomy.json
│   │   │── camera
│   │   │── img
│   │   │   |── 02691156
│   │   │   |── ...
│   ├── ScanObjectNN
│   │   │── main_split
│   │   │── ...
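
To catch path mistakes early, the layout can be checked with a short script. This is a sketch only, not part of the repository; the root path is a placeholder and the folder names follow the tree above.

    # check_data_layout.py -- minimal sketch; assumes the directory tree shown above.
    from pathlib import Path

    DATA_ROOT = Path("data")  # change to wherever the datasets are stored

    EXPECTED = [
        "modelnet40_normal_resampled",
        "ShapeNet55/shapenet_pc",
        "shapenet_render/img",
        "ScanObjectNN/main_split",
    ]

    for rel in EXPECTED:
        path = DATA_ROOT / rel
        status = "ok" if path.exists() else "MISSING"
        print(f"{status:>8}  {path}")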


1) Model weights

a) Pre-trained CG3D weights

Download SLIP model weights from here.

PointTransformer


| No. of points | Model file | Task         | Configuration file |
| ------------- | ---------- | ------------ | ------------------ |
| 1024          | download   | Pre-training | link               |
| 8192          | download   | Pre-training | link               |

PointMLP


| No. of points | Model file | Task         | Configuration file |
| ------------- | ---------- | ------------ | ------------------ |
| 1024          | download   | Pre-training | link               |
| 8192          | download   | Pre-training | link               |

Test Zero-Shot performance

  python eval.py --config cfgs/ShapeNet55_models/{CONFIG} --exp_name {NAME FOR EXPT}  --ckpts {CKPT PATH} --slip_model {PATH TO SLIP MODEL} --zshot --npoints {1024,8192}
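
For context, the zero-shot evaluation follows the usual CLIP recipe: class names are wrapped in text prompts, embedded by the (prompt-tuned) text encoder, and each point-cloud embedding is assigned the class with the highest cosine similarity. The sketch below illustrates only that matching step and uses the public CLIP API for brevity; CG3D itself uses the SLIP weights downloaded above, and `point_encoder` and the prompt template are placeholders rather than the repository's actual interface.

    # Hypothetical sketch of CLIP-style zero-shot matching; `point_encoder` is a
    # placeholder for the pre-trained 3D encoder (PointTransformer / PointMLP).
    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, _ = clip.load("ViT-B/32", device=device)

    class_names = ["airplane", "chair", "laptop"]   # e.g. ModelNet categories
    prompts = clip.tokenize([f"a point cloud of a {c}" for c in class_names]).to(device)

    with torch.no_grad():
        text_feat = clip_model.encode_text(prompts).float()
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        # point_feat = point_encoder(points)        # (batch, d) -- placeholder
        point_feat = torch.randn(4, text_feat.shape[-1], device=device)  # dummy features
        point_feat = point_feat / point_feat.norm(dim=-1, keepdim=True)

        logits = point_feat @ text_feat.t()          # cosine similarities
        preds = logits.argmax(dim=-1)

    print([class_names[i] for i in preds.tolist()])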

b) Fine-tuning model weights

PointTransformer

| Dataset      | Model Weights | TFBoard |
| ------------ | ------------- | ------- |
| ScanObjectNN | download      | link    |
| ModelNet     | download      | link    |

PointMLP

| Dataset      | Model Weights | TFBoard |
| ------------ | ------------- | ------- |
| ScanObjectNN | download      | link    |
| ModelNet     | download      | link    |

2) Training CG3D

a) Pre-training

Zero-Shot Inference

python eval.py --config cfgs/ShapeNet55_models/{CONFIG} --exp_name {NAME FOR EXPT}  --ckpts {CKPT PATH} --slip_model {PATH TO SLIP MODEL} --zshot --npoints {1024,8192}

Fine-tuning Inference

python eval.py --config  cfgs/{ModelNet_models,ScanObjectNN_models}/{CONFIG} --exp_name {NAME FOR EXPT}  --ckpts {CKPT PATH} --npoints {1024,8192}

b) Fine-tuning

Finetuning PointTransformer:

CUBLAS_WORKSPACE_CONFIG=:4096:8 CUDA_VISIBLE_DEVICES=0 python finetune_cg3d.py --config cfgs/ModelNet_models/PointTransformer.yaml --exp_name {NAME OF EXPT} --finetune_model --ckpts {PATH OF PRETRAINED MODEL WEIGHTS} --dataset_root {PATH OF DATA STORAGE}

Finetuning PointMLP:

 CUBLAS_WORKSPACE_CONFIG=:4096:8 CUDA_VISIBLE_DEVICES=0 python finetune_cg3d.py --config cfgs/ModelNet_models/PointMLP.yaml --exp_name {NAME OF EXPT} --finetune_model --ckpts {PATH OF PRETRAINED MODEL WEIGHTS} --dataset_root {PATH OF DATA STORAGE}

To fine-tune on a different dataset, switch the .yaml configuration file (e.g., from cfgs/ModelNet_models to cfgs/ScanObjectNN_models) and adjust its dataset paths as needed.

References

Citation

@article{hegde2023clip,
 title={CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition},
 author={Hegde, Deepti and Valanarasu, Jeya Maria Jose and Patel, Vishal M},
 journal={arXiv preprint arXiv:2303.11313},
 year={2023}
}