Point-Bind & Point-LLM: Aligning 3D with Multi-modality

Official implementation of 'Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following'.

News

Point-Bind

With a joint embedding space for 3D and multiple modalities, our Point-Bind enables four promising applications:

<p align="center"> <img src="Applications.png" width="90%"> <br> </p>

Point-LLM

Using Point-Bind, we introduce Point-LLM, the first 3D LLM that responds to instructions conditioned on 3D point clouds, supporting both English and Chinese. Our Point-LLM exhibits two main characteristics:

<p align="center"> <img src="3D Q&A.png" width="100%"> <br> </p>

The overall pipeline of Point-LLM is as follows. We efficiently fine-tune LLaMA 7B for 3D instruction-following capability, following LLaMA-Adapter and ImageBind-LLM:

<p align="center"> <img src="Pipeline.png" width="100%"> <br> </p>

Getting Started

Please refer to Install.md for preparing environments and pre-trained checkpoints.

3D with Multi-modalities

We provide simple inference scripts to verify the embedding alignment for 3D and other modalities in Point-Bind.

Compare 3D with Text

Run python demo_text_3d.py with input:

text_list = ['An airplane', 'A car', 'A toilet']
point_paths = ["examples/airplane.pt", "examples/car.pt", "examples/toilet.pt"]

The script outputs the similarity matrix:

Text x Point Cloud
tensor([[1.0000e+00, 6.5731e-09, 6.5958e-10],
        [1.7373e-06, 9.9998e-01, 1.7816e-05],
        [2.1133e-10, 3.4070e-08, 1.0000e+00]])
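For reference, below is a minimal sketch of how such a similarity matrix can be computed from two sets of embeddings. The temperature scaling and the `encode_text` / `encode_point_cloud` helpers in the usage comment are illustrative assumptions, not Point-Bind's actual API:

```python
# Minimal sketch: pairwise similarity between text and point-cloud embeddings.
import torch
import torch.nn.functional as F

def similarity_matrix(text_feats: torch.Tensor, point_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every (text, point cloud) pair,
    normalized to a distribution over point clouds for each text prompt."""
    text_feats = F.normalize(text_feats, dim=-1)
    point_feats = F.normalize(point_feats, dim=-1)
    logits = 100.0 * text_feats @ point_feats.T  # CLIP-style temperature (assumed value)
    return logits.softmax(dim=-1)

# Hypothetical usage with pre-computed embeddings of shape (3, d):
# text_feats = encode_text(['An airplane', 'A car', 'A toilet'])
# point_feats = encode_point_cloud(["examples/airplane.pt", "examples/car.pt", "examples/toilet.pt"])
# print(similarity_matrix(text_feats, point_feats))
```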

Compare 3D with Audio

Run python demo_audio_3d.py with input:

audio_paths = ["examples/airplane_audio.wav", "examples/car_audio.wav", "examples/toilet_audio.wav"]
point_paths = ["examples/airplane.pt", "examples/car.pt", "examples/toilet.pt"]

The script outputs the similarity matrix:

Audio x Point Cloud: 
tensor([[0.9907, 0.0041, 0.0051],
        [0.0269, 0.9477, 0.0254],
        [0.0057, 0.0170, 0.9773]])
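Conceptually, the audio comparison follows the same recipe as the sketch above, with the text embeddings replaced by audio embeddings from an audio encoder in the shared embedding space; each row again gives the similarity of one audio clip to every point cloud.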

3D Zero-shot Tasks

For 3D zero-shot classification, please follow DATASET.md to download ModelNet40 and put it under data/modelnet40_normal_resampled/. Then run bash scripts/pointbind_i2pmae.sh or bash scripts/pointbind_pointbert.sh for Point-Bind with the I2P-MAE or Point-BERT encoder, respectively.
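Under the hood, zero-shot classification reduces to matching each point-cloud embedding against text embeddings of prompted category names. Below is a minimal sketch assuming pre-computed embeddings; the `encode_text` / `encode_point_cloud` helpers and the prompt template are hypothetical placeholders, not the exact ones used in the scripts:

```python
# Minimal sketch: prompt-based zero-shot classification with shared embeddings.
import torch
import torch.nn.functional as F

def zero_shot_classify(point_feats: torch.Tensor, class_feats: torch.Tensor) -> torch.Tensor:
    """Return the predicted class index for each point cloud."""
    point_feats = F.normalize(point_feats, dim=-1)
    class_feats = F.normalize(class_feats, dim=-1)
    logits = point_feats @ class_feats.T  # (num_shapes, num_classes)
    return logits.argmax(dim=-1)

# Hypothetical usage:
# categories = ["airplane", "bathtub", "bed"]  # ModelNet40 class names
# class_feats = encode_text([f"a point cloud of a {c}" for c in categories])
# point_feats = encode_point_cloud(test_point_clouds)
# preds = zero_shot_classify(point_feats, class_feats)
```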

Zero-shot classification accuracy comparison:

| Model | Encoder | ModelNet40 (%) |
| :-: | :-: | :-: |
| PointCLIP | 2D CLIP | 20.2 |
| ULIP | Point-BERT | 60.4 |
| PointCLIP V2 | 2D CLIP | 64.2 |
| ULIP 2 | Point-BERT | 66.4 |
| Point-Bind | Point-BERT | 76.3 |
| Point-Bind | I2P-MAE | 78.0 |

Inference & Demo for Point-LLM

Setup

Inference

Demo

Try out our web demo, which incorporates multiple modalities, including 3D point clouds, and is supported by ImageBind-LLM.

Contributors

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Peng Gao

Related Work

Other excellent works that incorporate 3D point clouds into LLMs:

Contact

If you have any questions about this project, please feel free to contact zhangrenrui@pjlab.org.cn and zyguo@cse.cuhk.edu.hk.