Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
Official implementation of 'Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding'.
[2023.5] We release 'ViewRefer3D' (ICCV 2023), a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities with LLMs.
[2023.9] We release 'Point-PEFT' (AAAI 2024), which adapts 3D pre-trained models to downstream tasks with only 1% of the parameters.
[2024.5] The results of Any2Point on ShapeNetPart will be released soon!
[2024.7] Any2Point has been accepted by ECCV 2024!
<p align="center"> <img src="Teaser_any.png" width="70%"/> <br> </p>

Introduction
Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance across a wide range of scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches remain limited due to the potential loss of spatial geometries and high computation costs. More importantly, their frameworks are mainly designed for 2D models and lack a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points with the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token a positional encoding paired with the pre-trained model, which avoids the 3D geometry loss caused by true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, promoting the semantic adaptation of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method.
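To make the 3D-to-2D virtual projection concrete, below is a minimal PyTorch sketch of the idea rather than the repository's actual implementation: each 3D token center is orthographically projected onto a few virtual view planes, mapped onto the frozen transformer's 2D patch grid, and the corresponding pre-trained positional embeddings are gathered and averaged. All names here (`virtual_2d_positional_encoding`, `pos_embed_2d`, `grid_size`, `views`) are illustrative assumptions.

```python
# Illustrative sketch only (not the repository's code): assign each 3D token a
# positional encoding by "virtually" projecting it onto 2D planes and indexing
# the frozen 2D positional-embedding table of the source transformer.
import torch

def virtual_2d_positional_encoding(xyz, pos_embed_2d, grid_size=14,
                                   views=((0, 1), (0, 2), (1, 2))):
    """xyz: (B, N, 3) 3D token centers normalized to [-1, 1].
    pos_embed_2d: (grid_size * grid_size, C) frozen positional table.
    Returns (B, N, C) encodings averaged over virtual orthographic views."""
    pe = 0.0
    for u, v in views:
        uv = xyz[..., [u, v]]                                  # keep two coordinates: (B, N, 2)
        idx = ((uv + 1) / 2 * (grid_size - 1)).round().long()  # map to patch indices
        idx = idx.clamp(0, grid_size - 1)
        flat = idx[..., 0] * grid_size + idx[..., 1]           # flatten to table index: (B, N)
        pe = pe + pos_embed_2d[flat]                           # gather frozen 2D PE: (B, N, C)
    return pe / len(views)
```

For a 1D source modality (e.g., a language transformer), the same idea applies with points projected onto virtual lines and looked up in the frozen 1D positional table; no real rendering is performed in either case.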
<div align="center"> <img src="Intro_any.png"/> </div>

Main Results
We report the pre-training modality (Pre-train), the number of learnable parameters (#Param), and classification accuracy (%) on the PB-T50-RS split of ScanObjectNN (SCAN.) and on ModelNet40 (MN.). * indicates utilizing the voting strategy.
Method | Pre-train | #Param(M) | SCAN.(%) | MN.(%) |
---|---|---|---|---|
PointNet | N/A | 3.5 | 68.0 | 89.2 |
PointNet++ | N/A | 1.5 | 77.9 | 90.7 |
DGCNN | N/A | 1.8 | 78.1 | 92.9 |
PointMLP | N/A | 12.6 | 85.4 | 94.1 |
Point-PN | N/A | 0.8 | 87.1 | 93.8 |
PointNeXt | N/A | 1.4 | 87.7 | 94.0 |
Point-BERT | 3D | 22.1 | 83.1 | 92.7 |
Point-MAE | 3D | 22.1 | 85.2 | 93.2 |
Point-M2AE | 3D | 15.3 | 86.4 | 93.4 |
P2P-HorNet | 2D | 1.2 | 89.3 | 94.0* |
ACT | 3D+2D | 22.1 | 88.2 | 93.7 |
I2P-MAE | 3D+2D | 12.9 | 90.1 | 93.7 |
ReCon | 3D+2D+Language | 43.6 | 90.6 | 94.1 |
Any2Point (Audio) | Audio | 0.8 | 87.0 | 92.7 |
Any2Point (2D) | 2D | 0.8 | 87.7 | 93.2 |
Any2Point (Language) | Language | 0.9 | 91.9 | 94.3 |
Ckpt Release
Real-world shape classification on the PB-T50-RS split of ScanObjectNN:
Method | Logs | Acc. | Ckpts |
---|---|---|---|
Any2Point-Lang-CLIP | Language_CLIP_Scan.log | 91.9% | Language_CLIP_Scan.pth |
Any2Point-Vision-DINOV2 | Vision_DINOV2_Scan.log | 87.7% | Vision_DINOV2_Scan.pth |
Any2Point-Audio-ImageBind | Audio_imagebind_scan.log | 87.0% | Audio_imagebind_scan.pth |
Synthetic shape classification on ModelNet40:
Method | Logs | Acc. | Ckpts |
---|---|---|---|
Any2Point-Lang-CLIP | Language_CLIP_ModelNet.log | 94.3% | Language_CLIP_ModelNet.pth |
Any2Point-Vision-DINOV2 | Vision_DINOV2_ModelNet.log | 93.2% | Vision_DINOV2_ModelNet.pth |
Any2Point-Audio-ImageBind | Audio_imagebind_ModelNet.log | 92.7% | Audio_imagebind_ModelNet.pth |
Get Started
Installation
Create a conda environment and install basic dependencies:
```bash
git clone https://github.com/Ivan-Tang-3D/Any2Point.git
cd Any2Point

conda create -n Any2Point python=3.7
conda activate Any2Point

# Install the corresponding versions of torch and torchvision
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch
conda install -c pyg pytorch-cluster pytorch-scatter pytorch-sparse -y
pip install torch-geometric==2.0

source install.sh
```
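A quick sanity check after installation (an illustrative snippet, not part of the repository) to confirm that the pinned versions import correctly and CUDA is visible:

```python
# Illustrative post-install check (not part of the repo).
import torch
import torchvision
import torch_geometric
import torch_cluster, torch_scatter, torch_sparse  # should import without errors

print(torch.__version__)            # expected: 1.10.1
print(torchvision.__version__)      # expected: 0.11.2
print(torch_geometric.__version__)  # expected: 2.0.x
print(torch.cuda.is_available())    # should be True with cudatoolkit 11.3
```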
Dataset
For pre-training and fine-tuning, please follow DATASET.md to prepare the ModelNet40, ScanObjectNN, and ShapeNetPart datasets, referring to Point-BERT. Specifically, put the unzipped folders under `data/`.
Training the language variant occupies only 26 GB of memory.
The final directory structure should be:
```
│Any2Point/
├──Any2Point_CLIP_Lang/
├──ckpts/
├──data/
│   ├──ModelNet/
│   ├──ScanObjectNN/
├──...
```
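An optional check (illustrative, not part of the repository) to confirm the datasets ended up in the expected locations before fine-tuning:

```python
# Illustrative layout check (not part of the repo); run from the Any2Point/ root.
import os

for d in ("data/ModelNet", "data/ScanObjectNN", "ckpts"):
    print(d, "found" if os.path.isdir(d) else "MISSING")
```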
Fine-tuning
Please download the CLIP_pre-train.pth, DINOV2_pre-train.pth and ImageBind_audio_pre-train.pth into the `ckpts/` folder.
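After downloading, a quick way to confirm a checkpoint file is intact is to load it on CPU (an illustrative snippet, assuming the files are standard PyTorch checkpoints):

```python
# Illustrative check (not part of the repo): load a downloaded checkpoint on CPU.
import torch

ckpt = torch.load("ckpts/CLIP_pre-train.pth", map_location="cpu")
# Most checkpoints are dictionaries; printing the top-level keys confirms the file loaded.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:5])
```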
For the PB-T50-RS split of ScanObjectNN, run:
Any2Point_CLIP_Lang:

```bash
cd Any2Point_CLIP_Lang
sh fine_tune.sh
```

Any2Point_DINOV2_Vision:

```bash
cd Any2Point_DINOV2_Vision
sh fine_tune.sh
```

Any2Point_ImageBind_audio:

```bash
cd Any2Point_ImageBind_audio
sh fine_tune.sh
```
For the ModelNet40, run:
Any2Point_CLIP_Lang:

```bash
cd Any2Point_clip_lang_modelnet
sh fine_tune.sh
```

Any2Point_DINOV2:

```bash
cd Any2Point_DINOV2_modelnet
sh fine_tune.sh
```

Any2Point_ImageBind:

```bash
cd Any2Point_ImageBind_Modelnet
sh fine_tune.sh
```
Citation
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.
```bibtex
@article{tang2024any2point,
  title={Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding},
  author={Tang, Yiwen and Liu, Jiaming and Wang, Dong and Wang, Zhigang and Zhang, Shanghang and Zhao, Bin and Li, Xuelong},
  journal={arXiv preprint arXiv:2404.07989},
  year={2024}
}
```
Acknowledgement
This repo benefits from Pix4Point, Point-NN, PointTransformerV2, and Openpoints. Thanks for their wonderful works.