Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
Official implementation of 'Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding'.
[2023.5] We release 'ViewRefer3D' (ICCV 2023), a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities with LLMs.
[2023.9] We release 'Point-PEFT' (AAAI 2024), which adapts 3D pre-trained models to downstream tasks with only 1% of the parameters.
[2024.5] The results of Any2Point on ShapeNetPart will be released soon!
[2024.7] Any2Point has been accepted by ECCV 2024!
<p align="center"> <img src="Teaser_any.png" width="70%"/> <br> </p>

Introduction
Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance across a wide range of scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches remain limited due to the potential loss of spatial geometries and high computation costs. More importantly, their frameworks are mainly designed for 2D models and lack a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points with the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token a positional encoding paired with the pre-trained model, which avoids the 3D geometry loss caused by true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, promoting the semantic adaptation of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method.
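To make the 3D-to-2D virtual projection concrete, below is a minimal PyTorch sketch of the idea rather than the repository's actual implementation: each 3D token center is orthographically projected onto a few virtual view planes, mapped onto the frozen transformer's 2D patch grid, and the corresponding pre-trained positional embeddings are gathered and averaged. All names here (`virtual_2d_positional_encoding`, `pos_embed_2d`, `grid_size`, `views`) are illustrative assumptions.

```python
# Illustrative sketch only (not the repository's code): assign each 3D token a
# positional encoding by "virtually" projecting it onto 2D planes and indexing
# the frozen 2D positional-embedding table of the source transformer.
import torch

def virtual_2d_positional_encoding(xyz, pos_embed_2d, grid_size=14,
                                   views=((0, 1), (0, 2), (1, 2))):
    """xyz: (B, N, 3) 3D token centers normalized to [-1, 1].
    pos_embed_2d: (grid_size * grid_size, C) frozen positional table.
    Returns (B, N, C) encodings averaged over virtual orthographic views."""
    pe = 0.0
    for u, v in views:
        uv = xyz[..., [u, v]]                                  # keep two coordinates: (B, N, 2)
        idx = ((uv + 1) / 2 * (grid_size - 1)).round().long()  # map to patch indices
        idx = idx.clamp(0, grid_size - 1)
        flat = idx[..., 0] * grid_size + idx[..., 1]           # flatten to table index: (B, N)
        pe = pe + pos_embed_2d[flat]                           # gather frozen 2D PE: (B, N, C)
    return pe / len(views)
```

For a 1D source modality (e.g., a language transformer), the same idea applies with points projected onto virtual lines and looked up in the frozen 1D positional table; no real rendering is performed in either case.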
<div align="center"> <img src="Intro_any.png"/> </div>

Main Results
We report the pre-training modality (Pre-train), the number of learnable parameters (#Param), and classification accuracy (%) on the PB-T50-RS split of ScanObjectNN (SCAN.) and on ModelNet40 (MN.). * indicates utilizing the voting strategy.
Method | Pre-train | #Param(M) | SCAN.(%) | MN.(%) |
---|---|---|---|---|
PointNet | N/A | 3.5 | 68.0 | 89.2 |
PointNet++ | N/A | 1.5 | 77.9 | 90.7 |
DGCNN | N/A | 1.8 | 78.1 | 92.9 |
PointMLP | N/A | 12.6 | 85.4 | 94.1 |
Point-PN | N/A | 0.8 | 87.1 | 93.8 |
PointNeXt | N/A | 1.4 | 87.7 | 94.0 |
Point-BERT | 3D | 22.1 | 83.1 | 92.7 |
Point-MAE | 3D | 22.1 | 85.2 | 93.2 |
Point-M2AE | 3D | 15.3 | 86.4 | 93.4 |
P2P-HorNet | 2D | 1.2 | 89.3 | 94.0* |
ACT | 3D+2D | 22.1 | 88.2 | 93.7 |
I2P-MAE | 3D+2D | 12.9 | 90.1 | 93.7 |
ReCon | 3D+2D+Language | 43.6 | 90.6 | 94.1 |
Any2Point (Audio) | Audio | 0.8 | 87.0 | 92.7 |
Any2Point (2D) | 2D | 0.8 | 87.7 | 93.2 |
Any2Point (Language) | Language | 0.9 | 91.9 | 94.3 |
Ckpt Release
Real-world shape classification on the PB-T50-RS split of ScanObjectNN:
Method | Logs | Acc. | Ckpts |
---|---|---|---|
Any2Point-Lang-CLIP | Language_CLIP_Scan.log | 91.9% | Language_CLIP_Scan.pth |
Any2Point-Vision-DINOV2 | Vision_DINOV2_Scan.log | 87.7% | Vision_DINOV2_Scan.pth |
Any2Point-Audio-ImageBind | Audio_imagebind_scan.log | 87.0% | Audio_imagebind_scan.pth |
Synthetic shape classification on ModelNet40:
Method | Logs | Acc. | Ckpts |
---|---|---|---|
Any2Point-Lang-CLIP | Language_CLIP_ModelNet.log | 94.3% | Language_CLIP_ModelNet.pth |
Any2Point-Vision-DINOV2 | Vision_DINOV2_ModelNet.log | 93.2% | Vision_DINOV2_ModelNet.pth |
Any2Point-Audio-ImageBind | Audio_imagebind_ModelNet.log | 92.7% | Audio_imagebind_ModelNet.pth |
Get Started
Installation
Create a conda environment and install basic dependencies:
```bash
git clone https://github.com/Ivan-Tang-3D/Any2Point.git
cd Any2Point

conda create -n Any2Point python=3.7
conda activate Any2Point

# Install the corresponding versions of torch and torchvision
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch
conda install -c pyg pytorch-cluster pytorch-scatter pytorch-sparse -y
pip install torch-geometric==2.0

source install.sh
```
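A quick sanity check after installation (an illustrative snippet, not part of the repository) to confirm that the pinned versions import correctly and CUDA is visible:

```python
# Illustrative post-install check (not part of the repo).
import torch
import torchvision
import torch_geometric
import torch_cluster, torch_scatter, torch_sparse  # should import without errors

print(torch.__version__)            # expected: 1.10.1
print(torchvision.__version__)      # expected: 0.11.2
print(torch_geometric.__version__)  # expected: 2.0.x
print(torch.cuda.is_available())    # should be True with cudatoolkit 11.3
```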
Dataset
For pre-training and fine-tuning, please follow DATASET.md to prepare the ModelNet40, ScanObjectNN, and ShapeNetPart datasets, referring to Point-BERT. Specifically, put the unzipped folders under `data/`.
Training the language variant occupies only 26 GB of memory.
The final directory structure should be:
```
│Any2Point/
├──Any2Point_CLIP_Lang/
├──ckpts/
├──data/
│   ├──ModelNet/
│   ├──ScanObjectNN/
├──...
```
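An optional check (illustrative, not part of the repository) to confirm the datasets ended up in the expected locations before fine-tuning:

```python
# Illustrative layout check (not part of the repo); run from the Any2Point/ root.
import os

for d in ("data/ModelNet", "data/ScanObjectNN", "ckpts"):
    print(d, "found" if os.path.isdir(d) else "MISSING")
```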
Fine-tuning
Please download the CLIP_pre-train.pth, DINOV2_pre-train.pth and ImageBind_audio_pre-train.pth into the `ckpts/` folder.
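After downloading, a quick way to confirm a checkpoint file is intact is to load it on CPU (an illustrative snippet, assuming the files are standard PyTorch checkpoints):

```python
# Illustrative check (not part of the repo): load a downloaded checkpoint on CPU.
import torch

ckpt = torch.load("ckpts/CLIP_pre-train.pth", map_location="cpu")
# Most checkpoints are dictionaries; printing the top-level keys confirms the file loaded.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:5])
```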
For the PB-T50-RS split of ScanObjectNN, run:
Any2Point_CLIP_Lang:

```bash
cd Any2Point_CLIP_Lang
sh fine_tune.sh
```

Any2Point_DINOV2_Vision:

```bash
cd Any2Point_DINOV2_Vision
sh fine_tune.sh
```

Any2Point_ImageBind_audio:

```bash
cd Any2Point_ImageBind_audio
sh fine_tune.sh
```
For the ModelNet40, run:
Any2Point_CLIP_Lang:

```bash
cd Any2Point_clip_lang_modelnet
sh fine_tune.sh
```

Any2Point_DINOV2:

```bash
cd Any2Point_DINOV2_modelnet
sh fine_tune.sh
```

Any2Point_ImageBind:

```bash
cd Any2Point_ImageBind_Modelnet
sh fine_tune.sh
```
Citation
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.
```bibtex
@article{tang2024any2point,
  title={Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding},
  author={Tang, Yiwen and Liu, Jiaming and Wang, Dong and Wang, Zhigang and Zhang, Shanghang and Zhao, Bin and Li, Xuelong},
  journal={arXiv preprint arXiv:2404.07989},
  year={2024}
}
```
Acknowledgement
This repo benefits from Pix4Point, Point-NN, PointTransformerV2, and Openpoints. Thanks for their wonderful works.