# Chat-3D v2

This is an official repo for paper "Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers". [paper]
## News

- [2024.04] 🔥 A refined implementation of Chat-3D v2 is released. The old version (v2.0) has been archived in branch `v2.0`; this main branch now hosts the new version (v2.1).
- [2024.01] Updated the training guide for grounding on ScanRefer.
- [2023.12] Code release. The main training architecture is based on our former work Chat-3D.
## 🔥 v2.1 vs v2.0

- Performance comparison

  | Version | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | ScanQA CIDEr | ScanQA B-4 | Scan2Cap CIDEr@0.5 | Scan2Cap B-4@0.5 | Multi3dRefer F1@0.25 | Multi3dRefer F1@0.5 | SQA3D EM |
  |---------|--------------------|-------------------|--------------|------------|--------------------|------------------|----------------------|---------------------|----------|
  | v2.0    | 35.9               | 30.4              | 77.1         | 7.3        | 28.1               | 15.5             | -                    | -                   | -        |
  | v2.1    | 42.5               | 38.4              | 87.6         | 14.0       | 63.9               | 31.8             | 45.1                 | 41.6                | 54.7     |

  All results of v2.1 are evaluated with the same model, without finetuning on specific tasks.
- Main changes
  - LLM backbone: Vicuna v0 → Vicuna v1.5 + LoRA finetuning
  - Training scheme: three-stage training → one-stage joint training
  - Segmentor: PointGroup → Mask3D
  - Code optimization:
    - batch size: 1 → 32
    - simpler training and evaluation process
## 🔨 Preparation

- Prepare the environment (different from v2.0):

  ```shell
  conda create -n chat-3d-v2 python=3.9.17
  conda activate chat-3d-v2
  conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
  pip install -r requirements.txt
  ```
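After installation, the pinned versions can be sanity-checked before training. This small helper is a sketch of ours, not part of the repo:

```python
# Sanity-check the versions pinned by the conda/pip commands above.
# This helper is hypothetical and not part of the Chat-3D v2 codebase.
from importlib import metadata


def check_versions(expected):
    """Return (package, expected, found) triples for any mismatches."""
    problems = []
    for pkg, want in expected.items():
        try:
            have = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            have = None
        if have is None or not have.startswith(want):
            problems.append((pkg, want, have))
    return problems


if __name__ == "__main__":
    expected = {
        "torch": "2.2.1",
        "torchvision": "0.17.1",
        "torchaudio": "2.2.1",
    }
    for pkg, want, have in check_versions(expected):
        print(f"{pkg}: expected {want}, found {have}")
```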
- Download LLM backbone:
  - We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.
  - Change the `llama_model_path` in `config.py` to the location of `vicuna-7b-v1.5`.
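The change amounts to a single assignment; a minimal sketch of the relevant `config.py` line, where the placeholder path is an assumption to be replaced with your local download location:

```python
# config.py (excerpt): point llama_model_path at the downloaded weights.
# "/path/to/vicuna-7b-v1.5" is a placeholder for your local directory.
llama_model_path = "/path/to/vicuna-7b-v1.5"
```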
- Annotations and extracted features:

  Please follow the instructions in preprocess.
## 🤖 Training and Inference
- Training

  - Modify `run.sh`:

    ```shell
    train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
    val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    evaluate=False
    ```

    <details>
    <summary>Explanation of "train_tag" and "val_tag"</summary>

    - Use `#` to separate different datasets.
    - Datasets: you can try different combinations of training datasets or add customized datasets.

    </details>
  - Run:

    ```shell
    bash scripts/run.sh
    ```
  - Brief training info:

    | Batch Size | GPUs     | VRAM Usage per GPU | Training Time | ckpt         |
    |------------|----------|--------------------|---------------|--------------|
    | 32         | 4 × A100 | ~70 GB             | ~8 hours      | Google Drive |
    | 1          | 1 × A100 | ~28 GB             | ~3 days       | -            |
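How the `#`-separated tags select datasets can be sketched in a few lines; the `parse_tags` helper below is our own illustration, not the repo's actual loader:

```python
# Hypothetical illustration of how "#"-separated tags select datasets;
# the tag values mirror run.sh, but parse_tags is our own sketch.
def parse_tags(tag_string):
    """Split a '#'-separated tag string into a list of dataset names."""
    return [t for t in tag_string.split("#") if t]


train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
val_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref"

train_sets = parse_tags(train_tag)
val_sets = parse_tags(val_tag)
# Every evaluation dataset also appears in the training mix.
assert set(val_sets) <= set(train_sets)
```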
- Inference

  - Modify `run.sh` (we provide the pretrained checkpoint on Google Drive):

    ```shell
    val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    evaluate=True
    pretrained_path="/path/to/pretrained_model.pth"
    ```

  - Run:

    ```shell
    bash scripts/run.sh
    ```
## 📄 Citation

If you find this project useful in your research, please consider citing:

```bibtex
@article{huang2023chat,
  title={Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers},
  author={Huang, Haifeng and Wang, Zehan and Huang, Rongjie and Liu, Luping and Cheng, Xize and Zhao, Yang and Jin, Tao and Zhao, Zhou},
  journal={arXiv preprint arXiv:2312.08168},
  year={2023}
}

@article{wang2023chat,
  title={Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes},
  author={Wang, Zehan and Huang, Haifeng and Zhao, Yang and Zhang, Ziang and Zhao, Zhou},
  journal={arXiv preprint arXiv:2308.08769},
  year={2023}
}
```
Stay tuned for our project. 🔥

If you have any questions or suggestions, feel free to drop us an email (huanghaifeng@zju.edu.cn, wangzehan01@zju.edu.cn) or open an issue.
## 😊 Acknowledgement

Thanks to the open source of the following projects:

- 3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer
- 3D Segmentors: PointGroup, Mask3D
- Multi-modal LLMs: VideoChat, LEO
- 3D Expert Models: vil3dref