<br/> <p align="center"> <h1 align="center"><a style="color:#61a5c2;">3D</a>-<a style="color:#94D2BD;">V</a><a style="color:#EE9B00;">L</a><a style="color:#CA6502;">A</a>: A 3D Vision-Language-Action Generative World Model</h1> <p align="center"> ICML 2024 </p> <p align="center"> <a href="https://haoyuzhen.com">Haoyu Zhen</a>, <a href="">Xiaowen Qiu</a>, <a href="https://peihaochen.github.io">Peihao Chen</a>, <a href="https://github.com/Yang-Chincheng">Jincheng Yang</a>, <a href="https://cakeyan.github.io">Xin Yan</a>, <a href="https://yilundu.github.io">Yilun Du</a>, <a href="https://evelinehong.github.io">Yining Hong</a>, <a href="https://people.csail.mit.edu/ganchuang">Chuang Gan</a> </p> <p align="center"> <a href="https://arxiv.org/abs/2403.09631"> <img src='https://img.shields.io/badge/Paper-PDF-red?style=flat&logo=arXiv&logoColor=red' alt='Paper PDF'> </a> <a href='https://vis-www.cs.umass.edu/3dvla' style='padding-left: 0.5rem;'> <img src='https://img.shields.io/badge/Project-Page-blue?style=flat&logo=Google%20chrome&logoColor=blue' alt='Project Page'> </a> </p> </p> <!-- TABLE OF CONTENTS --> <details open="open" style='padding: 10px; border-radius:5px 30px 30px 5px; border-style: solid; border-width: 1px;'> <summary>Table of Contents</summary> <ol> <li> <a href="#method">Method</a> </li> <li> <a href="#installation">Installation</a> </li> <li> <a href="#embodied-diffusion-models">Embodied Diffusion Models</a> <ul> <li><a href="#goal-image-generation">Goal Image Generation</a></li> </ul> <ul> <li><a href="#goal-point-cloud-generation">Goal Point Cloud Generation</a></li> </ul> </li> <li> <a href="#multimodal-large-language-model">Multimodal Large Language Model</a> <ul> <li><a href="#pretrain-3d-vla">Pretrain 3D-VLA</a></li> </ul> </li> <li> <a href="#citation">Citation</a> </li> <li> <a href="#acknowledgement">Acknowledgement</a> </li> </ol> </details>

News 📢

Method

3D-VLA is a framework that connects vision-language-action (VLA) models to the 3D physical world. Unlike traditional 2D models, 3D-VLA integrates 3D perception, reasoning, and action through a generative world model, similar to human cognitive processes. It builds on 3D-LLM and uses interaction tokens to engage with the environment. Embodied diffusion models are trained and aligned with the LLM to predict goal images and point clouds.

<p align="center"> <img src="docs/method.png" alt="Logo" width="80%"> </p>

Installation

conda create -n 3dvla python=3.9
conda activate 3dvla
pip install -r requirements.txt

We will update the file structure and the installation process in the future.

We provide a model card for the 3D-VLA model. The model card includes the task description, model description, and training datasets.

Embodied Diffusion Models

Goal Image Generation

Train the goal image latent diffusion model with the following command. To include depth information, add --include_depth to the command in the train_ldm.sh file.

bash launcher/train_ldm.sh [NUM_GPUS] [NUM_NODES]

Then generate the goal images. The results are saved in the lavis/output/LDM/pix2pix/results folder. The --include_depth flag, shown in parentheses, is optional.

python inference_ldm_goal_image.py \
    --ckpt_folder lavis/output/LDM/pix2pix/runs (--include_depth)

We have released our models on Hugging Face: goal-image and goal-depth. Run a simple demo with the following commands:

python inference_ldm_goal_image.py \
    --ckpt_folder anyezhy/3dvla-diffusion \
    --image docs/cans.png --text "knock pepsi can over" \
    --save_path result.png

python inference_ldm_goal_image.py \
    --ckpt_folder anyezhy/3dvla-diffusion-depth --include_depth \
    --image docs/bottle.png --text "move water bottle near sponge" \
    --save_path result.png
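
For scripted use, the demo invocations above can be assembled programmatically. The following is a minimal sketch and not part of the repository: `build_goal_image_cmd` is a hypothetical helper name, and it relies only on the flags that appear in the commands above (`--ckpt_folder`, `--include_depth`, `--image`, `--text`, `--save_path`).

```python
import shlex


def build_goal_image_cmd(ckpt_folder, image, text, save_path, include_depth=False):
    """Build the argv list for inference_ldm_goal_image.py.

    Hypothetical helper: it only uses flags shown in this README.
    """
    cmd = ["python", "inference_ldm_goal_image.py", "--ckpt_folder", ckpt_folder]
    if include_depth:
        # Flag used by the depth demo in this README.
        cmd.append("--include_depth")
    cmd += ["--image", image, "--text", text, "--save_path", save_path]
    return cmd


if __name__ == "__main__":
    cmd = build_goal_image_cmd(
        ckpt_folder="anyezhy/3dvla-diffusion",
        image="docs/cans.png",
        text="knock pepsi can over",
        save_path="result.png",
    )
    # Print a copy-pasteable shell command; quoting is handled by shlex.
    print(shlex.join(cmd))
    # To run it, call subprocess.run(cmd, check=True) from the repository root.
```

Passing the command as a list avoids shell-quoting problems with instructions that contain spaces.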

Goal Point Cloud Generation

The goal point cloud diffusion model supports xFormers. Installing it accelerates training and inference.
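
Before training, it can be useful to confirm that xFormers is importable in the active environment. The check below is a minimal sketch using only the Python standard library; `xformers_available` is a hypothetical helper name, and how the repository itself detects or enables xFormers is not covered here.

```python
import importlib.util


def xformers_available():
    """Return True if the `xformers` package can be found in this environment."""
    return importlib.util.find_spec("xformers") is not None


if __name__ == "__main__":
    if xformers_available():
        print("xFormers found.")
    else:
        # `xformers` is the package name on PyPI; pick a build that matches
        # your installed PyTorch and CUDA versions.
        print("xFormers not found; try `pip install xformers`.")
```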

Train the goal point cloud diffusion model (finetuning the pretrained Point-E model).

bash launcher/train_pe.sh [NUM_GPUS] [NUM_NODES]

We have released our model on Hugging Face: goal-point-cloud. Run inference for the goal point cloud with the following commands. To use multiple GPUs, use torchrun --nproc_per_node=[NUM_GPUS] --master_port=[PORT] inference_pe_goal_pcd.py instead.

python inference_pe_goal_pcd.py \
  --input_npy docs/point_cloud.npy --text "close bottom drawer" \
  --output_dir SAVE_PATH

python inference_pe_goal_pcd.py \
  --input_npy docs/money.npy \
  --text "put the money away in the safe on the bottom shelf"
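
The choice between the single-GPU `python` launcher and the multi-GPU `torchrun` launcher can be captured in a small wrapper. This is a sketch under stated assumptions, not repository code: `build_goal_pcd_cmd` is a hypothetical name, the flags are the ones shown above, and the default port 29500 is torchrun's own default rather than a value from this README.

```python
import shlex


def build_goal_pcd_cmd(input_npy, text, output_dir=None, num_gpus=1, master_port=29500):
    """Build the argv list for inference_pe_goal_pcd.py (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus > 1:
        # Multi-GPU form given in this README.
        launcher = [
            "torchrun",
            f"--nproc_per_node={num_gpus}",
            f"--master_port={master_port}",
        ]
    else:
        launcher = ["python"]
    cmd = launcher + ["inference_pe_goal_pcd.py", "--input_npy", input_npy, "--text", text]
    if output_dir is not None:
        # --output_dir is omitted in the second README example, so it is optional here.
        cmd += ["--output_dir", output_dir]
    return cmd


if __name__ == "__main__":
    single = build_goal_pcd_cmd("docs/point_cloud.npy", "close bottom drawer", "SAVE_PATH")
    multi = build_goal_pcd_cmd("docs/money.npy",
                               "put the money away in the safe on the bottom shelf",
                               num_gpus=4)
    print(shlex.join(single))
    print(shlex.join(multi))
```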

Multimodal Large Language Model

Pretrain 3D-VLA

Train our 3D-VLA model:

bash launcher/train_llm.sh [NUM_GPUS] [NUM_NODES]
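
All three training stages in this README share the same launcher interface: a shell script followed by `[NUM_GPUS]` and `[NUM_NODES]`. The sketch below wraps that pattern. The stage keys (`goal_image`, `goal_point_cloud`, `llm`) and the function name are hypothetical; only the script paths and the argument order come from this README.

```python
import shlex

# Launch scripts named in this README, keyed by hypothetical stage names.
LAUNCHERS = {
    "goal_image": "launcher/train_ldm.sh",
    "goal_point_cloud": "launcher/train_pe.sh",
    "llm": "launcher/train_llm.sh",
}


def build_train_cmd(stage, num_gpus, num_nodes):
    """Return the argv list `bash <script> NUM_GPUS NUM_NODES` for a stage."""
    if stage not in LAUNCHERS:
        raise KeyError(f"unknown stage {stage!r}; expected one of {sorted(LAUNCHERS)}")
    if num_gpus < 1 or num_nodes < 1:
        raise ValueError("num_gpus and num_nodes must be positive integers")
    return ["bash", LAUNCHERS[stage], str(num_gpus), str(num_nodes)]


if __name__ == "__main__":
    # Example: launch 3D-VLA pretraining with 8 GPUs on 1 node.
    print(shlex.join(build_train_cmd("llm", 8, 1)))
```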

Citation

@article{zhen20243dvla,
  author = {Zhen, Haoyu and Qiu, Xiaowen and Chen, Peihao and Yang, Jincheng and Yan, Xin and Du, Yilun and Hong, Yining and Gan, Chuang},
  title = {3D-VLA: A 3D Vision-Language-Action Generative World Model},
  journal = {arXiv preprint arXiv:2403.09631},
  year = {2024},
}

Acknowledgement

Here we would like to thank the following resources for their great work: