# VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

[Project Page] [Paper] [Video]

Wenlong Huang<sup>1</sup>, Chen Wang<sup>1</sup>, Ruohan Zhang<sup>1</sup>, Yunzhu Li<sup>1,2</sup>, Jiajun Wu<sup>1</sup>, Li Fei-Fei<sup>1</sup>

<sup>1</sup>Stanford University, <sup>2</sup>University of Illinois Urbana-Champaign

<img src="media/teaser.gif" width="550">

This is the official demo code for VoxPoser, a method that uses large language models and vision-language models to synthesize trajectories for manipulation tasks in a zero-shot manner.

In this repo, we provide the implementation of VoxPoser in RLBench as its task diversity best resembles our real-world setup. Note that VoxPoser is a zero-shot method that does not require any training data. Therefore, the main purpose of this repo is to provide a demo implementation rather than an evaluation benchmark.
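To give a rough intuition for the "composable 3D value maps" in the title, here is a minimal, self-contained sketch. It is not the code in this repo: the voxel grid size, the Gaussian map shapes, and all names below are hypothetical. It only illustrates the idea of composing an attraction ("affordance") map and a repulsion ("avoidance") map over a voxel grid and greedily picking the highest-value voxel as the next waypoint; in VoxPoser itself, such maps are grounded in language and perception rather than hard-coded.

```python
import numpy as np

def gaussian_map(grid_shape, center, sigma):
    """Dense 3D map that peaks at `center` (voxel coords) and decays with distance."""
    axes = np.meshgrid(*[np.arange(s) for s in grid_shape], indexing="ij")
    coords = np.stack(axes, axis=-1).astype(np.float64)  # shape (X, Y, Z, 3)
    dist_sq = ((coords - np.asarray(center)) ** 2).sum(axis=-1)
    return np.exp(-dist_sq / (2.0 * sigma ** 2))

grid_shape = (50, 50, 50)  # hypothetical voxel grid over the workspace

# "Affordance" map: high value near the entity the instruction refers to.
affordance = gaussian_map(grid_shape, center=(35, 20, 10), sigma=6.0)

# "Avoidance" map: high value near something the instruction says to stay away from.
avoidance = gaussian_map(grid_shape, center=(30, 22, 10), sigma=4.0)

# Compose the maps into a single value map over the voxel grid.
value_map = affordance - 2.0 * avoidance

# Greedily pick the highest-value voxel as the next waypoint for the end-effector.
next_waypoint = np.unravel_index(np.argmax(value_map), grid_shape)
print("next waypoint (voxel coords):", next_waypoint)
```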

If you find this work useful in your research, please cite using the following BibTeX:

@article{huang2023voxposer,
  title={VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models},
  author={Huang, Wenlong and Wang, Chen and Zhang, Ruohan and Li, Yunzhu and Wu, Jiajun and Fei-Fei, Li},
  journal={arXiv preprint arXiv:2307.05973},
  year={2023}
}

## Setup Instructions

Note that this codebase is best run with a display. For running in headless mode, refer to the instructions in RLBench.

conda create -n voxposer-env python=3.9
conda activate voxposer-env
pip install -r requirements.txt
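
Since the demo runs manipulation tasks in RLBench, the simulation backend (RLBench, which is built on PyRep and CoppeliaSim) must also be available in the environment; see RLBench's documentation for those steps. As a quick sanity check that the stack imports, something like the following can be used (this snippet is only a suggestion, not part of this repo):

```python
# Quick sanity check: verify that the RLBench simulation stack is importable
# in the activated conda environment.
from pyrep import PyRep                       # PyRep wraps the CoppeliaSim simulator
from rlbench.environment import Environment   # RLBench task environments

print("PyRep and RLBench imported successfully.")
```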

## Running Demo

Demo code is at src/playground.ipynb. Instructions can be found in the notebook.

## Code Structure

Core to VoxPoser:

Environment and utilities:

## Acknowledgments