# LLaRA: Large Language and Robotics Assistant

**LLaRA: Supercharging Robot Learning Data for Vision-Language Policy** [[arXiv](https://arxiv.org/abs/2406.20095)]
Xiang Li<sup>1</sup>, Cristina Mata<sup>1</sup>, Jongwoo Park<sup>1</sup>, Kumara Kahatapitiya<sup>1</sup>, Yoo Sung Jang<sup>1</sup>, Jinghuan Shang<sup>1</sup>, Kanchana Ranasinghe<sup>1</sup>, Ryan Burgert<sup>1</sup>, Mu Cai<sup>2</sup>, Yong Jae Lee<sup>2</sup>, and Michael S. Ryoo<sup>1</sup>
<sup>1</sup>Stony Brook University <sup>2</sup>University of Wisconsin-Madison
<p float="left"> <img src="assets/llara-vid1.gif" width="49%" /> <img src="assets/llara-vid2.gif" width="49%" /> </p>

## Installation
- **Set Up Python Environment:**
  Follow the instructions to install the same Python environment as used by LLaVA.

  ```bash
  conda create -n llara python=3.10 -y
  conda activate llara
  conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
  conda install cuda=12.1 cuda-compiler=12.1 cuda-nvcc=12.1 cuda-version=12.1 -c nvidia
  ```
- **Install Revised LLaVA:**
  Navigate to `train-llava` in this repo and install the llava package there:

  ```bash
  cd train-llava && pip install -e ".[train]"
  pip install flash-attn --no-build-isolation
  ```
- **Install VIMABench:**
  Complete the setup for VIMABench. A quick environment sanity check is sketched after this list.

  ```bash
  git clone https://github.com/vimalabs/VimaBench && cd VimaBench
  pip install -e .
  ```
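Once the three steps above are done, a short check can confirm that everything installed into the same environment. This is a minimal sketch, not part of the repo; the import names `llava` and `vima_bench` are assumptions based on the packages installed above.

```python
# Minimal environment sanity check (not part of the repo).
# Assumes the packages installed above expose the import names `llava` and `vima_bench`.
import torch

print("PyTorch:", torch.__version__)               # expected: 2.1.2
print("CUDA available:", torch.cuda.is_available())

import llava        # installed from train-llava
import vima_bench   # installed from VimaBench

print("llava and vima_bench imported without errors")
```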
## Demo
- **Download the Pretrained Model:**
  Download the following model to `./checkpoints/`:

  - llava-1.5-7b-D-inBC + Aux(B) trained on VIMA-80k (Hugging Face)

  More models are available in the Model Zoo. A sketch for fetching a checkpoint programmatically follows this list.
- **Run the evaluation:**

  ```bash
  cd eval

  # evaluate the model with oracle object detector
  python3 eval-llara.py D-inBC-AuxB-VIMA-80k --model-path ../checkpoints/llava-1.5-7b-llara-D-inBC-Aux-B-VIMA-80k --prompt-mode hso

  # the results will be saved to ../results/[hso]D-inBC-AuxB-VIMA-80k.json
  ```
- **Check the results:** Please refer to `llara-result.ipynb`.
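If you prefer to fetch the checkpoint from a script rather than the web UI, the following is a hedged sketch using `huggingface_hub`. The repository id below is a placeholder, not the real one; substitute the actual Hugging Face repo linked above (or in the Model Zoo).

```python
# Hedged sketch (not part of the repo): download a checkpoint into ./checkpoints/.
# The repo_id is a placeholder -- replace it with the actual Hugging Face repo from the Model Zoo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<hf-user>/llava-1.5-7b-llara-D-inBC-Aux-B-VIMA-80k",  # placeholder id
    local_dir="./checkpoints/llava-1.5-7b-llara-D-inBC-Aux-B-VIMA-80k",
)
```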
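To take a quick look at the saved evaluation output outside the notebook, a minimal sketch is shown below. The JSON schema is whatever `eval-llara.py` writes, so `llara-result.ipynb` remains the authoritative way to analyze the results.

```python
# Hedged sketch: inspect the evaluation output written by eval-llara.py.
# Only loads the file; the schema of the entries is defined by the eval script.
import json

with open("results/[hso]D-inBC-AuxB-VIMA-80k.json") as f:
    results = json.load(f)

print(type(results).__name__, "with", len(results), "top-level entries")
```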
## Quick Start Guide
- **Minimum Hardware Requirements:**
  - Inference: at least one GPU with a minimum of 24GB of memory. A quick GPU check is sketched after this list.
  - Training: a system with at least 300GB of system RAM and four Ampere (or newer) GPUs, each with a minimum of 24GB of memory.
- **Prepare the Dataset:**
  Visit the `datasets` directory to prepare your dataset for training.
- **Finetune a LLaVA Model:**
  To start finetuning a LLaVA model, refer to the instructions in `train-llava`.
- **Evaluate the Trained Model:**
  Follow the steps in `eval` to assess the performance of your trained model.
- **Train a MaskRCNN for Object Detection:**
  If you want to train a MaskRCNN for object detection, check out `train-maskrcnn` for detailed steps.
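The small helper below (not part of the repo) prints the memory of each visible GPU, which makes it easy to verify the 24GB-per-GPU requirement mentioned above.

```python
# Check the GPU requirement (at least one 24GB GPU for inference).
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; at least one 24GB GPU is required for inference.")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_gb = props.total_memory / 1024**3
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -- {total_gb:.1f} GB")
```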
## Issues
If you encounter any issues or have questions about the project, please submit an issue on our GitHub issues page.
## License
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
## Support us
If you find this work useful in your research, please consider giving it a star ⭐ and citing our work:
```bibtex
@article{li2024llara,
  title={LLaRA: Supercharging Robot Learning Data for Vision-Language Policy},
  author={Li, Xiang and Mata, Cristina and Park, Jongwoo and Kahatapitiya, Kumara and Jang, Yoo Sung and Shang, Jinghuan and Ranasinghe, Kanchana and Burgert, Ryan and Cai, Mu and Lee, Yong Jae and Ryoo, Michael S.},
  journal={arXiv preprint arXiv:2406.20095},
  year={2024}
}
```
Thanks!