CodeBotler Overview

CodeBotler is a system that converts natural language task descriptions into robot-agnostic programs that can be executed by general-purpose service mobile robots. It includes a benchmark (RoboEval) designed for evaluating Large Language Models (LLMs) in the context of code generation for mobile robot service tasks.

This project consists of two key components:

  1. CodeBotler, which converts natural language task descriptions into robot-agnostic programs and deploys them on general-purpose service mobile robots.

  2. RoboEval, a benchmark for evaluating LLM code generation on service mobile robot tasks.

Project website: https://amrl.cs.utexas.edu/codebotler
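
For illustration, a generated program is a short Python function written against a small set of robot skill primitives. The sketch below is illustrative only: the primitive names (go_to, get_current_location, is_in_room, say) are taken from the CodeBotler paper rather than this README, and the stub implementations are included only so the example runs standalone:

# Minimal stubs standing in for the robot skill primitives (assumed names);
# on a real robot these are provided by the CodeBotler runtime, not defined here.
def get_current_location(): return "office"
def go_to(location): print(f"[robot] going to {location}")
def is_in_room(obj): return True
def say(message): print(f"[robot] {message}")

# Task: "Check if there is a stapler in the printer room and come back to tell me."
def task_program():
    start_loc = get_current_location()   # remember where the robot started
    go_to("printer room")                # navigate to the printer room
    found = is_in_room("stapler")        # query perception for the object
    go_to(start_loc)                     # return to the person who asked
    if found:
        say("There is a stapler in the printer room.")
    else:
        say("There is no stapler in the printer room.")

task_program()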

Requirements

We provide a conda environment to run our code. To create and activate the environment:

conda create -n codebotler python=3.10
conda activate codebotler
pip install -r requirements.txt

After setting up the conda environment, install PyTorch by following the instructions on the official PyTorch website, selecting the build that matches your CUDA version (note: do not install the CPU-only build).
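
For example, on a machine with CUDA 11.8, the pip command produced by the PyTorch website currently looks like the following; the wheel index URL changes with the CUDA version, so copy the exact command from pytorch.org:

pip install torch --index-url https://download.pytorch.org/whl/cu118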

Language Model Options

CodeBotler and RoboEval support both API-served models (such as OpenAI's gpt-4, the default for deployment) and open-source models loaded locally through Hugging Face (such as bigcode/starcoder via --model-type automodel).

CodeBotler Deployment Quick-Start Guide

To run the web interface for CodeBotler-Deploy with the default options (OpenAI's gpt-4 model), run:

python3 codebotler.py

This will start the server on localhost:8080. You can then open the interface by navigating to http://localhost:8080/ in your browser.
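
To use a different language model, codebotler.py is assumed here to accept the same model-selection flags as roboeval.py in the benchmark section below (--model-type and --model-name); under that assumption, serving an open-source model would look like:

python3 codebotler.py --model-type automodel --model-name "bigcode/starcoder"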

Run python3 codebotler.py --help for the full list of command-line arguments.

Instructions for deploying on real robots are included in robot_interface/README.md.

RoboEval Benchmark Quick-Start Guide

The instructions below demonstrate how to run the benchmark using the open-source StarCoder model.

  1. Run code generation for the benchmark tasks using the following command:

    python3 roboeval.py --generate --generate-output completions/starcoder \
        --model-type automodel --model-name "bigcode/starcoder" 
    

    This will generate programs for the benchmark tasks and save them as a Python file in the output directory completions/starcoder. It uses the default values for temperature (0.2), top-p (0.9), and num-completions (20), generating 20 programs for each task, which is sufficient for pass@1 evaluation.

    If you would rather not re-run inference, we have included the saved output from every model as a zip file in the completions/ directory. To extract a model's completions, simply run:

    cd completions
    unzip -d <MODEL_NAME> <MODEL_NAME>.zip
    

    For example, you can run:

    cd completions
    unzip -d gpt4 gpt4.zip
    
  2. Evaluate the generated programs using the following command:

    python3 roboeval.py --evaluate --generate-output <Path-To-Program-Completion-Directory> --evaluate-output <Path-To-Evaluation-Result-File-Name>
    

    For example:

    python3 roboeval.py --evaluate --generate-output completions/gpt4/ --evaluate-output benchmark/evaluations/gpt4
    

    This will evaluate the generated programs from the previous step and save all the evaluation results in a Python file.

    If you would rather not re-run evaluation, we have included saved evaluation output from every model in the benchmark/evaluations directory.

  3. Finally, you can compute the pass@1 score for every task:

    python3 evaluate_pass1.py --llm codellama --tasks all
    

    or, to compute it for a subset of tasks only:

    python3 evaluate_pass1.py --llm codellama --tasks CountSavory WeatherPoll
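
For reference, pass@1 over the 20 sampled programs per task is typically computed with the unbiased pass@k estimator of Chen et al. (2021). The snippet below is an illustrative sketch of that computation, not necessarily the exact implementation inside evaluate_pass1.py:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n sampled programs, c of which pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 completions for one task, 7 of which pass every unit test.
print(pass_at_k(n=20, c=7, k=1))  # 0.35, i.e. 7/20 for k = 1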