ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts

Bilel Benjdira, Anis Koubaa, and Anas M. Ali | arXiv | YouTube

Robotics and Internet of Things Lab (RIOTU Lab), Prince Sultan University, Saudi Arabia

Inspired by ROSGPT. Both projects aim to bridge the gap between robotics, natural language understanding, and image analysis.

Collaborators who want to participate in this project are very welcome.


Video Demo

An illustrative video demonstration of ROSGPT_Vision is provided: ROSGPT Video Demonstration

Overview

ROSGPT_Vision offers a unified platform that allows robots to perceive, interpret, and interact with visual data through natural language. The framework leverages state-of-the-art language models, including LLaVA, MiniGPT-4, and Caption-Anything, to facilitate advanced reasoning about image data. LangChain is used for easy customization of the prompts. The provided implementation includes the CarMate application, a driver monitoring and assistance system designed to ensure safe and efficient driving experiences.
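As a rough illustration of the prompt-customization idea (a sketch only, not the project's actual code), the snippet below uses LangChain's PromptTemplate to turn an image description produced by the vision model into an LLM query. The template wording and the variable name are our own assumptions:

```python
# Illustrative sketch only, not ROSGPT_Vision's actual implementation.
# It shows how LangChain's PromptTemplate can parameterize the LLM prompt
# with the description produced by the image semantics module.
from langchain.prompts import PromptTemplate

# Hypothetical template; in ROSGPT_Vision the real LLM prompt is defined
# in the YAML configuration file.
advice_template = PromptTemplate(
    input_variables=["image_description"],
    template=(
        "You are a driving assistant. The in-cabin camera reports: "
        "{image_description}\n"
        "Reply with one short safety instruction for the driver."
    ),
)

# Example description, similar to what the visual prompt would return.
description = "The driver is looking at a phone instead of the road."
print(advice_template.format(image_description=description))
```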

ROSGPT_Vision diagram

<img src="https://github.com/bilel-bj/ROSGPT_Vision/blob/main/ROSGPT_Vision.png" width="900" height="600"/>

Prompting Robotic Modalities (PRM) Design Pattern

For more information, see the arXiv paper.

<img src="https://github.com/bilel-bj/ROSGPT_Vision/blob/main/IRM_Diagram%20(1).png" width="800" height="500"/>

CarMate Application

CarMate is a complete application for monitoring driver behavior that was developed simply by setting two prompts in the YAML file. It automatically analyzes the input video using the visual prompt, determines what should be done using the LLM prompt, and gives the driver an instant alert when needed.

These are the prompts used to develop the application, without needing extra code:

The Visual prompt:

Visual prompt: "Describe the driver’s current level of focus 
on driving based on the visual cues, Answer with one short sentence."

The LLM prompt:

LLM prompt:"Consider the following ontology: You must write your Reply 
with one short sentence. Behave as a carmate that surveys the driver 
and gives him advice and instruction to drive safely. You will be given 
human language prompts describing an image. Your task is to provide 
appropriate instructions to the driver based on the description."
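To make the "two prompts in a YAML file" idea concrete, here is a minimal, hypothetical sketch of how such a configuration could be written and read with PyYAML. The key names visual_prompt and llm_prompt are illustrative and not necessarily the keys used in the repository's cfg files:

```python
# Hypothetical sketch: keep the two CarMate prompts in YAML and load them
# at runtime. Key names are illustrative, not the exact ROSGPT_Vision schema.
import yaml  # PyYAML

config_text = """
visual_prompt: >
  Describe the driver's current level of focus on driving based on the
  visual cues. Answer with one short sentence.
llm_prompt: >
  Behave as a carmate that surveys the driver and gives advice and
  instructions to drive safely. Reply with one short sentence.
"""

config = yaml.safe_load(config_text)
print(config["visual_prompt"])  # fed to the image semantics module
print(config["llm_prompt"])     # fed to the LLM that advises the driver
```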

Below are three example scenarios captured during driving:

Scenario 1: The driver is using a phone

The top box shows the description generated by the image semantics module for the input image using the visual prompt, while the second box shows the alert given to the driver, generated using the LLM prompt.

<img src="https://github.com/bilel-bj/ROSGPT_Vision/blob/main/demo-distraction-phone.png" width="900" height="600"/>

Scenario 2: The driver is taking pictures

<img src="https://github.com/bilel-bj/ROSGPT_Vision/blob/main/demo-distraction-taking-pictures.png" width="900" height="600"/>

Scenario 3: The driver is drinking

<img src="https://github.com/bilel-bj/ROSGPT_Vision/blob/main/demo-distraction-drinking.png" width="900" height="600"/>

Installation

To use ROSGPT_Vision, follow these steps:

1. Prepare the code and the environment

Clone our repository (together with MiniGPT-4 and LLaVA), then create and activate the Python environment via the following commands:

  git clone https://github.com/bilel-bj/ROSGPT_Vision.git
  cd ROSGPT_Vision
  git clone https://github.com/Vision-CAIR/MiniGPT-4.git
  git clone https://github.com/haotian-liu/LLaVA.git
  conda env create -f environment.yml
  conda activate ROSGPT_Vision

2. Install the required dependencies

Usage

  1. All parameters of ROSGPT_Vision can be adjusted in the corresponding .yaml configuration file.

The YAML file contains six main sections of configuration parameters; the sketch below shows one way to inspect them.
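A minimal sketch for listing those sections, assuming the configuration path used in the run commands below (adjust it to your own workspace); the section names are read from the file rather than reproduced here:

```python
# Sketch: print the top-level sections and parameter names of a
# ROSGPT_Vision configuration file. The path matches the run commands
# below; adjust it to your workspace.
import yaml  # PyYAML

CFG_PATH = "src/rosgpt_vision/rosgpt_vision/cfg/driver_phone_usage.yaml"

with open(CFG_PATH) as f:
    cfg = yaml.safe_load(f)

for section, params in cfg.items():
    print(section)
    if isinstance(params, dict):
        for name in params:
            print(f"  {name}")
```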

  2. Run in a terminal on the local machine:

        colcon build --packages-select rosgpt_vision
        source install/setup.bash
        python3 src/rosgpt_vision/rosgpt_vision/rosgpt_vision_node_web_cam.py
        python3 src/rosgpt_vision/rosgpt_vision/ROSGPT_Vision_Camera_Node.py /home/anas/ros2_ws/src/rosgpt_vision/rosgpt_vision/cfg/driver_phone_usage.yaml

     Then, in a second terminal:

        colcon build --packages-select rosgpt_vision
        source install/setup.bash
        python3 src/rosgpt_vision/rosgpt_vision/ROSGPT_Vision_GPT_Consultation_Node.py /home/anas/ros2_ws/src/rosgpt_vision/rosgpt_vision/cfg/driver_phone_usage.yaml

  3. Echo the output topics:

        ros2 topic echo /Image_Description

        ros2 topic echo /GPT_Consultation
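The two topics can also be consumed programmatically. Below is a minimal ROS 2 subscriber sketch; it assumes both topics publish std_msgs/msg/String, which should be verified against the node sources:

```python
# Minimal ROS 2 subscriber sketch for the two output topics shown above.
# Assumption: both topics carry std_msgs/msg/String; verify against the
# ROSGPT_Vision node sources before relying on this.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class CarMateListener(Node):
    def __init__(self):
        super().__init__('carmate_listener')
        self.create_subscription(String, '/Image_Description', self.on_description, 10)
        self.create_subscription(String, '/GPT_Consultation', self.on_advice, 10)

    def on_description(self, msg):
        self.get_logger().info(f'Scene: {msg.data}')

    def on_advice(self, msg):
        self.get_logger().info(f'Advice: {msg.data}')


def main():
    rclpy.init()
    node = CarMateListener()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```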

Citation

If you use this work, please cite the arXiv paper:

@misc{benjdira2023rosgptvision,
  title={ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts}, 
  author={Bilel Benjdira and Anis Koubaa and Anas M. Ali},
  year={2023},
  eprint={2308.11236},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
  }

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. You are free to use, share, and adapt this material for non-commercial purposes, as long as you provide attribution to the original author(s) and the source.

Acknowledgement

The code is based on ROSGPT, LLaVA, MiniGPT-4, Caption-Anything, and SAM. Please also follow their licenses. Thanks for their awesome work.

Contribute

As this project is still in progress, contributions are welcome! To contribute, please follow these steps:

  1. Fork the repository on GitHub.
  2. Create a new branch for your feature or bugfix.
  3. Commit your changes and push them to your fork.
  4. Create a pull request to the main repository.

Before submitting your pull request, please ensure that your changes do not break the build and adhere to the project's coding style.

For any questions or suggestions, please open an issue on the GitHub issue tracker.