vimGPT

Giving multimodal models an interface to play with.

https://github.com/ishan0102/vimGPT/assets/47067154/467be2ac-7e8d-47de-af89-5bb6f51c1c31

Overview

Numerous startups and open-source projects are exploring LLMs as a way to browse the web. With this project, I was interested in seeing if we could use only GPT-4V's vision capabilities for web browsing.

The issue is that it's hard to determine what the model wants to click on without giving it the browser DOM as text. Vimium is a Chrome extension that lets you navigate the web with only your keyboard, so I thought it would be interesting to see if Vimium could give the model a way to interact with the web.
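To make the idea concrete, here is a minimal sketch of the loop, assuming the openai Python client and Playwright are installed. The model name, prompt, and next_keys helper are illustrative rather than the repo's actual code, and the browser is assumed to already have Vimium loaded (see Usage):

import base64

from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

def next_keys(screenshot: bytes, objective: str) -> str:
    # Ask GPT-4V which Vimium hint letters to type, given a screenshot.
    b64 = base64.b64encode(screenshot).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": f"Objective: {objective}. Reply with only the Vimium hint keys to press next."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return response.choices[0].message.content

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # Vimium assumed loaded; see Usage
    page = browser.new_page()
    page.goto("https://news.ycombinator.com")
    page.keyboard.press("f")                     # show Vimium's link hints
    keys = next_keys(page.screenshot(), "Open the top story")
    page.keyboard.type(keys)                     # follow the chosen hint

Because Vimium labels every clickable element with a short letter hint, the model only has to read letters off the screenshot rather than guess pixel coordinates.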

Usage

Install Python requirements:

pip install -r requirements.txt

Download Vimium locally (the extension has to be loaded manually when running Playwright):

./setup.sh
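Playwright can't install extensions from the Chrome Web Store, which is why the unpacked Vimium directory is passed to Chromium at launch. A minimal sketch of that step, assuming the extension was downloaded to ./vimium (the path and profile directory below are illustrative):

from playwright.sync_api import sync_playwright

VIMIUM_PATH = "./vimium"  # assumed location of the unpacked extension

with sync_playwright() as p:
    # Extensions only work in a headed, persistent browser context.
    context = p.chromium.launch_persistent_context(
        "/tmp/vimgpt-profile",
        headless=False,
        args=[
            f"--disable-extensions-except={VIMIUM_PATH}",
            f"--load-extension={VIMIUM_PATH}",
        ],
    )
    page = context.new_page()
    page.goto("https://example.com")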

Run the script:

python main.py

Voice Mode

Engage with the browser using voice commands: simply say your objective and watch vimGPT perform actions in real time.

python main.py --voice
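A minimal sketch of how the voice input could work, assuming the sounddevice and soundfile packages plus OpenAI's Whisper transcription API; the recording length and function name are illustrative, and the actual voice mode in main.py may be implemented differently:

import sounddevice as sd
import soundfile as sf
from openai import OpenAI

SAMPLE_RATE = 16_000
SECONDS = 5

def get_spoken_objective() -> str:
    # Record a short clip from the default microphone.
    audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()  # block until the recording finishes
    sf.write("objective.wav", audio, SAMPLE_RATE)
    # Transcribe the clip with Whisper and use the text as the objective.
    client = OpenAI()
    with open("objective.wav", "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    return transcript.text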

Ideas

Feel free to collaborate with me on this; I have a number of ideas:

Shoutouts

References