
Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

<div align="center">

[Website] [arXiv Paper]

<img src="https://img.shields.io/badge/Framework-PyTorch-red.svg"/>

<div align="left">

Overview

Steve-Eye is an end-to-end trained large multimodal model that equips LLM-based embodied agents with visual perception in open worlds. It integrates an LLM with a visual encoder to process visual-text inputs and generate multimodal feedback. We adopt a semi-automatic strategy to collect an extensive dataset of 850K open-world instruction pairs, enabling the model to cover three essential functions for an agent: multimodal perception, foundational knowledge base, and skill prediction and planning. Results for the three corresponding benchmarks (ENV-VC, FK-QA, and skill planning) are reported below.
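For readers who want a sense of how such a visual-text pipeline fits together before the code is released, the minimal PyTorch sketch below shows one way an image encoder's tokens can be projected into an LLM's embedding space and processed jointly with text. It is an illustrative assumption, not the released Steve-Eye implementation; `VisualTextAgent`, the toy stand-in modules, and all dimensions are placeholders.

```python
# Minimal sketch, NOT the released Steve-Eye code: it only illustrates coupling a
# visual encoder with an LLM so image and text tokens share one sequence.
# All class names, dimensions, and the toy stand-in modules are assumptions.
import torch
import torch.nn as nn

class VisualTextAgent(nn.Module):
    def __init__(self, visual_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int, llm_dim: int):
        super().__init__()
        self.visual_encoder = visual_encoder          # e.g. CLIP / VQ-GAN / MineCLIP features
        self.projector = nn.Linear(vis_dim, llm_dim)  # map visual features into the LLM embedding space
        self.llm = llm                                # any module consuming (B, L, llm_dim) embeddings

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vis_feats = self.visual_encoder(images)       # (B, N, vis_dim) visual tokens
        vis_tokens = self.projector(vis_feats)        # (B, N, llm_dim)
        # Prepend projected visual tokens to the text embeddings; the LLM attends over both.
        return self.llm(torch.cat([vis_tokens, text_embeds], dim=1))

if __name__ == "__main__":
    B, N, T, VIS, LLM_DIM, VOCAB = 2, 9, 5, 64, 128, 1000

    class ToyVision(nn.Module):
        """Stand-in encoder that returns N random patch tokens per image."""
        def forward(self, images: torch.Tensor) -> torch.Tensor:
            return torch.randn(images.size(0), N, VIS)

    toy_llm = nn.Linear(LLM_DIM, VOCAB)               # stand-in "LLM": per-token vocab logits
    agent = VisualTextAgent(ToyVision(), toy_llm, VIS, LLM_DIM)
    logits = agent(torch.randn(B, 3, 96, 96), torch.randn(B, T, LLM_DIM))
    print(logits.shape)                               # torch.Size([2, 14, 1000])
```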

Model

To be released soon

Dataset

To be released soon

<div align="center"> <img src="figs/steve-eye.png" /> </div>

Environmental Visual Captioning (ENV-VC) Results

| Model | Visual Encoder | Inventory <img src="figs/icons/inventory.png" height="15pt"> | Equip <img src="figs/icons/iron-axe.png" height="15pt"> | Object in Sight <img src="figs/icons/cow.png" height="15pt"> | Life <img src="figs/icons/heart.jpg" height="15pt"> | Food <img src="figs/icons/hunger.png" height="15pt"> | Sky <img src="figs/icons/sky.png" height="15pt"> |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BLIP-2 | CLIP | 41.6 | 58.5 | 64.7 | 88.5 | 87.9 | 57.6 |
| Llama-2-7b | - | - | - | - | - | - | - |
| Steve-Eye-7b | VQ-GAN | 89.9 | 78.3 | 87.4 | 92.1 | 90.2 | 68.5 |
| Steve-Eye-13b | MineCLIP | 44.5 | 61.8 | 72.2 | 89.2 | 88.6 | 68.2 |
| Steve-Eye-13b | VQ-GAN | 91.1 | 79.6 | 89.8 | 92.7 | 90.8 | 72.7 |
| Steve-Eye-13b | CLIP | 92.5 | 82.8 | 92.1 | 93.1 | 91.5 | 73.8 |

Foundational Knowledge Question Answering (FK-QA) Results

| Model | Wiki Page (score) | Wiki Table (score) | Recipe (score) | TEXT All (score) | TEXT (accuracy) | IMG (accuracy) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | 6.90 | 6.21 | 7.10 | 6.62 | 37.01% | - |
| Llama-2-13b | 6.31 (-0.59) | 6.16 (-0.05) | 6.31 (-0.79) | 6.24 (-0.38) | 37.96% | - |
| Llama-2-70b | 6.91 (+0.01) | 6.97 (+0.76) | 7.23 (+0.13) | 7.04 (+0.42) | 38.27% | - |
| GPT-3.5-turbo | 7.26 (+0.36) | 7.15 (+0.94) | 7.97 (+0.87) | 7.42 (+0.80) | 41.78% | - |
| Steve-Eye-7b | 7.21 (+0.31) | 7.28 (+1.07) | 7.82 (+0.72) | 7.54 (+0.92) | 43.25% | 62.83% |
| Steve-Eye-13b | 7.38 (+0.48) | 7.44 (+1.23) | 7.93 (+0.83) | 7.68 (+1.06) | 44.36% | 65.13% |

Skill Planning Results

Model<img src="figs/mc/stick.png" height="15pt"><img src="figs/mc/crafting_table.png" height="15pt"><img src="figs/mc/bowl.png" height="15pt"><img src="figs/mc/chest.png" height="15pt"><img src="figs/mc/trapdoor.png" height="15pt"><img src="figs/mc/sign.png" height="15pt"><img src="figs/mc/wooden_pickaxe.png" height="15pt"><img src="figs/mc/furnace.png" height="15pt"><img src="figs/mc/stone_stairs.png" height="15pt"><img src="figs/mc/stone_slab.png" height="15pt"><img src="figs/mc/cobblestone_wall.png" height="15pt"><img src="figs/mc/lever.png" height="15pt"><img src="figs/mc/torch.png" height="15pt"><img src="figs/mc/stone_pickaxe.png" height="15pt">
MineAgent0.000.030.000.000.000.000.000.000.000.000.210.00.050.0
gpt assistant0.300.170.070.000.030.000.200.000.200.030.130.000.100.00
Steve-Eye-auto0.300.270.370.230.200.170.260.070.130.170.200.330.000.13
Steve-Eye0.400.300.430.530.330.370.430.300.430.470.470.400.130.23
Model<img src="figs/mc/milk_bucket.png" height="15pt"><img src="figs/mc/wool.png" height="15pt"><img src="figs/mc/beef.png" height="15pt"><img src="figs/mc/mutton.png" height="15pt"><img src="figs/mc/bed.png" height="15pt"><img src="figs/mc/painting.png" height="15pt"><img src="figs/mc/carpet.png" height="15pt"><img src="figs/mc/item_frame.png" height="15pt"><img src="figs/mc/cooked_beef.png" height="15pt"><img src="figs/mc/cooked_mutton.png" height="15pt">
MineAgent0.460.500.330.350.00.00.060.00.00.0
gpt assistant0.570.760.430.300.000.000.370.000.030.00
Steve-Eye-auto0.700.630.400.300.1700.370.030.070.00
Steve-Eye0.730.670.470.330.230.070.430.100.170.07

Citation

```bibtex
@article{zheng2023steve,
  title={Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds},
  author={Zheng, Sipeng and Liu, Jiazheng and Feng, Yicheng and Lu, Zongqing},
  journal={arXiv preprint arXiv:2310.13255},
  year={2023}
}
```