
<p align="center"> <img src="figures/logo.png" alt="logo" width = "600"> <br/> </p>

Open LLaMA Eyes to See the World

This project aims to extend the LLaMA model with visual information understanding, similar to GPT-4, and to further explore the potential of large language models.

Generally, we use a CLIP vision encoder to extract image features, which are then projected with an MLP-based or Transformer-based connection network into the text embedding dimensionality. The visual representation (wrapped in additional special tokens [boi] and [eoi]) is then concatenated with the text representation and trained in an autoregressive manner. The framework is similar to Kosmos-1 and PaLM-E. A minimal sketch of this pipeline is shown below.
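The following PyTorch sketch illustrates the MLP-based connection network and the concatenation of the wrapped visual representation with text embeddings. Dimensions, module names, and the learned [boi]/[eoi] embeddings are illustrative assumptions, not the exact implementation in this repository.

```python
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """MLP connection network: projects CLIP image features into the
    LLaMA text embedding dimensionality (sizes are assumptions)."""

    def __init__(self, clip_dim=1024, text_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features):
        # image_features: (batch, num_patches, clip_dim)
        return self.mlp(image_features)


def build_multimodal_inputs(image_embeds, text_embeds, boi_embed, eoi_embed):
    """Wrap projected image embeddings with [boi]/[eoi] embeddings and
    prepend them to the text embeddings for autoregressive training."""
    batch = image_embeds.size(0)
    boi = boi_embed.expand(batch, -1, -1)  # (batch, 1, text_dim)
    eoi = eoi_embed.expand(batch, -1, -1)
    return torch.cat([boi, image_embeds, eoi, text_embeds], dim=1)


if __name__ == "__main__":
    projector = VisualProjector()
    clip_feats = torch.randn(2, 257, 1024)   # e.g. CLIP ViT-L/14 patch features
    text_embeds = torch.randn(2, 32, 4096)   # token embeddings from LLaMA
    boi = torch.randn(1, 1, 4096)            # hypothetical learned [boi] embedding
    eoi = torch.randn(1, 1, 4096)            # hypothetical learned [eoi] embedding
    inputs = build_multimodal_inputs(projector(clip_feats), text_embeds, boi, eoi)
    print(inputs.shape)                      # (2, 1 + 257 + 1 + 32, 4096)
```

The concatenated sequence would then be fed to the language model so that the next-token loss is computed over the text positions as usual.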

Reference

[1] https://github.com/facebookresearch/llama

[2] https://github.com/tloen/alpaca-lora