VL-GPT
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
Project Termination
We regret to inform you that the VL-GPT project has been terminated. Unfortunately, the authors Jinguo and Xiaohan left the company and were unable to refactor the codebase before their departure. As a result, the source code and weights for this work cannot be released.
However, the main contribution of this work, an image tokenizer that produces continuous embeddings and its application in a large multimodal model, has also been adopted in another project from our team, SEED-X, which is already open source. We recommend referring to the SEED-X project for insights and implementation details.
We sincerely apologize for not being able to release this work as an open-source project. Thank you for your understanding.
Introduction
<div align="center"> <span class="author-block"> <a href="https://scholar.google.com/citations?user=YfHg5lQAAAAJ&hl=en" target="_blank">Jinguo Zhu</a><sup>1*</sup>, </span> <span class="author-block"> <a href="https://dingxiaohan.xyz/" target="_blank">Xiaohan Ding</a><sup>2*</sup>, </span> <span class="author-block"> </span> <a href="https://geyixiao.com/" target="_blank">Yixiao Ge</a><sup>2</sup>, </span> <span class="author-block"> </span> <a href="https://geyuying.github.io/" target="_blank">Yuying Ge</a><sup>2</sup>, </span> </br> <span class="author-block"> <a target="_blank">Sijie Zhao</a><sup>2</sup>, </span> <span class="author-block"> <a href="https://hszhao.github.io/" target="_blank">Hengshuang Zhao</a><sup>3</sup>, </span> <span class="author-block"> <a href="https://gr.xjtu.edu.cn/web/xhw" target="_blank">Xiaohua Wang</a><sup>1</sup>, </span> <span class="author-block"> <a href="https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en&oi=ao" target="_blank">Ying Shan</a><sup>2</sup> </span> </div> <div align="center"> <sup>1</sup> <a target='_blank'>Xi'an Jiaotong University</a> <sup>2</sup> <a href='https://ai.tencent.com/' target='_blank'>Tencent AI Lab</a> <sup>3</sup> <a target='_blank'>The University of Hong Kong</a>  </br> <sup>*</sup> Equal Contribution  </div><a href="https://arxiv.org/abs/2312.09251"><img src="https://img.shields.io/badge/Paper-PDF-orange"></a> <a href="#LICENSE--citation"> <img alt="License: Apache2.0" src="https://img.shields.io/badge/LICENSE-Apache%202.0-blue.svg"/> </a>
<p align="center" width="100%"> <img src="assets/overview.png" width="100%" height="60%"> </p>-
VL-GPT is a generative pre-trained transformer for vision and language understanding and generation that can perceive and generate visual and linguistic data concurrently. By employing a straightforward auto-regressive objective, VL-GPT achieves unified pre-training across the image and text modalities.
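To make the objective concrete, below is a minimal, self-contained PyTorch sketch of unified auto-regressive pre-training over an interleaved image-text sequence. This is not the released implementation; the module names, dimensions, interleaving order, and the MSE regression loss on visual positions are all illustrative assumptions.

```python
# Illustrative sketch (NOT the released VL-GPT code): one causal transformer
# predicts the next element of a mixed sequence -- discrete text tokens via
# cross-entropy, continuous visual embeddings via regression.
import torch
import torch.nn as nn

class ToyUnifiedARModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Continuous visual embeddings are projected into the same space as
        # text embeddings instead of being quantized to discrete ids.
        self.visual_proj = nn.Linear(d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.text_head = nn.Linear(d_model, vocab_size)  # next-token logits
        self.visual_head = nn.Linear(d_model, d_model)   # next-embedding regression

    def forward(self, text_ids, visual_embeds):
        # Interleaving is simplified to [all image embeddings, then all text].
        seq = torch.cat([self.visual_proj(visual_embeds),
                         self.text_embed(text_ids)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        return self.backbone(seq, mask=causal)

# One training step: a single next-element prediction loss over the sequence.
model = ToyUnifiedARModel()
text_ids = torch.randint(0, 32000, (2, 16))      # (batch, text length)
visual_embeds = torch.randn(2, 32, 512)          # (batch, visual tokens, dim)
h = model(text_ids, visual_embeds)
text_logits = model.text_head(h[:, 32:-1])       # predict next text token
text_loss = nn.functional.cross_entropy(
    text_logits.reshape(-1, 32000), text_ids[:, 1:].reshape(-1))
visual_pred = model.visual_head(h[:, :31])       # predict next visual embedding
visual_loss = nn.functional.mse_loss(visual_pred, visual_embeds[:, 1:])
loss = text_loss + visual_loss
loss.backward()
```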
We also propose an image tokenizer-detokenizer framework for converting between raw images and continuous visual embeddings, analogous to the role of BPE tokenization in language models; a hypothetical interface sketch follows.
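The sketch below shows the round trip that framework enables: pixels to a short sequence of continuous embeddings and back. The class names, the query-based ViT-style encoder, and the naive linear pixel decoder are assumptions for illustration only, not the released design, which would pair the tokenizer with a much stronger decoder.

```python
# Hypothetical tokenizer/detokenizer interface (NOT the released VL-GPT code).
import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """Maps a raw image to a short sequence of continuous visual embeddings."""
    def __init__(self, n_tokens=32, d_model=512):
        super().__init__()
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.queries = nn.Parameter(torch.randn(n_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)

    def forward(self, image):                                # (B, 3, 224, 224)
        feats = self.patchify(image).flatten(2).transpose(1, 2)  # (B, 196, D)
        q = self.queries.expand(image.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)               # (B, n_tokens, D)
        return tokens

class ImageDetokenizer(nn.Module):
    """Decodes continuous visual embeddings back to pixels (a naive
    cross-attention + linear decoder, purely for illustration)."""
    def __init__(self, d_model=512, grid=14, patch=16):
        super().__init__()
        self.grid, self.patch = grid, patch
        self.patch_queries = nn.Parameter(torch.randn(grid * grid, d_model))
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.to_rgb = nn.Linear(d_model, patch * patch * 3)

    def forward(self, tokens):                               # (B, n_tokens, D)
        B = tokens.size(0)
        q = self.patch_queries.expand(B, -1, -1)
        patches, _ = self.attn(q, tokens, tokens)            # (B, 196, D)
        x = self.to_rgb(patches)                             # (B, 196, 768)
        x = x.view(B, self.grid, self.grid, self.patch, self.patch, 3)
        # Reassemble the 14x14 grid of 16x16 patches into a 224x224 image.
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, 224, 224)
        return x

tok, detok = ImageTokenizer(), ImageDetokenizer()
img = torch.randn(2, 3, 224, 224)
rec = detok(tok(img))          # round trip: image -> embeddings -> image
assert rec.shape == img.shape
```

Because the embeddings stay continuous (no vector quantization), the detokenizer can be trained with a plain reconstruction loss, and the same embeddings can be consumed and predicted by the auto-regressive model above.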
License
This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.