
<p align="center"> <br> <img src="./pics/banner.png" width="600"/> <br> </p> <p align="center"> <a href="https://github.com/iflytek/VLE/blob/main/LICENSE"> <img alt="GitHub" src="https://img.shields.io/github/license/iflytek/VLE.svg?color=blue&style=flat-square"> <img alt="GitHub repo size" src="https://img.shields.io/github/repo-size/iflytek/vle"> <img alt="GitHub top language" src="https://img.shields.io/github/languages/top/iflytek/vle"> <img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/iflytek/vle"> </a> </p>

VLE: Vision-Language Encoder

Multimodal pre-trained models are trained on massive multimodal data, and they can utilize information from different modalities and perform various cross-modal tasks.

In this repository, we introduce VLE (Vision-Language Encoder), an image-text multimodal understanding model built on pre-trained text and image encoders. It can be used for multimodal discriminative tasks such as visual question answering and image-text retrieval. In particular, on the visual commonsense reasoning (VCR) task, which demands high-level language understanding and reasoning skills, VLE achieves the best performance among public methods.

Recently, LLMs (Large Language Models) have achieved great success and have been used for a wide range of text tasks, including translation, question answering, text summarization, etc. While LLMs are unimodal, their abilities can be leveraged for multimodal understanding tasks. We propose a VQA+LLM pipeline that integrates multimodal models with LLMs for the visual question answering task. It helps the VQA model generate more accurate and fluent answers.

We open-source VLE-related resources to promote academic research and better serve our community.

Try our VLE-based VQA Demo at 🤗Space 👇👇👇

<div align=center><a href="https://huggingface.co/spaces/hfl/VQA_VLE_LLM"><img src="pics/demo-banner.png" alt="VLE-based VQA Demo" width="800" /></a></div>

Chinese LERT | Chinese and English PERT | Chinese MacBERT | Chinese MiniRBT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation tool TextBrewer | Model pruning tool TextPruner

More resources released by HFL: https://github.com/iflytek/HFL-Anthology

Table of Contents

| Section | Description |
| --- | --- |
| Introduction | Introduction to VLE |
| Downloads | Download links for VLE |
| Comparison | Comparison of VLE with other models |
| VQA with LLM | Visual question answering with LLM |
| Usage | How to load VLE for different tasks |

Introduction

Structure

The structure of VLE is similar to METER: it consists of two unimodal encoders for text and image respectively, followed by a cross-modal fusion module. VLE nevertheless differs from METER in several structural details.
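For intuition, the shared layout can be sketched as a small PyTorch-style module. This is only an illustrative outline under our own naming; it is not the actual VLE implementation found in models/VLE:

```python
import torch.nn as nn

class TwoTowerFusionEncoder(nn.Module):
    """Illustrative sketch of a METER/VLE-style layout: two unimodal
    encoders whose outputs are joined by a cross-modal fusion module."""

    def __init__(self, text_encoder: nn.Module, image_encoder: nn.Module, fusion: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder    # e.g. a DeBERTa-v3 text encoder
        self.image_encoder = image_encoder  # e.g. a CLIP-ViT image encoder
        self.fusion = fusion                # cross-modal fusion module (cross-attention layers)

    def forward(self, text_inputs: dict, pixel_values):
        # Encode each modality separately ...
        text_feats = self.text_encoder(**text_inputs).last_hidden_state
        image_feats = self.image_encoder(pixel_values).last_hidden_state
        # ... then fuse the two token sequences into joint multimodal representations.
        return self.fusion(text_feats, image_feats)
```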

Pre-training

VLE is pre-trained with image-caption pairs. There are four objectives applied during the pre-training stage:

VLE models are pre-trained on 14M public English image-caption pairs for 25k steps with a batch size of 2048.

The following figure illustrates the VLE structure and the pre-training objectives (for simplicity, we omit the PBC objective in the figure).

<div align=center><img src="pics/model.png" alt="VLE structure and pre-training tasks" width="500" /></div>

Adaptation for downstream tasks

Visual Question Answering (VQA)

Visual Commonsense Reasoning (VCR)

Downloads

The model weights are in PyTorch format and can be downloaded from the 🤗 Hugging Face model hub. You can either download the weights and configurations manually, or initialize a VLE model with the from_pretrained(model_name) method in your code. See Usage for details.
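For example, passing one of the MODEL_NAME values from the tables below to from_pretrained downloads and caches the corresponding checkpoint automatically. A minimal sketch, assuming the repository's models directory is importable as described in Usage:

```python
# Minimal sketch: fetch a pre-trained checkpoint by its hub name.
# Assumes the repository's models/ directory is on the Python path (see Usage).
from models.VLE import VLEModel, VLEProcessor

model = VLEModel.from_pretrained("hfl/vle-base")          # downloads weights on first use
processor = VLEProcessor.from_pretrained("hfl/vle-base")  # the matching text/image processor
```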

Pre-trained Checkpoints

| Model | Text Encoder | Image Encoder | # Params<sup>*</sup> | MODEL_NAME | Link |
| --- | --- | --- | --- | --- | --- |
| VLE-base | DeBERTa-v3-base | CLIP-ViT-base-patch16 | 378M | hfl/vle-base | link |
| VLE-large | DeBERTa-v3-large | CLIP-ViT-large-patch14 | 930M | hfl/vle-large | link |

<sup>*</sup> : We exclude task heads when counting the number of parameters.

Fine-tuned Checkpoints

| Model | Text Encoder | Image Encoder | MODEL_NAME | Link |
| --- | --- | --- | --- | --- |
| VLE-base-for-VQA | DeBERTa-v3-base | CLIP-ViT-base-patch16 | hfl/vle-base-for-vqa | link |
| VLE-large-for-VQA | DeBERTa-v3-large | CLIP-ViT-large-patch14 | hfl/vle-large-for-vqa | link |
| VLE-base-for-VCR-q2a | DeBERTa-v3-base | CLIP-ViT-base-patch16 | hfl/vle-base-for-vcr-q2a | link |
| VLE-large-for-VCR-q2a | DeBERTa-v3-large | CLIP-ViT-large-patch14 | hfl/vle-large-for-vcr-q2a | link |
| VLE-base-for-VCR-qa2r | DeBERTa-v3-base | CLIP-ViT-base-patch16 | hfl/vle-base-for-vcr-qa2r | link |
| VLE-large-for-VCR-qa2r | DeBERTa-v3-large | CLIP-ViT-large-patch14 | hfl/vle-large-for-vcr-qa2r | link |

Comparison

In the following table, we compare the performance of VLE with METER and other multimodal models. The VQA results are on the test-dev set, and the VCR results are on the dev set.

| Model | VQA | VCR (QA2R) | VCR (Q2A) | #Params | #PT data<sup>*</sup> |
| --- | --- | --- | --- | --- | --- |
| CoCa | 82.3 | - | - | 2.1 B | unknown |
| BeiT-3 | 84.2 | - | - | 1.9 B | 21M(I-T) + 14M(I) + 160G(T) |
| OFA | 82.0 | - | - | 930M | 20M(I-T) + 39M(I) + 140G(T) |
| BLIP | 78.3 | - | - | 385M | ~130M(I-T) |
| METER-base | 77.7 (76.8<sup>†‡</sup>) | 79.8<sup>§</sup> | 77.6<sup>§</sup> | 345M | 9M(I-T) |
| METER-Huge | 80.3 | - | - | 878M | 20M(I-T) |
| VLE-base | 77.6<sup>‡</sup> | 83.7<sup>§</sup> | 79.9<sup>§</sup> | 378M | 15M(I-T) |
| VLE-large | 79.3<sup>‡</sup> | 87.5<sup>§</sup> | 84.3<sup>§</sup> | 930M | 15M(I-T) |

<sup>†</sup> : Result from our reimplementation.

<sup>‡</sup> : Fine-tuning hyperparameters: lr=7e-6, batch_size={256, 512}, num_epochs=10

<sup>§</sup> : Fine-tuning hyperparameters: lr=1e-5, batch_size=128, num_epochs=5

<sup>*</sup> : Pre-training data. I-T: Image-caption pairs. I: Images. T: Text.

From the above results, we can see that:

VQA with LLM

Generating Accurate and Fluent VQA Answers

LLMs have achieved great success on a wide range of text tasks, and their abilities can also be leveraged for multimodal understanding. Specifically, we present a VQA+LLM pipeline that integrates multimodal models with LLMs for the visual question answering task, helping the VQA model generate more accurate and fluent answers.

The workflows are shown in the figure below.

<div align=center><img src="pics/VQALLM_workflow.png" alt="Workflows" width="800" /></div>

(a) VQA: This is the standard way to perform the VQA task with a discriminative model. The question and the image are fed into the multimodal model, and the model is trained to predict the correct answer labels.

(b) VQA + LLM: A captioning model generates a caption of the image. The caption, the question, and the answer candidates predicted by the VQA model are concatenated and fed to the LLM, which is asked to give the most reasonable answer.
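To make step (b) concrete, the sketch below shows one way such a prompt could be assembled before being sent to an LLM. The function and the prompt wording are illustrative assumptions, not the exact prompt used in our demo:

```python
def build_vqa_llm_prompt(caption: str, question: str, candidates: list) -> str:
    """Illustrative prompt assembly for the VQA+LLM workflow (assumed wording).

    The image caption, the question, and the VQA model's top answer candidates
    are concatenated; the LLM is then asked to pick and phrase the most
    reasonable answer.
    """
    candidate_str = ", ".join(candidates)
    return (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        f"Candidate answers from a VQA model: {candidate_str}\n"
        "Based on the image description, give the most reasonable answer "
        "to the question in a complete sentence."
    )

# Example with hypothetical captioner and VQA outputs:
prompt = build_vqa_llm_prompt(
    caption="Two men are loading boxes onto a truck.",
    question="What are the men doing?",
    candidates=["loading truck", "moving", "working"],
)
print(prompt)  # This string would then be passed to the LLM of your choice.
```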

We find that VQA+LLM not only makes more accurate predictions but also generates more fluent and readable answers. Some examples are shown below:

<div align=center><img src="pics/truck.png" alt="men and truck" width="700" /></div> <div align=center><img src="pics/fishing.png" alt="hatch" width="700" /></div>

The demo is available at: https://huggingface.co/spaces/hfl/VQA_VLE_LLM

Usage

Requirements

The model classes and utilities are defined in the *.py files in models/VLE. To import VLE into your code, simply copy the models directory into your project.

To run the following demo code, clone the repository (`git clone https://github.com/iflytek/VLE.git`) and cd into it, so that the commands are executed from the repository's root directory.

Load the VLEModel

```python
from models.VLE import VLEModel, VLEProcessor
from PIL import Image
import torch

model_name = "hfl/vle-large"
images = [Image.open('pics/dogs.png')]
text = ["There are dogs on the grass."]

model = VLEModel.from_pretrained(model_name)
vle_processor = VLEProcessor.from_pretrained(model_name)
multimodal_inputs = vle_processor(text=text, images=images, return_tensors='pt', padding=True)

# forward pass
vle_output = model(**multimodal_inputs)
```

Inference

Visual Question Answering (VQA)

```python
from models.VLE import VLEForVQA, VLEProcessor, VLEForVQAPipeline
from PIL import Image

model_name = "hfl/vle-base-for-vqa"
text = "What is the color of the floor?"
image = Image.open("pics/door.png")

model = VLEForVQA.from_pretrained(model_name)
vle_processor = VLEProcessor.from_pretrained(model_name)
vqa_pipeline = VLEForVQAPipeline(model=model, device='cpu', vle_processor=vle_processor)

vqa_answers = vqa_pipeline(image=image, question=text, top_k=5)
print(f"Question: {text} Answers: {vqa_answers}")
```

Image-Text Matching

```python
from models.VLE import VLEForITM, VLEProcessor, VLEForITMPipeline
from PIL import Image

model_dir = 'hfl/vle-base'
itm_text = ["a photo of a cat.", "a photo of dogs."]
itm_images = Image.open("pics/dogs.png")

print("Init ITM model")
model = VLEForITM.from_pretrained(model_dir)
vle_processor = VLEProcessor.from_pretrained(model_dir)

print("Init ITM pipeline")
itm_pipeline = VLEForITMPipeline(model=model, device='cpu', vle_processor=vle_processor)
itm_pred = itm_pipeline([{"image": itm_images, "text": itm_text[0]},
                         {"image": itm_images, "text": itm_text[1]}])

for t, pred in zip(itm_text, itm_pred):
    print(t, pred)
```

Patch Box Classification

```python
from models.VLE import VLEForPBC, VLEProcessor, VLEForPBCPipeline
from PIL import Image

model_dir = 'hfl/vle-base'
pbc_text = "pink tongues"
pbc_image = Image.open("pics/dogs.png")

print("Init PBC model")
model = VLEForPBC.from_pretrained(model_dir)
vle_processor = VLEProcessor.from_pretrained(model_dir)

print("Init PBC pipeline")
pbc_pipeline = VLEForPBCPipeline(model=model, device='cpu', vle_processor=vle_processor)
pbc_pred = pbc_pipeline(image=pbc_image, text=pbc_text)
print(pbc_text)
pbc_pred['image'].save('pics/pink_tongues.png')
```

Visual Commonsense Reasoning (VCR)

Please follow the instructions in examples/VCR/README.md

Fine-tuning

Fine-tuning on VQA

Please follow the instructions in examples/VQA/README.md

Follow us

Follow the official WeChat account of HFL to keep up with our latest technical developments.

<div align=center><img src="qrcode.png" alt="HFL WeChat official account QR code" /></div>

Disclaimer

This repository's resources are solely intended for academic purposes, and we assume no responsibility for any unforeseen damages or losses that may result from their use.

This is not an official product by iFLYTEK Co., Ltd.

Issues

If you have questions, please submit them in a GitHub Issue.