Pali3

"Figure 1: Overview of the PaLI-3 (5B) model: images are encoded into visual tokens individually by the contrastively pretrained 2B SigLIP vision model. Along with a query, these visual tokens are passed to an 3B encoder-decoder UL2 Transformer which produces the desired answer."

Image path: ViT trained with the SigLIP loss -> visual token embeddings -> UL2 encoder-decoder -> text tokens

Text path: text -> tokenizer -> token embeddings -> UL2 encoder-decoder -> text tokens

Paper: https://arxiv.org/abs/2310.09199


Installation

pip install pali3


Usage

import torch
from pali3.main import Pali3

model = Pali3()

# Dummy inputs: a single 3x256x256 image, a tokenized prompt of length 1024,
# its boolean attention mask, and the target output text tokens.
img = torch.randn(1, 3, 256, 256)
prompt = torch.randint(0, 256, (1, 1024))
mask = torch.ones(1, 1024).bool()
output_text = torch.randint(0, 256, (1, 1024))

result = model.process(img, prompt, output_text, mask)
print(result)



Architecture

Here is the ASCII representation of the model architecture and the stages of training:

Model Architecture:

Image Input
    |
    V
Contrastive Vision Encoder (ViT-G/14)
    |
    V
Transformer Encoder
    |
    V
Transformer Decoder
    |
    V
Text Output

Stages of Training:

Stage 0: Unimodal pretraining
    |
    V
Stage 1: Multimodal training
    |
    V
Stage 2: Resolution increase
    |
    V
Task specialization (transfer)

Model Training Phases

The model architecture consists of a contrastive vision encoder (ViT-G/14) that encodes the image into tokens. These tokens are passed to a transformer encoder and then to a transformer decoder that generates a text output.

The training procedure consists of multiple stages: unimodal pretraining of the vision encoder (Stage 0), multimodal training of the full model (Stage 1), a resolution-increase stage (Stage 2), and finally task specialization via transfer.

Please note that this is a high-level representation and the actual implementation might involve more details and complexities.
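
To make the high-level diagram concrete, here is a minimal PyTorch sketch of the same forward pass: a vision encoder turns the image into a sequence of visual tokens, the tokens are concatenated with the embedded prompt, and an encoder-decoder Transformer produces text logits. The class names (ToyVisionEncoder, ToyEncoderDecoder) and all sizes are illustrative assumptions, not the actual pali3 API or the PaLI-3 model sizes.

import torch
import torch.nn as nn

class ToyVisionEncoder(nn.Module):
    """Stand-in for the SigLIP-pretrained ViT: image -> visual tokens."""
    def __init__(self, d_model=512, patch=16):
        super().__init__()
        # One visual token per image patch, via a strided convolution.
        self.to_patches = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, img):                      # img: (B, 3, H, W)
        x = self.to_patches(img)                 # (B, d_model, H/p, W/p)
        return x.flatten(2).transpose(1, 2)      # (B, num_patches, d_model)

class ToyEncoderDecoder(nn.Module):
    """Stand-in for the UL2 encoder-decoder Transformer."""
    def __init__(self, vocab=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.to_logits = nn.Linear(d_model, vocab)

    def forward(self, visual_tokens, prompt_ids, target_ids):
        # Encoder input: visual tokens concatenated with the embedded prompt.
        enc_in = torch.cat([visual_tokens, self.embed(prompt_ids)], dim=1)
        dec_in = self.embed(target_ids)          # teacher-forced decoder input
        hidden = self.transformer(enc_in, dec_in)
        return self.to_logits(hidden)            # (B, target_len, vocab)

vision, ul2 = ToyVisionEncoder(), ToyEncoderDecoder()
img = torch.randn(1, 3, 256, 256)
prompt_ids = torch.randint(0, 32000, (1, 16))
target_ids = torch.randint(0, 32000, (1, 16))
logits = ul2(vision(img), prompt_ids, target_ids)
print(logits.shape)                              # torch.Size([1, 16, 32000])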


ViT Architecture

Here is the ASCII diagram for the ViT (Vision Transformer):

ViT (Vision Transformer):

Image Input
    |
    V
Patch Extraction
    |
    V
Linear Embedding
    |
    V
Positional Encoding
    |
    V
Transformer Encoder Blocks (Multiple Layers)
    |
    V
Classification Head (Optional)
    |
    V
Output (Image Embeddings)

The ViT starts with patch extraction from the input image. These patches are then linearly embedded and positional encodings are added. The resulting sequence of patch embeddings is passed through multiple layers of transformer encoders. Optionally, a classification head can be added at the end to get class probabilities for image classification tasks. The output of the ViT is the image embeddings.
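
The same flow can be written out as a small, self-contained PyTorch sketch: patch extraction and linear embedding (here fused into one strided convolution), a learned positional encoding, and a stack of transformer encoder layers. It is a toy under assumed sizes, not the SigLIP ViT-G/14 used by PaLI-3.

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=256, patch=16, d_model=256, depth=4, nhead=8):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Patch extraction + linear embedding fused into one strided convolution.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        # Learned positional encoding, one vector per patch position.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img):                       # (B, 3, 256, 256)
        x = self.patch_embed(img)                 # (B, d_model, 16, 16)
        x = x.flatten(2).transpose(1, 2)          # (B, 256 patches, d_model)
        x = x + self.pos_embed                    # add positional encoding
        return self.encoder(x)                    # image embeddings

vit = TinyViT()
embeddings = vit(torch.randn(1, 3, 256, 256))
print(embeddings.shape)                           # torch.Size([1, 256, 256])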


UL2 Encoder/Decoder Transformer

Encoder-Decoder Architecture:

Input (Image + Text Tokens)
    |
    V
Transformer Encoder
    |
    V
Encoder Output (Context for Decoder)
    |
    V
Transformer Decoder
    |
    V
Output (Generated Text)

The encoder-decoder architecture starts with the input, which is a combination of image and text tokens in this case. The input is passed through a transformer encoder, which generates a context for the decoder. The transformer decoder then uses this context to generate the output text.
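
The sketch below isolates that context-passing step with PyTorch's built-in modules: the encoder output serves as the memory the decoder cross-attends to while generating text one token at a time. Module sizes, the BOS id, and the generation length are illustrative assumptions, not the pali3 implementation.

import torch
import torch.nn as nn

d_model, vocab = 256, 1000
embed = nn.Embedding(vocab, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
to_logits = nn.Linear(d_model, vocab)

# Encoder input: image tokens concatenated with embedded text tokens.
image_tokens = torch.randn(1, 64, d_model)
text_ids = torch.randint(0, vocab, (1, 16))
memory = encoder(torch.cat([image_tokens, embed(text_ids)], dim=1))

# Greedy autoregressive decoding: the decoder cross-attends to `memory`.
generated = torch.zeros(1, 1, dtype=torch.long)       # assumed BOS id = 0
for _ in range(8):
    L = generated.size(1)
    causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    hidden = decoder(embed(generated), memory, tgt_mask=causal)
    next_id = to_logits(hidden[:, -1]).argmax(-1, keepdim=True)
    generated = torch.cat([generated, next_id], dim=1)
print(generated)   # (1, 9) generated token ids, starting with the assumed BOS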

Dataset Strategy

Here is a table summarizing the key datasets mentioned in the paper along with their metadata and source links:

| Dataset | Type | Size | Tasks | Source |
| --- | --- | --- | --- | --- |
| ImageNet-22k | Image Classification | 14M images, 21,841 classes | Pretraining | https://github.com/google-research-datasets/ImageNet-21k-P |
| MS COCO | Image Captioning, VQA | 330K images, 80 object categories | Evaluation | https://cocodataset.org |
| Flickr30k | Image Captioning | 31K images | Evaluation | https://www.kaggle.com/dataset/flickr30k |
| VQAv2 | Visual QA | 204K images, 1.1M questions | Evaluation | https://visualqa.org/download.html |
| GQA | Visual QA | 22M graph-based questions | Evaluation | https://cs.stanford.edu/people/dorarad/gqa/download.html |
| RefCOCO/RefCOCO+ | Referring Expression | 19,994/19,992 images | Evaluation | https://github.com/lichengunc/refer |
| TextCaps | Image Captioning | 31,014 images | Evaluation | https://textvqa.org/textcaps |
| TextVQA | Visual QA | 28,408 images | Evaluation | https://textvqa.org/index.html |
| STVQA | Visual QA | 249,991 QA pairs | Evaluation | https://tvqa.cs.unc.edu/ |
| OCR-VQA | Visual QA | 45,336 images | Evaluation | https://ocrvqa.cloudcv.org/ |
| DocVQA | Visual QA | 5,000 document images | Evaluation | https://github.com/doc-vqa/docvqa |
| InfographicVQA | Visual QA | 10,047 infographic images | Evaluation | https://github.com/doc-vqa/InfoVQA |
| WebLI | Image-Text Pairs | 72M image-text pairs in 100+ languages | Pretraining | https://laion.ai/blogs/webli/ |
| JFT-300M | Image Classification | 303M images, 18,291 classes | Pretraining | https://github.com/google-research-datasets/jft300m |
| CrossModal-3600 | Image-Text Retrieval | 31K images, 3600 lang-image pairs | Evaluation | https://laion.ai/crossmodal-3600/ |

License

MIT

Todo

Citation

@misc{2310.09199,
Author = {Xi Chen and Xiao Wang and Lucas Beyer and Alexander Kolesnikov and Jialin Wu and Paul Voigtlaender and Basil Mustafa and Sebastian Goodman and Ibrahim Alabdulmohsin and Piotr Padlewski and Daniel Salz and Xi Xiong and Daniel Vlasic and Filip Pavetic and Keran Rong and Tianli Yu and Daniel Keysers and Xiaohua Zhai and Radu Soricut},
Title = {PaLI-3 Vision Language Models: Smaller, Faster, Stronger},
Year = {2023},
Eprint = {arXiv:2310.09199},
}