Multi-modal Auto-regressive Modeling via Visual Words

[arXiv] [BibTeX]

This is the official repository for the multi-modal large language model VW-LMM.

<div align="left"> <img src="assets/radar.png" width="500"/> </div><br/>

Introduction

We propose VW-LMM, a large multi-modal model (LMM) that, for the first time, successfully performs multi-modal auto-regressive modeling with a single unified objective. Specifically, we introduce the concept of visual words, which maps visual features to probability distributions over the LLM's vocabulary, providing supervision signals for the visual modality. We further explore how visual features are distributed in the LMM's semantic space and whether text embeddings can be used to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the strong performance of our approach. For more technical details, please refer to our paper.
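For intuition, here is a minimal sketch of the visual-words idea (function and variable names are illustrative, not the repository's API): projected visual features are scored against the vocabulary with the language-model head, giving a per-patch probability distribution that can supervise visual positions with the same objective used for text.

```python
import torch
import torch.nn.functional as F

def visual_words(visual_feats: torch.Tensor, lm_head: torch.nn.Module) -> torch.Tensor:
    """Illustrative sketch: turn projected visual features [num_patches, hidden_dim]
    into per-patch probability distributions over the LLM vocabulary [num_patches, vocab_size]."""
    logits = lm_head(visual_feats)      # score every patch against every vocabulary token
    return F.softmax(logits, dim=-1)    # the resulting distributions are the "visual words"
```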

<div align="center"> <img src="assets/model_structure.png" width="800"/> </div><br/>

To verify whether the visual words learnt by VW-LMM faithfully reflect the image content, we take VW-LMM-Vicuna-7B as an example. For each patch in the image, we select the token with the highest probability in its corresponding visual words and compare the region of interest in the image with its visualization. The result is shown below <strong>(best viewed zoomed in)</strong>:
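A rough sketch of how such a visualization can be produced (illustrative code, not the repository's API): take the per-patch argmax over the visual-word distribution and decode it with the tokenizer. The grid size of 24 assumes a 336px CLIP ViT-L/14 encoder (24 × 24 = 576 patches), which is an assumption here.

```python
def top_visual_tokens(visual_word_probs, tokenizer, grid_size=24):
    # visual_word_probs: [num_patches, vocab_size]
    top_ids = visual_word_probs.argmax(dim=-1)                    # most probable token per patch
    tokens = tokenizer.convert_ids_to_tokens(top_ids.tolist())    # map ids back to vocabulary strings
    # arrange tokens in the patch grid so each entry can be overlaid on its image region
    return [tokens[row * grid_size:(row + 1) * grid_size] for row in range(grid_size)]
```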

<div align="center"> <img src="assets/visualization_vp.png"> </div><br/>

Model Zoo

<table>
<tr> <td>Version</td> <td>Size</td> <td>Support pseudo image features</td> <td>Checkpoint</td> </tr>
<tr> <td>VW-LMM-Vicuna</td> <td>7B</td> <td>False</td> <td>VW-LMM-Vicuna-7b</td> </tr>
<tr> <td>VW-LMM-Mistral</td> <td>7B</td> <td>False</td> <td>VW-LMM-Mistral-7b</td> </tr>
<tr> <td>VW-LMM-Vicuna-pif</td> <td>7B</td> <td>True</td> <td>VW-LMM-Vicuna-pif-7b</td> </tr>
</table>

By constructing visual words to introduce visual supervision, VW-LMM achieves the best performance among models at the same 7B scale and attains vision-language understanding competitive with, or even surpassing, models at 13B and larger scales.

<table> <tr> <td>Methods</td> <td>LLM</td> <td>Res.</td> <td>VQA^v2</td> <td>GQA</td> <td>VisWiz</td> <td>SQA^I</td> <td>VQA^T</td> <td>POPE</td> <td>MMB</td> <td>MMB^CN</td> <td>MM-Vet</td> </tr> <tr> <td colspan=12>*Language Modeling Method*</td> </tr> <tr> <td>IDEFICS-80B</td> <td>LLaMA-65B</td> <td>224</td> <td>60.0</td> <td>45.2</td> <td>36.0</td> <td>--</td> <td>30.9</td> <td>--</td> <td>54.5</td> <td>38.1</td> <td>--</td> </tr> <tr> <td>InstructBLIP</td> <td>Vicuna-13B</td> <td>224</td> <td>--</td> <td>49.5</td> <td>33.4</td> <td>63.1</td> <td>50.7</td> <td>78.9</td> <td>--</td> <td>--</td> <td>25.6</td> </tr> <tr> <td>BLIP-2</td> <td>Vicuna-13B</td> <td>224</td> <td>41.0</td> <td>41.0</td> <td>19.6</td> <td>61.0</td> <td>42.5</td> <td>85.3</td> <td>--</td> <td>--</td> <td>22.4</td> </tr> <tr> <td>LLaVA-v1.5</td> <td>Vicuna-13B</td> <td>336</td> <td>80.0</td> <td>63.3</td> <td>53.6</td> <td>71.6</td> <td>61.3</td> <td>85.9</td> <td>67.7</td> <td>63.6</td> <td>35.4</td> </tr> <tr> <td>InstructBLIP</td> <td>Vicuna-7B</td> <td>224</td> <td>--</td> <td>49.2</td> <td>34.5</td> <td>60.5</td> <td>50.1</td> <td>--</td> <td>36</td> <td>23.7</td> <td>26.2</td> </tr> <tr> <td>IDEFICS-9B</td> <td>LLaMA-7B</td> <td>224</td> <td>50.9</td> <td>38.4</td> <td>35.5</td> <td>--</td> <td>25.9</td> <td>--</td> <td>48.2</td> <td>25.2</td> <td>--</td> </tr> <tr> <td>Qwen-VL</td> <td>Qwen-7B</td> <td>448</td> <td>78.8</td> <td>59.3</td> <td>35.2</td> <td>67.1</td> <td>63.8</td> <td>--</td> <td>38.2</td> <td>7.4</td> <td>--</td> </tr> <tr> <td>Qwen-VL-Chat</td> <td>Qwen-7B</td> <td>448</td> <td>78.2</td> <td>57.5</td> <td>38.9</td> <td>68.2</td> <td>61.5</td> <td>--</td> <td>60.6</td> <td>56.7</td> <td>--</td> </tr> <tr> <td>LLaVA-v1.5</td> <td>Vicuna-7B</td> <td>336</td> <td>78.5</td> <td>62.0</td> <td>50.0</td> <td>66.8</td> <td>58.2</td> <td>85.9</td> <td>64.3</td> <td>58.3</td> <td>30.5</td> </tr> <tr> <td>MoE-LLaVA-2.7B×4-Top2</td> <td>Phi-2-2.7B</td> <td>336</td> <td>77.6</td> <td>61.4</td> <td>43.9</td> <td>68.5</td> <td>51.4</td> <td>86.3</td> <td>65.2</td> <td>--</td> <td>34.3</td> </tr> <tr> <td colspan=12>*Multi-modal Modeling Method*</td> </tr> <tr> <td>Emu2-Chat</td> <td>LLaMA-33B</td> <td>448</td> <td>84.9</td> <td>65.1</td> <td>54.9</td> <td>65.5</td> <td>66.6</td> <td>--</td> <td>--</td> <td>--</td> <td>48.5</td> </tr> <tr> <td>Emu-I</td> <td>LLaMA-13B</td> <td>224</td> <td>62.0</td> <td>46.0</td> <td>38.3</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>36.3</td> </tr> <tr> <td>MM-Interleaved-SFT</td> <td>Vicuna-13B</td> <td>224</td> <td>80.2</td> <td>60.5</td> <td>54.9</td> <td>--</td> <td>61.0</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> </tr> <tr> <td>Unified-IO 2</td> <td>UIO-2-6.8B</td> <td>384</td> <td>79.4</td> <td>--</td> <td>--</td> <td>86.2</td> <td>--</td> <td>87.7</td> <td>71.5</td> <td>--</td> <td>--</td> </tr> <tr> <td>DreamLLM</td> <td>Vicuna-7B</td> <td>224</td> <td>56.6</td> <td>--</td> <td>38.1</td> <td>--</td> <td>34.9</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> </tr> <tr> <td>VL-GPT-I</td> <td>LLaMA-7B</td> <td>224</td> <td>67.2</td> <td>51.5</td> <td>38.9</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> </tr> <tr> <td>LaVIT-v2</td> <td>LLaMA2-7B</td> <td>224</td> <td>68.3</td> <td>47.9</td> <td>41.0</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> </tr> <tr> <td>VW-LMM</td> <td>Vicuna-7B</td> <td>336</td> <td>78.9</td> <td>62.7</td> <td>48.3</td> <td>68.1</td> <td>57.6</td> 
<td>85.9</td> <td>65.9</td> <td>59.8</td> <td>31.3</td> </tr> <tr> <td>VW-LMM</td> <td>Mistral-7B</td> <td>336</td> <td>80.8</td> <td>65.4</td> <td>58.5</td> <td>75.9</td> <td>63.1</td> <td>87.0</td> <td>80.6</td> <td>79.0</td> <td>44.0</td> </tr> </table>

Setup

Requirements

git clone https://github.com/pengts/VW-LMM.git
cd VW-LMM
pip install -r requirements.txt

Multi-modal Inference

Model Configurations

model_path="VW-LMM-Vicuna"
conv_mode="vicuna_v1"
model_base="llama"
device = "cuda"
model_path="VW-LMM-Mistral"
conv_mode="mistral"
model_base="mistral"
device = "cuda"

VW-LMM-Vicuna-pif

model_path="VW-LMM-Vicuna-pif"
conv_mode="vicuna_v1"
model_base="llama"
device = "cuda"

Model Initialization

# Skip PyTorch's default weight initialization to speed up model creation
disable_torch_init()
model_path = os.path.expanduser(model_path)
model_name = get_model_name_from_path(model_path)
# Load the tokenizer, model weights and the vision tower's image processor
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, model_name, model_base, device=device)
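The helper functions used throughout this walkthrough ship with the repository's own package. As a rough guide, the imports might look like the following; the module paths are an assumption based on a LLaVA-style layout, so check the repository for the exact locations.

```python
# Module paths below follow a LLaVA-style package layout and are an assumption;
# check the VW-LMM repository for the exact import locations.
import os
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
```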

Input Processing

question="Write an exhaustive depiction of the given image."
image_path="./example.jpg"
qs = question
qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

image = Image.open(image_path).convert('RGB')
image_tensor = process_images([image], image_processor, model.config)[0].unsqueeze(0).to(device)
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(device)

Inference

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor.to(dtype=torch.float16, device=device, non_blocking=True),
        do_sample= False,
        temperature=0,
        top_p=None,
        num_beams=1,
        max_new_tokens=128,
        use_cache=True)

# generate() returns the prompt tokens followed by the newly generated tokens;
# sanity-check the prompt echo, then strip it before decoding
input_token_len = input_ids.shape[1]
n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
if n_diff_input_output > 0:
    print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
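For repeated use, the steps above can be bundled into one helper. This is a convenience sketch that assumes the same imports and an already-loaded model, tokenizer and image_processor:

```python
def describe_image(image_path, question, model, tokenizer, image_processor,
                   conv_mode="vicuna_v1", device="cuda", max_new_tokens=128):
    """Convenience wrapper: prompt building, image preprocessing and greedy decoding."""
    qs = DEFAULT_IMAGE_TOKEN + '\n' + question
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    image = Image.open(image_path).convert('RGB')
    image_tensor = process_images([image], image_processor, model.config)[0].unsqueeze(0).to(device)
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX,
                                      return_tensors='pt').unsqueeze(0).to(device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor.to(dtype=torch.float16, device=device, non_blocking=True),
            do_sample=False,
            num_beams=1,
            max_new_tokens=max_new_tokens,
            use_cache=True)

    # strip the echoed prompt before decoding the generated answer
    return tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:],
                                  skip_special_tokens=True)[0].strip()

# e.g. print(describe_image("./example.jpg", "Write an exhaustive depiction of the given image.",
#                           model, tokenizer, image_processor, conv_mode=conv_mode, device=device))
```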

Acknowledgement

We are grateful to the following awesome projects, which we built upon when implementing VW-LMM:

<a name="Citing"></a>Citation

If VW-LMM helps your research, please consider giving this repository a star and citing VW-LMM in your publications.

@misc{peng2024multimodal,
      title={Multi-modal Auto-regressive Modeling via Visual Words}, 
      author={Tianshuo Peng and Zuchao Li and Lefei Zhang and Hai Zhao and Ping Wang and Bo Du},
      year={2024},
      eprint={2403.07720},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}