Multi-modal Auto-regressive Modeling via Visual Words

[arXiv] [BibTeX]

This is the official repository for the multi-modal large language model VW-LMM.

<div align="left"> <img src="assets/radar.png" width="500"/> </div><br/>

Introduction

We propose VW-LMM, a large multi-modal model (LMM) that, for the first time, successfully performs multi-modal auto-regressive modeling with a single unified objective. Specifically, we introduce the concept of visual words, which maps visual features to probability distributions over the LLM's vocabulary, providing supervision signals for the visual modality. We further explore how visual features are distributed in the LMM's semantic space and whether text embeddings can be used to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the strong performance of our approach. For more technical details, please refer to our paper.
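For intuition, here is a minimal sketch of the visual-words idea (function and variable names are illustrative, not the repository's API): projected visual features are scored against the vocabulary with the language-model head, giving a per-patch probability distribution that can supervise visual positions with the same objective used for text.

```python
import torch
import torch.nn.functional as F

def visual_words(visual_feats: torch.Tensor, lm_head: torch.nn.Module) -> torch.Tensor:
    """Illustrative sketch: turn projected visual features [num_patches, hidden_dim]
    into per-patch probability distributions over the LLM vocabulary [num_patches, vocab_size]."""
    logits = lm_head(visual_feats)      # score every patch against every vocabulary token
    return F.softmax(logits, dim=-1)    # the resulting distributions are the "visual words"
```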

<div align="center"> <img src="assets/model_structure.png" width="800"/> </div><br/>

To verify whether the visual words learnt by VW-LMM faithfully reflect the image content, we take VW-LMM-Vicuna-7B as an example. For each patch in the image, we select the token with the highest probability in its corresponding visual words and compare the region of interest in the image with its visualization. The result is shown below <strong>(best viewed zoomed in)</strong>:
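A rough sketch of how such a visualization can be produced (illustrative code, not the repository's API): take the per-patch argmax over the visual-word distribution and decode it with the tokenizer. The grid size of 24 assumes a 336px CLIP ViT-L/14 encoder (24 × 24 = 576 patches), which is an assumption here.

```python
def top_visual_tokens(visual_word_probs, tokenizer, grid_size=24):
    # visual_word_probs: [num_patches, vocab_size]
    top_ids = visual_word_probs.argmax(dim=-1)                    # most probable token per patch
    tokens = tokenizer.convert_ids_to_tokens(top_ids.tolist())    # map ids back to vocabulary strings
    # arrange tokens in the patch grid so each entry can be overlaid on its image region
    return [tokens[row * grid_size:(row + 1) * grid_size] for row in range(grid_size)]
```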

<div align="center"> <img src="assets/visualization_vp.png"> </div><br/>

Model Zoo

<table>
<tr> <td>Version</td> <td>Size</td> <td>Support pseudo image features</td> <td>Checkpoint</td> </tr>
<tr> <td>VW-LMM-Vicuna</td> <td>7B</td> <td>False</td> <td>VW-LMM-Vicuna-7b</td> </tr>
<tr> <td>VW-LMM-Mistral</td> <td>7B</td> <td>False</td> <td>VW-LMM-Mistral-7b</td> </tr>
<tr> <td>VW-LMM-Vicuna-pif</td> <td>7B</td> <td>True</td> <td>VW-LMM-Vicuna-pif-7b</td> </tr>
</table>

By constructing visual words to introduce visual supervision, VW-LMM achieves the best performance among models at the same 7B scale and attains vision-language understanding competitive with, or even surpassing, models at 13B and larger scales.

<table> <tr> <td>Methods</td> <td>LLM</td> <td>Res.</td> <td>VQA^v2</td> <td>GQA</td> <td>VisWiz</td> <td>SQA^I</td> <td>VQA^T</td> <td>POPE</td> <td>MMB</td> <td>MMB^CN</td> <td>MM-Vet</td> </tr> <tr> <td colspan=12>*Language Modeling Method*</td> </tr> <tr> <td>IDEFICS-80B</td> <td>LLaMA-65B</td> <td>224</td> <td>60.0</td> <td>45.2</td> <td>36.0</td> <td>--</td> <td>30.9</td> <td>--</td> <td>54.5</td> <td>38.1</td> <td>--</td> </tr> <tr> <td>InstructBLIP</td> <td>Vicuna-13B</td> <td>224</td> <td>--</td> <td>49.5</td> <td>33.4</td> <td>63.1</td> <td>50.7</td> <td>78.9</td> <td>--</td> <td>--</td> <td>25.6</td> </tr> <tr> <td>BLIP-2</td> <td>Vicuna-13B</td> <td>224</td> <td>41.0</td> <td>41.0</td> <td>19.6</td> <td>61.0</td> <td>42.5</td> <td>85.3</td> <td>--</td> <td>--</td> <td>22.4</td> </tr> <tr> <td>LLaVA-v1.5</td> <td>Vicuna-13B</td> <td>336</td> <td>80.0</td> <td>63.3</td> <td>53.6</td> <td>71.6</td> <td>61.3</td> <td>85.9</td> <td>67.7</td> <td>63.6</td> <td>35.4</td> </tr> <tr> <td>InstructBLIP</td> <td>Vicuna-7B</td> <td>224</td> <td>--</td> <td>49.2</td> <td>34.5</td> <td>60.5</td> <td>50.1</td> <td>--</td> <td>36</td> <td>23.7</td> <td>26.2</td> </tr> <tr> <td>IDEFICS-9B</td> <td>LLaMA-7B</td> <td>224</td> <td>50.9</td> <td>38.4</td> <td>35.5</td> <td>--</td> <td>25.9</td> <td>--</td> <td>48.2</td> <td>25.2</td> <td>--</td> </tr> <tr> <td>Qwen-VL</td> <td>Qwen-7B</td> <td>448</td> <td>78.8</td> <td>59.3</td> <td>35.2</td> <td>67.1</td> <td>63.8</td> <td>--</td> <td>38.2</td> <td>7.4</td> <td>--</td> </tr> <tr> <td>Qwen-VL-Chat</td> <td>Qwen-7B</td> <td>448</td> <td>78.2</td> <td>57.5</td> <td>38.9</td> <td>68.2</td> <td>61.5</td> <td>--</td> <td>60.6</td> <td>56.7</td> <td>--</td> </tr> <tr> <td>LLaVA-v1.5</td> <td>Vicuna-7B</td> <td>336</td> <td>78.5</td> <td>62.0</td> <td>50.0</td> <td>66.8</td> <td>58.2</td> <td>85.9</td> <td>64.3</td> <td>58.3</td> <td>30.5</td> </tr> <tr> <td>MoE-LLaVA-2.7B×4-Top2</td> <td>Phi-2-2.7B</td> <td>336</td> <td>77.6</td> <td>61.4</td> <td>43.9</td> <td>68.5</td> <td>51.4</td> <td>86.3</td> <td>65.2</td> <td>--</td> <td>34.3</td> </tr> <tr> <td colspan=12>*Multi-modal Modeling Method*</td> </tr> <tr> <td>Emu2-Chat</td> <td>LLaMA-33B</td> <td>448</td> <td>84.9</td> <td>65.1</td> <td>54.9</td> <td>65.5</td> <td>66.6</td> <td>--</td> <td>--</td> <td>--</td> <td>48.5</td> </tr> <tr> <td>Emu-I</td> <td>LLaMA-13B</td> <td>224</td> <td>62.0</td> <td>46.0</td> <td>38.3</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>36.3</td> </tr> <tr> <td>MM-Interleaved-SFT</td> <td>Vicuna-13B</td> <td>224</td> <td>80.2</td> <td>60.5</td> <td>54.9</td> <td>--</td> <td>61.0</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> </tr> <tr> <td>Unified-IO 2</td> <td>UIO-2-6.8B</td> <td>384</td> <td>79.4</td> <td>--</td> <td>--</td> <td>86.2</td> <td>--</td> <td>87.7</td> <td>71.5</td> <td>--</td> <td>--</td> </tr> <tr> <td>DreamLLM</td> <td>Vicuna-7B</td> <td>224</td> <td>56.6</td> <td>--</td> <td>38.1</td> <td>--</td> <td>34.9</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> </tr> <tr> <td>VL-GPT-I</td> <td>LLaMA-7B</td> <td>224</td> <td>67.2</td> <td>51.5</td> <td>38.9</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> </tr> <tr> <td>LaVIT-v2</td> <td>LLaMA2-7B</td> <td>224</td> <td>68.3</td> <td>47.9</td> <td>41.0</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> <td>--</td> </tr> <tr> <td>VW-LMM</td> <td>Vicuna-7B</td> <td>336</td> <td>78.9</td> <td>62.7</td> <td>48.3</td> <td>68.1</td> <td>57.6</td> 
<td>85.9</td> <td>65.9</td> <td>59.8</td> <td>31.3</td> </tr> <tr> <td>VW-LMM</td> <td>Mistral-7B</td> <td>336</td> <td>80.8</td> <td>65.4</td> <td>58.5</td> <td>75.9</td> <td>63.1</td> <td>87.0</td> <td>80.6</td> <td>79.0</td> <td>44.0</td> </tr> </table>

Setup

Requirements

git clone https://github.com/pengts/VW-LMM.git
cd VW-LMM
pip install -r requirements.txt

Multi-modal Inference

Model Configurations

model_path="VW-LMM-Vicuna"
conv_mode="vicuna_v1"
model_base="llama"
device = "cuda"
model_path="VW-LMM-Mistral"
conv_mode="mistral"
model_base="mistral"
device = "cuda"

VW-LMM-Vicuna-pif

model_path="VW-LMM-Vicuna-pif"
conv_mode="vicuna_v1"
model_base="llama"
device = "cuda"

Model Initialization

# Skip PyTorch's default weight initialization to speed up model creation
disable_torch_init()
model_path = os.path.expanduser(model_path)
model_name = get_model_name_from_path(model_path)
# Load the tokenizer, model weights and the vision tower's image processor
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, model_name, model_base, device=device)
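The helper functions used throughout this walkthrough ship with the repository's own package. As a rough guide, the imports might look like the following; the module paths are an assumption based on a LLaVA-style layout, so check the repository for the exact locations.

```python
# Module paths below follow a LLaVA-style package layout and are an assumption;
# check the VW-LMM repository for the exact import locations.
import os
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
```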

Input Processing

question="Write an exhaustive depiction of the given image."
image_path="./example.jpg"
qs = question
qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

image = Image.open(image_path).convert('RGB')
image_tensor = process_images([image], image_processor, model.config)[0].unsqueeze(0).to(device)
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(device)

Inference

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor.to(dtype=torch.float16, device=device, non_blocking=True),
        do_sample= False,
        temperature=0,
        top_p=None,
        num_beams=1,
        max_new_tokens=128,
        use_cache=True)

# generate() returns the prompt tokens followed by the newly generated tokens;
# sanity-check the prompt echo, then strip it before decoding
input_token_len = input_ids.shape[1]
n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
if n_diff_input_output > 0:
    print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
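For repeated use, the steps above can be bundled into one helper. This is a convenience sketch that assumes the same imports and an already-loaded model, tokenizer and image_processor:

```python
def describe_image(image_path, question, model, tokenizer, image_processor,
                   conv_mode="vicuna_v1", device="cuda", max_new_tokens=128):
    """Convenience wrapper: prompt building, image preprocessing and greedy decoding."""
    qs = DEFAULT_IMAGE_TOKEN + '\n' + question
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    image = Image.open(image_path).convert('RGB')
    image_tensor = process_images([image], image_processor, model.config)[0].unsqueeze(0).to(device)
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX,
                                      return_tensors='pt').unsqueeze(0).to(device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor.to(dtype=torch.float16, device=device, non_blocking=True),
            do_sample=False,
            num_beams=1,
            max_new_tokens=max_new_tokens,
            use_cache=True)

    # strip the echoed prompt before decoding the generated answer
    return tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:],
                                  skip_special_tokens=True)[0].strip()

# e.g. print(describe_image("./example.jpg", "Write an exhaustive depiction of the given image.",
#                           model, tokenizer, image_processor, conv_mode=conv_mode, device=device))
```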

Acknowledgement

We are grateful to the following awesome projects, which we built upon when implementing VW-LMM:

<a name="Citing"></a>Citation

If VW-LMM helps your research, please consider giving this repository a star and citing VW-LMM in your publications.

@misc{peng2024multimodal,
      title={Multi-modal Auto-regressive Modeling via Visual Words}, 
      author={Tianshuo Peng and Zuchao Li and Lefei Zhang and Hai Zhao and Ping Wang and Bo Du},
      year={2024},
      eprint={2403.07720},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}