# Awaker
Awaker is a series of multimodal large models developed by Metabrain AGI, including the multimodal large language model (MLLM) Awaker-VL, the multimodal retrieval model Awaker-Sou, and the video generation model Awaker-Gen.
## News
- 2024.11.19: We have released our paper: Awaker2.5-VL.
- 2024.11.17: We have released the Awaker2.5-VL model. We scale a base MLLM (such as Qwen2-VL-7B) with a mixture of experts in a stable and parameter-efficient way, which yields new state-of-the-art results on MME-RealWorld and MMBench among efficient MLLMs (fewer than 30B parameters); an illustrative sketch of the idea follows this list. The model weights and inference code of Awaker2.5-VL are now available, and stronger open-source Awaker-VL models are coming soon.
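For intuition, the sketch below shows one way such parameter-efficient MoE scaling can look: a frozen linear projection is augmented with several LoRA experts and a learned gate that routes each input to its top-k experts. This is only an illustrative sketch with our own naming (`MoELoRALinear`, `gate`, and the default hyperparameters are ours), not the released implementation, which lives in the modified `peft` package shipped in this repository.

```python
# Illustrative sketch of gated mixture-of-LoRA-experts scaling (not the released code).
import torch
import torch.nn as nn


class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 4, r: int = 256,
                 alpha: int = 512, topk: int = 1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep the pretrained projection frozen
            p.requires_grad = False
        self.scaling = alpha / r
        self.topk = topk
        self.lora_A = nn.ModuleList(nn.Linear(base.in_features, r, bias=False)
                                    for _ in range(num_experts))
        self.lora_B = nn.ModuleList(nn.Linear(r, base.out_features, bias=False)
                                    for _ in range(num_experts))
        self.gate = nn.Linear(base.in_features, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        probs = torch.softmax(self.gate(x), dim=-1)    # (..., num_experts)
        top_w, top_i = probs.topk(self.topk, dim=-1)   # weights/indices of the chosen experts
        for k in range(self.topk):
            for e, (A, B) in enumerate(zip(self.lora_A, self.lora_B)):
                mask = (top_i[..., k] == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * top_w[..., k:k + 1] * B(A(x)) * self.scaling
        return out


# Tiny smoke test with made-up dimensions.
layer = MoELoRALinear(nn.Linear(1024, 1024))
print(layer(torch.randn(2, 8, 1024)).shape)  # torch.Size([2, 8, 1024])
```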
## Performance
### MME-RealWorld-CN Benchmark
Models | Parameters | Institutions | Overall | Perception | Reasoning |
---|---|---|---|---|---|
Awaker2.5-VL (ours) | 10.8B | Metabrain AGI | 62.7 | 67.71 | 52.07 |
Qwen2-VL | 8B | Alibaba | 55.5 | 59.80 | 46.46 |
InternVL-2 | 7B | Shanghai AI Lab | 54.3 | 57.97 | 46.65 |
InternVL-Chat-V1.5 | 20B | Shanghai AI Lab | 47.9 | 49.90 | 43.74 |
Claude 3.5 Sonnet | - | Anthropic | 47.0 | 48.25 | 44.31 |
YI-VL-34B | 34B | 01.AI | 42.0 | 42.45 | 41.16 |
CogVLM2-llama3-Chat | 8B | THU & Zhipu AI | 39.8 | 38.57 | 42.25 |
GPT-4o | - | OpenAI | 38.8 | 43.44 | 29.05 |
Mini-Gemini-34B-HD | 34B | CUHK | 38.5 | 38.31 | 38.75 |
Cambrian-1-8B | 8B | NYU | 33.6 | 32.44 | 35.97 |
LLaVA-NeXT-Qwen-72B | 72B | Bytedance | 30.6 | 30.02 | 31.67 |
Gemini-1.5-Pro | - | Google | 28.1 | 36.10 | 11.14 |
DeepSeek-VL | 7B | DeepSeek-AI | 27.6 | 27.63 | 27.63 |
GPT-4o-mini | - | OpenAI | 25.9 | 26.32 | 25.16 |
### MME-RealWorld Benchmark
Models | Parameters | Institutions | Overall | Perception | Reasoning |
---|---|---|---|---|---|
Awaker2.5-VL (ours) | 10.8B | Metabrain AGI | 60.8 | 63.14 | 43.74 |
LLaVA-OneVision | 8B | Bytedance | 57.4 | 59.59 | 41.17 |
Qwen2-VL | 8B | Alibaba | 56.5 | 58.96 | 40.39 |
InternVL-2 | 7B | Shanghai AI Lab | 53.5 | 55.82 | 38.74 |
Claude 3.5 Sonnet | - | Anthropic | 51.6 | 52.90 | 44.12 |
InternVL-Chat-V1.5 | 20B | Shanghai AI Lab | 49.4 | 51.36 | 36.48 |
Mini-Gemini-34B-HD | 34B | CUHK | 45.9 | 48.05 | 31.73 |
GPT-4o | - | OpenAI | 45.2 | 46.43 | 37.61 |
CogVLM2-llama3-Chat | 8B | THU & Zhipu AI | 44.6 | 45.84 | 37.25 |
Cambrian-1-8B | 8B | NYU | 42.7 | 43.82 | 36.16 |
Gemini-1.5-Pro | - | Google | 38.2 | 39.63 | 29.19 |
GPT-4o-mini | - | OpenAI | 36.4 | 37.12 | 32.48 |
DeepSeek-VL | 7B | DeepSeek-AI | 32.4 | 33.14 | 27.98 |
YI-VL-34B | 34B | 01.AI | 31.0 | 30.97 | 32.45 |
LLaVA-NeXT-Qwen-72B | 72B | Bytedance | 28.7 | 29.01 | 27.86 |
### MMBench-CN Benchmark
Models | Parameters | Institutions | Overall | MMBench_v1.1 | MMBench |
---|---|---|---|---|---|
Qwen2-VL-72B | 73.4B | Alibaba | 86.3 | 85.8 | 86.7 |
InternVL2-40B | 40B | Shanghai AI Lab | 85.7 | 84.9 | 86.4 |
InternVL2-Llama-76B | 76B | Shanghai AI Lab | 85.5 | 85.5 | - |
Taiyi | - | Megvii | 85.2 | 85.0 | 85.4 |
JT-VL-Chat-V3.0 | - | China Mobile | 84.7 | 83.5 | 85.8 |
LLaVA-OneVision-72B | 73B | ByteDance | 84.6 | 83.9 | 85.3 |
Step-1.5V | - | StepFun | 84.0 | 83.5 | 84.5 |
Claude3.5-Sonnet-20241022 | - | Anthropic | 83.0 | 82.5 | 83.5 |
Awaker2.5-VL (ours) | 10.8B | Metabrain AGI | 82.6 | 81.8 | 83.4 |
GPT-4o (0513, detail-low) | - | OpenAI | 82.3 | 82.5 | 82.1 |
LLaVA-OneVision-7B | 8B | ByteDance | 81.8 | 80.9 | 82.7 |
GPT-4o (0513, detail-high) | - | OpenAI | 81.8 | 81.5 | 82.1 |
InternVL2-26B | 26B | Shanghai AI Lab | 81.5 | 80.9 | 82.1 |
CongROng | - | CloudWalk | 81.2 | 80.4 | 81.9 |
MMAlaya2 | 26B | DataCanvas | 80.9 | 79.7 | 82.1 |
Ovis1.6-Gemma2-9B | 10.2B | Alibaba | 80.8 | 79.5 | 82.0 |
Qwen2-VL-7B | 8B | Alibaba | 80.5 | 80.3 | 80.6 |
LLaVA-OneVision-72B (SI) | 73B | ByteDance | 80.0 | 81.9 | 78.0 |
InternVL-Chat-V1.5 | 26B | Shanghai AI Lab | 79.9 | 79.1 | 80.7 |
InternLM-XComposer2.5 | 8B | Shanghai AI Lab | 79.9 | 78.8 | 80.9 |
GPT-4o (0806, detail-high) | - | OpenAI | 79.8 | 79.2 | 80.3 |
GPT-4V (0409, detail-high) | - | OpenAI | 79.2 | 78.2 | 80.2 |
### MMBench Benchmark
Models | Parameters | Institutions | Overall | MMBench_v1.1 | MMBench |
---|---|---|---|---|---|
Qwen2-VL-72B | 73.4B | Alibaba | 86.5 | 86.1 | 86.9 |
InternVL2-40B | 40B | Shanghai AI Lab | 86.0 | 85.1 | 86.8 |
Taiyi | - | Megvii | 85.7 | 84.7 | 86.7 |
InternVL2-Llama-76B | 76B | Shanghai AI Lab | 85.5 | 85.5 | - |
LLaVA-OneVision-72B | 73B | ByteDance | 85.4 | 85.0 | 85.8 |
JT-VL-Chat-V3.0 | - | China Mobile | 84.5 | 83.6 | 85.4 |
Awaker2.5-VL (ours) | 10.8B | Metabrain AGI | 83.7 | 82.5 | 84.9 |
GPT-4o (0513, detail-high) | - | OpenAI | 83.2 | 83.0 | 83.4 |
GPT-4o (0513, detail-low) | - | OpenAI | 83.2 | 83.1 | 83.3 |
Step-1.5V | - | StepFun | 82.9 | 80.4 | 85.3 |
InternVL2-26B | 26B | Shanghai AI Lab | 82.5 | 81.5 | 83.4 |
Ovis1.6-Gemma2-9B | 10.2B | Alibaba | 82.5 | 81.5 | 83.4 |
RBDash-v1.2-72B | 79B | DLUT | 82.5 | 81.7 | 83.2 |
Qwen2-VL-7B | 8B | Alibaba | 82.4 | 81.8 | 83.0 |
LLaVA-OneVision-7B | 8B | ByteDance | 82.1 | 80.9 | 83.2 |
GPT-4o (0806, detail-high) | - | OpenAI | 82.0 | 81.8 | 82.1 |
LLaVA-OneVision-72B (SI) | 73B | ByteDance | 81.9 | 83.3 | 80.5 |
Qwen-VL-Plus-0809 | - | Alibaba | 81.9 | 81.1 | 82.7 |
CongROng | - | CloudWalk | 81.9 | 80.9 | 82.8 |
Claude3.5-Sonnet-20241022 | - | Anthropic | 81.8 | 80.9 | 82.6 |
MMAlaya2 | 26B | DataCanvas | 81.6 | 80.6 | 82.5 |
InternVL-Chat-V1.5 | 26B | Shanghai AI Lab | 81.3 | 80.3 | 82.3 |
InternLM-XComposer2.5 | 8B | Shanghai AI Lab | 81.1 | 80.1 | 82.0 |
GPT-4V (0409, detail-high) | - | OpenAI | 80.5 | 80.0 | 81.0 |
## Environment Requirements
- Clone this repository and navigate to the `Awaker` folder.
```bash
git clone https://github.com/MetabrainAGI/Awaker.git
cd Awaker/Awaker2.5-VL
```
- Install the required packages.
```bash
# Install the specific transformers provided in this repository
cd transformers
pip install -e .
cd ..
# Install the specific peft (the patched peft is copied over the installed package)
pip install peft==0.6.0
cp -r peft /path/to/envs/site-packages/
# Install qwen-vl-utils
pip install "qwen-vl-utils[decord]"
```
- Required torch version:
```
torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0
```
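Optionally, run a quick sanity check that the key packages import and CUDA is visible (a minimal snippet, assuming the installation steps above completed without errors):

```python
# Print package versions and CUDA availability to verify the environment.
import torch
import torchvision
import transformers
import peft

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("CUDA available:", torch.cuda.is_available())
```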
## Quickstart
You need to download the model weights of Awaker2.5-VL (the `pytorch_model.bin` file) from MetabrainAGI/Awaker2.5-VL.
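If the weights are hosted on the Hugging Face Hub under that repo id (an assumption on our part; adjust accordingly if you download the file by hand), they can be fetched programmatically:

```python
# Assumes the checkpoint is published on the Hugging Face Hub as MetabrainAGI/Awaker2.5-VL;
# otherwise download pytorch_model.bin manually and note its local path.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="MetabrainAGI/Awaker2.5-VL",
    filename="pytorch_model.bin",
)
print(ckpt_path)  # pass this path to torch.load in the snippet below
```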
Here we present a code snippet to show how to use the chat model:
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from peft import MoeConfig, get_peft_model
def find_n_position(target_list, target_value, n):
    # Return the index of the n-th occurrence of target_value in target_list, or -1 if absent.
    count = 0
    for i, element in enumerate(target_list):
        if element == target_value:
            count += 1
            if count == n:
                return i
    return -1
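# Example (illustrative): find_n_position([1, 2, 1, 2], 2, 2) returns 3, the index of the
# second occurrence of 2, and -1 when there are fewer than n occurrences. It is used below
# to locate the user-prompt span inside the tokenized input.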
# Load the base Qwen2-VL model
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
# Load the Awaker2.5-VL model
target_modules_for_lora = ["q_proj", "k_proj", "v_proj"]
target_modules_for_moe = ["o_proj", "gate_proj", "up_proj", "down_proj"]
num_experts = 4
g_enable = True
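# In the setup below, the q/k/v projections get a single shared LoRA adapter, while o_proj
# and the MLP projections get num_experts MoE LoRA adapters (plus an extra "g" adapter
# when g_enable is True).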
lora_config = MoeConfig(
    r=256,
    lora_alpha=512,
    target_modules=target_modules_for_lora,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    modules_to_save=None,
)
moe_config = MoeConfig(
    r=256,
    lora_alpha=512,
    target_modules=target_modules_for_moe,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    modules_to_save=None,
    multiple_loras=True,
    g_enable=g_enable,
    noise_std=0.1,
    gates_tmp=1.0,
    topk=1,
    num_experts=num_experts,
    loss_coef=0,
    token=False,
    freeze_gate=True,
)
model = get_peft_model(model, lora_config, adapter_name='default')
for i in range(num_experts):
    model.add_adapter(str(i), moe_config)
if g_enable:
    model.add_adapter("g", moe_config)
# Load the weights of Awaker2.5-VL
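# Note: strict=True below requires the adapter stack built above (num_experts, g_enable,
# target_modules) to match the checkpoint keys exactly; revisit those settings if loading fails.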
ckpt = torch.load("/path/to/Awaker2.5-VL/pytorch_model.bin")
model.load_state_dict(ckpt, strict=True)
model.to("cuda")
model.eval()
# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
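# The ids below are Qwen2-VL special tokens: 151652/151653 are <|vision_start|>/<|vision_end|>
# and 151644/151645 are <|im_start|>/<|im_end|>. prompt_pos records the start and end positions
# of the user prompt in input_ids, which (as we understand it) the MoE gate uses to route the
# request to an expert.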
vision_start_id = 151652
vision_end_id = 151653
im_start_id = 151644
im_end_id = 151645
prompt_pos = [[0,0]]
input_ids = inputs["input_ids"][0].tolist()
if image_inputs:
    start_pos = input_ids.index(vision_start_id)
else:
    start_pos = find_n_position(input_ids, im_start_id, 2) + 2
end_pos = find_n_position(input_ids, im_end_id, 2)
assert end_pos != -1, "end_pos error!"
assert start_pos != -1, "start_pos error!"
prompt_pos[0][0] = start_pos
prompt_pos[0][1] = end_pos
inputs["prompt_pos"] = torch.tensor(prompt_pos)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
## Citation
If you find our work helpful for your research, please consider citing the following BibTeX entry.
```bibtex
@article{awaker2.5-vl,
  title   = {{Awaker2.5-VL}: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts},
  author  = {Jinqiang Long and Yanqi Dai and Guoxing Yang and Hongpeng Lin and Nanyi Fei and Yizhao Gao and Zhiwu Lu},
  journal = {arXiv preprint arXiv:2411.10669},
  year    = {2024}
}
```