Bunny: A family of lightweight multimodal models

<p align="center"> <img src="./icon.png" alt="Logo" width="350"> </p>

📖 Technical report | 🤗 Data | 🤖 Data | 🤗 HFSpace | 🐰 Demo

Bunny-Llama-3-8B-V: 🤗 v1.1 | 🤗 v1.0 | 🤗 v1.0-GGUF

Bunny-4B: 🤗 v1.1 | 🤗 v1.0 | 🤗 v1.0-GGUF

Bunny is a family of lightweight but powerful multimodal models. It offers multiple plug-and-play vision encoders, such as EVA-CLIP and SigLIP, and language backbones, including Llama-3-8B, Phi-3-mini, Phi-1.5, StableLM-2, Qwen1.5, MiniCPM, and Phi-2. To compensate for the smaller model size, we construct more informative training data through curated selection from a broader data source.

We are thrilled to introduce Bunny-Llama-3-8B-V, the pioneering vision-language model based on Llama-3, showcasing exceptional performance. The v1.1 version accepts high-resolution images up to 1152x1152.

<img src="comparison_8B.png" alt="comparison_8B"/>

Moreover, our Bunny-4B model, built upon SigLIP and Phi-3-mini, outperforms state-of-the-art MLLMs, not only those of similar size but also larger ones (7B and 13B). The v1.1 version also accepts high-resolution images up to 1152x1152.

<details> <summary>Expand to see the performance of Bunny-4B</summary> <IMG src="comparison_4B.png"/> </details>

News and Updates

Quickstart

HuggingFace transformers

Here is a code snippet showing how to use Bunny-v1.1-Llama-3-8B-V, Bunny-v1.1-4B, Bunny-v1.0-3B, and other models with HuggingFace transformers.

This snippet only works for the models above, because we manually merge some configuration code into a single file for users' convenience. For example, you can compare modeling_bunny_llama.py and configuration_bunny_llama.py with the corresponding parts of the Bunny source code to see the differences. For other models, including models you train yourself, we recommend loading them after installing the Bunny source code. Alternatively, you can copy files such as modeling_bunny_llama.py and configuration_bunny_llama.py into your model directory and modify auto_map in config.json, but we cannot guarantee correctness and you may need to adapt some code to fit your model.

Before running the snippet, you need to install the following dependencies:

```shell
pip install torch transformers accelerate pillow
```

If you have enough CUDA memory, it is faster to run this snippet on a single GPU by setting CUDA_VISIBLE_DEVICES=0.

Users, especially those in mainland China, may want to use a HuggingFace mirror site.
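For example, huggingface_hub respects the HF_ENDPOINT environment variable, so you can point downloads at a mirror before importing transformers. The mirror URL below is only an illustrative assumption; substitute whichever mirror you prefer.

```python
import os

# Assumption: hf-mirror.com is one commonly used community mirror; replace it with your preferred one.
# HF_ENDPOINT must be set before transformers / huggingface_hub are imported.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

import transformers  # imported after setting the endpoint on purpose
```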

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
device = 'cuda'  # or cpu
torch.set_default_device(device)

model_name = 'BAAI/Bunny-v1_1-Llama-3-8B-V' # or 'BAAI/Bunny-Llama-3-8B-V' or 'BAAI/Bunny-v1_1-4B' or 'BAAI/Bunny-v1_0-4B' or 'BAAI/Bunny-v1_0-3B' or 'BAAI/Bunny-v1_0-3B-zh' or 'BAAI/Bunny-v1_0-2B-zh'
offset_bos = 1 # for Bunny-v1_1-Llama-3-8B-V, Bunny-Llama-3-8B-V, Bunny-v1_1-4B, Bunny-v1_0-4B and Bunny-v1_0-3B-zh
# offset_bos = 0 for Bunny-v1_0-3B and Bunny-v1_0-2B-zh

# create model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16, # float32 for cpu
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'Why is the image funny?'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
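# -200 is the image placeholder token id; the model splices the image features in at this position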
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)

# image, sample images can be found in https://huggingface.co/BAAI/Bunny-v1_1-Llama-3-8B-V/tree/main/images
image = Image.open('example_2.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True,
    repetition_penalty=1.0 # increase this to avoid chattering
)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```

ModelScope

We advise users, especially those in mainland China, to use ModelScope. snapshot_download can help resolve issues with downloading checkpoints.

<details> <summary>Expand to see the snippet</summary>

Before running the snippet, you need to install the following dependencies:

```shell
pip install torch modelscope transformers accelerate pillow
```

If you have enough CUDA memory, it is faster to run this snippet on a single GPU by setting CUDA_VISIBLE_DEVICES=0.

```python
import torch
import transformers
from modelscope import AutoTokenizer, AutoModelForCausalLM
from modelscope.hub.snapshot_download import snapshot_download
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
device = 'cuda'  # or cpu
torch.set_default_device(device)

model_name = 'BAAI/Bunny-Llama-3-8B-V' # or 'BAAI/Bunny-v1.0-3B' or 'BAAI/Bunny-v1.0-3B-zh' or 'BAAI/Bunny-v1.0-2B-zh'
offset_bos = 1 # for Bunny-Llama-3-8B-V and Bunny-v1.0-3B-zh
# offset_bos = 0 for Bunny-v1.0-3B and Bunny-v1.0-2B-zh

# create model
snapshot_download(model_id='thomas/siglip-so400m-patch14-384')
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16, # float32 for cpu
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'Why is the image funny?'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)

# image, sample images can be found in images folder on https://www.modelscope.cn/models/BAAI/Bunny-Llama-3-8B-V/files
image = Image.open('example_2.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True,
    repetition_penalty=1.0 # increase this to avoid chattering
)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
</details>

Model Zoo

Evaluation

| Checkpoint | MME$^\text{P}$ | MME$^\text{C}$ | MMB$^{\text{T}/\text{D}}$ | MMB-CN$^{\text{T}/\text{D}}$ | SEED(-IMG) | MMMU$^{\text{V}/\text{T}}$ | VQA$^\text{v2}$ | GQA | SQA$^\text{I}$ | POPE |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| bunny-phi-1.5-eva-lora | 1213.7 | 278.9 | 60.9/56.8 | - | 56.4/64.1 | 30.0/28.4 | 76.5 | 60.4 | 58.2 | 86.1 |
| bunny-stablelm-2-eva-lora | 1301.0 | 235.0 | 58.4/56.4 | - | 55.3/62.8 | 29.8/29.4 | 74.6 | 56.7 | 60.0 | 84.8 |
| bunny-phi-2-eva-lora | 1421.0 | 285.4 | 68.6/67.4 | - | 62.2/70.2 | 35.9/32.6 | 78.9 | 62.3 | 69.1 | 87.1 |
| bunny-phi-1.5-siglip-lora | 1230.0 | 237.5 | 61.2/59.7 | - | 57.7/65.3 | 30.0/29.1 | 78.0 | 61.1 | 61.3 | 85.8 |
| bunny-stablelm-2-siglip-lora | 1366.8 | 236.1 | 65.1/62.8 | - | 58.8/67.5 | 29.9/29.8 | 78.9 | 60.9 | 61.1 | 85.9 |
| Bunny-v1.0-2B-zh/bunny-qwen1.5-1.8b-siglip | 1300.8 | 254.3 | 59.8/59.1 | 59.5/58.5 | 55.4/62.3 | 34.4/30.4 | 76.6 | 59.6 | 64.6 | 85.8 |
| Bunny-v1.0-3B-zh/bunny-minicpm-siglip | 1410.4 | 281.4 | 66.1/65.5 | 64.9/63.6 | 59.6/67.3 | 35.4/32.4 | 78.6 | 60.8 | 68.7 | 86.5 |
| Bunny-v1.0-3B/bunny-phi-2-siglip | 1488.8 | 289.3 | 69.2/68.6 | - | 62.5/70.7 | 38.2/33.0 | 79.8 | 62.5 | 70.9 | 86.8 |
| Bunny-v1.0-4B | 1495.2 | 338.9 | 74.0/73.5 | - | 64.5/72.1 | 40.1/39.1 | 81.5 | 63.5 | 75.2 | 86.7 |
| Bunny-v1.1-4B | 1581.5 | 361.1 | 75.7/74.2 | 66.5/64.5 | 64.9/72.5 | 41.4/38.4 | 82.1 | 63.2 | 78.3 | 87.2 |
| Bunny-Llama-3-8B-V | 1588.9 | 321.1 | 77.2/76.7 | 73.8/72.3 | 65.9/73.3 | 42.8/39.0 | 82.6 | 64.8 | 80.4 | 86.9 |
| Bunny-v1.1-Llama-3-8B-V | 1644.1 | 367.5 | 78.1/77.2 | 74.3/74.8 | 66.2/73.5 | 43.3/39.0 | 82.9 | 64.0 | 79.9 | 87.2 |

The best-performing small model is denoted Bunny-v1.0-3B or bunny-phi-2-siglip; its merged weights can be found here and its LoRA weights here.

We also provide two models that focus on Chinese QA ability, namely Bunny-v1.0-3B-zh (bunny-minicpm-siglip) and Bunny-v1.0-2B-zh (bunny-qwen1.5-1.8b-siglip). The merged weights can be found here and here. The LoRA weights can be found here and here.

Training Tutorial

| Checkpoint | Vision Encoder | LLM | Pretrain weights | Training Tutorial |
| :-- | :-- | :-- | :-- | :-: |
| bunny-phi-1.5-eva-lora | EVA02_CLIP_L_336_psz14_s6B | microsoft/phi-1_5 | bunny-pretrain-phi-1.5-eva | link |
| bunny-stablelm-2-eva-lora | EVA02_CLIP_L_336_psz14_s6B | stabilityai/stablelm-2-1_6b | bunny-pretrain-stablelm-2-eva | link |
| bunny-phi-2-eva-lora | EVA02_CLIP_L_336_psz14_s6B | microsoft/phi-2 | bunny-pretrain-phi-2-eva | link |
| bunny-phi-1.5-siglip-lora | siglip-so400m-patch14-384 | microsoft/phi-1_5 | bunny-pretrain-phi-1.5-siglip | link |
| bunny-stablelm-2-siglip-lora | siglip-so400m-patch14-384 | stabilityai/stablelm-2-1_6b | bunny-pretrain-stablelm-2-siglip | link |
| bunny-qwen1.5-1.8b-siglip-lora | siglip-so400m-patch14-384 | Qwen/Qwen1.5-1.8B | bunny-pretrain-qwen1.5-1.8b-siglip | link |
| bunny-minicpm-siglip-lora | siglip-so400m-patch14-384 | openbmb/MiniCPM-2B-history (step 280000) | bunny-pretrain-minicpm-siglip | link |
| bunny-phi-2-siglip-lora | siglip-so400m-patch14-384 | microsoft/phi-2 | bunny-pretrain-phi-2-siglip | link |
| Bunny-v1.0-4B | siglip-so400m-patch14-384 | microsoft/Phi-3-mini-4k-instruct | bunny-pretrain-phi-3-siglip | link |
| Bunny-v1.1-4B | siglip-so400m-patch14-384 | microsoft/Phi-3-mini-4k-instruct | bunny-pretrain-phi-3-siglip-s2 | link |
| Bunny-Llama-3-8B-V | siglip-so400m-patch14-384 | meta-llama/Meta-Llama-3-8B-Instruct | bunny-pretrain-llama3-8b-siglip | link |
| Bunny-v1.1-Llama-3-8B-V | siglip-so400m-patch14-384 | meta-llama/Meta-Llama-3-8B-Instruct | bunny-pretrain-llama3-8b-siglip-s2 | link |

Install

Either start from our Docker image or install locally on your own.

Start from Our Docker

Start directly from our configured Docker image with docker pull russellrobin/bunny:latest.
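To start a container from this image with GPU access, a typical invocation is docker run --gpus all -it russellrobin/bunny:latest (assuming the NVIDIA Container Toolkit is installed); adjust the flags to your environment.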

<details> <summary>Expand to see how to keep the code up to date.</summary> Although we maintain this Docker image regularly, the Bunny code inside a container is not guaranteed to be in sync with our GitHub repo. To update it, you may want to:
  1. Run pip install --upgrade transformers && cd Bunny in a running container.

  2. Set your Git identity with git config user.email "you@example.com" && git config user.name "Your Name".

  3. Update the local Bunny code with git pull.

  4. Run pip install -e .

You are all set!

</details>

Local Installation

Training

Bunny training consists of two stages: (1) the pretrain stage, which uses data to connect a frozen pretrained vision encoder to a frozen LLM, with only the connector trained; and (2) the visual instruction tuning stage, which uses data to teach the model to follow multimodal instructions, with the connector, the learnable LLM parameters, and (optionally) the vision encoder updated.

Bunny is trained on 8 A100 GPUs. On other hardware, you can reduce per_device_train_batch_size and increase gradient_accumulation_steps accordingly; always keep the global batch size the same: global_batch_size = per_device_train_batch_size $\times$ gradient_accumulation_steps $\times$ num_gpus.
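For instance, the following sketch uses illustrative numbers (not the released training configuration) to show how the three factors trade off:

```python
# Illustrative numbers only -- not the released training configuration.
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_gpus = 8

global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_batch_size)  # 128

# With only 4 GPUs, doubling gradient_accumulation_steps to 4 keeps the global batch size at 128.
```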

Supported Models

Currently, we support several vision encoders and LLMs.

For vision encoders, we support CLIP, EVA-CLIP and SigLIP.

| Vision Encoders | Download Link |
| :-- | :-- |
| clip-vit-large-patch14-336 | openai/clip-vit-large-patch14-336 |
| EVA02_CLIP_L_336_psz14_s6B | QuanSun/EVA-CLIP |
| siglip-so400m-patch14-384 | google/siglip-so400m-patch14-384 |

For LLMs, we support phi-1.5, stablelm-2, qwen1.5-1.8b, minicpm, phi-2, phi-3 and llama3-8b.

| MODEL_TYPE | LLM | Download Link |
| :-- | :-- | :-- |
| phi-1.5 | phi-1_5 | microsoft/phi-1_5 |
| stablelm-2 | stablelm-2-1_6b | stabilityai/stablelm-2-1_6b |
| qwen1.5-1.8b | Qwen1.5-1.8B | Qwen/Qwen1.5-1.8B |
| minicpm | MiniCPM-2B | openbmb/MiniCPM-2B-history (step 280000) |
| phi-2 | phi-2 | microsoft/phi-2 |
| phi-3 | Phi-3-mini-4k-instruct | microsoft/Phi-3-mini-4k-instruct |
| llama3-8b | Meta-Llama-3-8B-Instruct | meta-llama/Meta-Llama-3-8B-Instruct |

Note that there are many variants of the above models. We build and test our code against the exact versions listed above. More models will be supported in the future!

Pretrain

Visual Instruction Tuning

Continuous Fine-tuning

If you want to continuously fine-tune our released Bunny models on your own data or adapt them to a certain task,

<details> <summary>Expand to see the instructions.</summary>
  1. Prepare data: convert your data to a JSON file containing a list of all samples, in the same format as Bunny-695K.

  2. Prepare model:

    • download the Bunny models; if only LoRA weights are provided, merge them with the base LLM:

      python script/merge_lora_weights.py \
        --model-path /path/to/bunny_lora_weights \
        --model-base /path/to/base_llm_model \
        --model-type phi-2 (or stablelm-2 or phi-1.5 or qwen1.5-1.8b or minicpm or phi-3 or llama3-8b) \
        --save-model-path /path/to/merged_model
      
    • add "continuous_training": true in /path/to/merged_model/config.json to ensure loading the vision tower from merged weights

  3. Edit the script: either finetune_full.sh or finetune_lora.sh can be used. Before running it:

    • change --model_name_or_path to /path/to/merged_model

    • delete --pretrain_mm_mlp_adapter because we load the cross-modality projector from merged weights

    • customize the hyperparameters, e.g. the learning rate, to fit your dataset

    • for MODEL_TYPE = minicpm/phi-3/llama3-8b, also change --version to minicpm/phi3/llama. The S$^2$-Wrapper is enabled if --use_s2 True is added, and the vision encoder is tuned if --unfreeze_vision_tower True is added.

Note that if you continuously fine-tune Bunny models using LoRA, --model-base should point to the Bunny models rather than the original LLMs when loading.
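A minimal sketch of the config.json edit from step 2, assuming the merged model lives at /path/to/merged_model (adjust the path to your setup):

```python
import json

config_path = "/path/to/merged_model/config.json"  # directory produced by the merge step above

with open(config_path) as f:
    config = json.load(f)

# Ensure the vision tower is loaded from the merged weights during continuous fine-tuning.
config["continuous_training"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```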

</details>

Demo

Gradio Web UI

CLI Inference (Without Gradio Interface)

For CLI-based inference without using the Gradio interface, use the following command:

You can also control temperature, repetition-penalty and max-new-tokens.

Evaluation

For full-parameter tuning models, see evaluation_full.md.

For LoRA tuning models, see evaluation_lora.md.

Citation

If you find this repository helpful, please cite the paper below.

```bibtex
@article{he2024bunny,
      title={Efficient Multimodal Learning from Data-centric Perspective}, 
      author={He, Muyang and Liu, Yexin and Wu, Boya and Yuan, Jianhao and Wang, Yueze and Huang, Tiejun and Zhao, Bo},
      journal={arXiv preprint arXiv:2402.11530},
      year={2024}
}
```

License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache license 2.0.

Acknowledgement

We build our project based on LLaVA: Large Language and Vision Assistant.