<p align="center" width="100%"> <img src="assets/Logo.png" width="80%" height="80%"> </p> <!-- # EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders -->

[arXiv] [HuggingFace] [Demo] [Model Zoo] [Data]

Introduction

Eagle is a family of vision-centric, high-resolution multimodal LLMs. This work thoroughly explores how to strengthen multimodal LLM perception with a mixture of vision encoders and different input resolutions. The model uses a channel-concatenation-based "CLIP+X" fusion of vision experts with different architectures (ViT/ConvNets) and knowledge (detection/segmentation/OCR/SSL). The resulting Eagle models support input resolutions of more than 1K and obtain strong results on multimodal LLM benchmarks, especially on resolution-sensitive tasks such as optical character recognition and document understanding.
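
To make the channel-concatenation idea concrete, here is a minimal pure-Python sketch. It is not the Eagle implementation: the function name `channel_concat`, the toy feature values, and the assumption that every encoder output has already been brought to the same number of tokens are ours, for illustration only.

```python
from typing import List

Token = List[float]        # one visual token: a vector of channel values
FeatureMap = List[Token]   # one encoder's output: a sequence of tokens


def channel_concat(feature_maps: List[FeatureMap]) -> FeatureMap:
    """Fuse encoder outputs by concatenating their channels token by token."""
    if not feature_maps:
        raise ValueError("need at least one encoder output")
    num_tokens = len(feature_maps[0])
    # Assumption: all encoder outputs were already aligned to the same token count.
    if any(len(fm) != num_tokens for fm in feature_maps):
        raise ValueError("all encoders must yield the same number of tokens")
    fused: FeatureMap = []
    for i in range(num_tokens):
        token: Token = []
        for fm in feature_maps:
            token.extend(fm[i])  # append this encoder's channels for token i
        fused.append(token)
    return fused


# Toy example: a 2-channel "CLIP" output and a 3-channel "X" expert, 2 tokens each.
clip_features = [[0.1, 0.2], [0.3, 0.4]]
expert_features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fused = channel_concat([clip_features, expert_features])
print(len(fused), len(fused[0]))  # token count stays 2, channels become 2 + 3 = 5
```

The token count is unchanged while the channel dimension grows with each added expert, which is why this style of fusion does not lengthen the visual token sequence passed to the LLM.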

<div align="center"> <img src="assets/fig-teaser.jpg" width="90%"> </div>

Updates

Contents

Models & Performance

Models trained on the LLaVA-1.5 Pre-train and Eagle-SFT-1.8M data are available to download here.

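| Model | LLM | Pretrain | SFT | GQA | MME | MMMU(Val) | OCR | SQA(I) | POPE | TextVQA | InfoVQA | VizWiz | SEED(I) | VQAv2 | MathVista | MMBench | ChartQA | DocVQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Eagle-X4-7B | Vicuna-7B | LLaVA-v1.5 | 1.8M | 64.8 | 1561 | 34.9 | 540 | 70.5 | 88.4 | 70.9 | 47.4 | 50.8 | 73.4 | 83.4 | 37.3 | 67.8 | 67.5 | 78.8 |
| Eagle-X5-7B | Vicuna-7B | LLaVA-v1.5 | 1.8M | 64.9 | 1528 | 36.3 | 529 | 69.8 | 88.8 | 71.2 | 47.4 | 54.4 | 73.9 | 83.4 | 37.0 | 68.4 | 67.8 | 78.6 |
| Eagle-X4-13B | Vicuna-13B | LLaVA-v1.5 | 1.8M | 66.3 | 1627 | 36.9 | 561 | 73.1 | 87.7 | 73.9 | 50.7 | 56.2 | 74.4 | 83.8 | 37.6 | 69.9 | 70.5 | 79.9 |
| Eagle-X5-13B | Vicuna-13B | LLaVA-v1.5 | 1.8M | 66.2 | 1609 | 36.6 | 574 | 72.8 | 87.8 | 74.2 | 51.8 | 59.3 | 74.1 | 83.8 | 38.8 | 69.2 | 69.9 | 79.4 |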

Models trained on the Cambrian-1 data are available to download here.

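Benchmarks are grouped into Knowledge, General, Document, and Vision; each group starts with its average.

| LLM | Model | Avg (Knowledge) | SQA(I) | MMMU(Val) | MathVista | AI2D | Avg (General) | MME | MMB | SEED(I) | GQA | Avg (Document) | ChartQA | OCR | TextVQA | DocVQA | Avg (Vision) | MMVP | RWQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3-8B | Mini-Gemini-HD | 55.7 | 75.1 | 37.3 | 37.0 | 73.5 | 72.7 | 1606 | 72.7 | 73.2 | 64.5 | 62.9 | 59.1 | 47.7 | 70.2 | 74.6 | 40.4 | 18.7 | 62.1 |
| Llama 3-8B | LLaVA-NeXT | 55.6 | 72.8 | 41.7 | 36.3 | 71.6 | 72.5 | 1604 | 72.1 | 72.7 | 65.2 | 63.9 | 69.5 | 49.0 | 64.6 | 72.6 | 49.4 | 38.7 | 60.1 |
| Llama 3-8B | Cambrian-1 | 61.3 | 80.4 | 42.7 | 49.0 | 73.0 | 73.1 | 1547 | 75.9 | 74.7 | 64.6 | 71.3 | 73.3 | 62.4 | 71.7 | 77.8 | 57.6 | 51.3 | 64.2 |
| Llama 3-8B | Eagle-X4-8B-Plus | 64.2 | 84.3 | 43.4 | 52.7 | 76.1 | 73.8 | 1559 | 75.9 | 76.3 | 64.9 | 76.6 | 80.1 | 62.6 | 77.1 | 86.6 | 69.1 | 71.6 | 66.5 |
| Vicuna-13B | Mini-Gemini-HD | 54.1 | 71.9 | 37.3 | 37.0 | 70.1 | 70.7 | 1597 | 68.6 | 70.6 | 63.7 | 60.8 | 56.6 | 46.6 | 70.2 | 69.8 | 38.4 | 19.3 | 57.5 |
| Vicuna-13B | LLaVA-NeXT | 53.7 | 73.5 | 36.2 | 35.1 | 70.0 | 69.9 | 1575 | 70.0 | 65.6 | 65.4 | 62.9 | 62.2 | 51.4 | 67.1 | 70.9 | 47.6 | 36.0 | 59.1 |
| Vicuna-13B | Cambrian-1 | 60.2 | 79.3 | 40.0 | 48.0 | 73.6 | 73.7 | 1610 | 75.7 | 74.4 | 64.3 | 71.3 | 73.8 | 61.9 | 72.8 | 76.8 | 52.2 | 41.3 | 63.0 |
| Vicuna-13B | Eagle-X4-13B-Plus | 63.0 | 82.0 | 41.0 | 54.4 | 74.0 | 74.6 | 1651 | 75.7 | 74.8 | 65.3 | 75.1 | 77.6 | 61.9 | 75.5 | 85.4 | 61.4 | 58.0 | 64.8 |
| Yi-34B | Mini-Gemini-HD | 62.4 | 77.7 | 48.0 | 43.4 | 80.5 | 76.2 | 1659 | 80.6 | 75.3 | 65.8 | 68.1 | 67.6 | 51.8 | 74.1 | 78.9 | 52.3 | 37.3 | 67.2 |
| Yi-34B | LLaVA-NeXT | 62.5 | 81.8 | 46.7 | 46.5 | 74.9 | 76.0 | 1633 | 79.3 | 75.9 | 67.1 | 67.7 | 68.7 | 54.5 | 69.5 | 78.1 | 54.2 | 47.3 | 61.0 |
| Yi-34B | Cambrian-1 | 67.0 | 85.6 | 49.7 | 53.2 | 79.7 | 76.8 | 1689 | 81.4 | 75.3 | 65.8 | 71.9 | 75.6 | 60.0 | 76.7 | 75.5 | 60.3 | 52.7 | 67.8 |
| Yi-34B | Eagle-X5-34B-Plus | 68.6 | 85.5 | 51.8 | 57.9 | 79.1 | 76.3 | 1677 | 81.0 | 75.6 | 64.9 | 75.4 | 77.2 | 62.4 | 78.8 | 83.0 | 68.3 | 67.0 | 69.5 |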

Visual Examples

Knowledge & General VQA

<div align="center"> <img src="assets/visual/VQA1.png" width="80%"> </div><br> <div align="center"> <img src="assets/visual/VQA2.png" width="80%"> </div><br> <div align="center"> <img src="assets/visual/VQA3.png" width="80%"> </div>

Autonomous Driving

<div align="center"> <img src="assets/visual/AV1.png" width="90%"> </div><br> <div align="center"> <img src="assets/visual/AV2.png" width="90%"> </div>

Infographic, Chart, OCR & Document Understanding

<div align="center"> <img src="assets/visual/Doc1.png" width="80%"> </div><br> <div align="center"> <img src="assets/visual/Doc2.png" width="80%"> </div><br> <div align="center"> <img src="assets/visual/Doc3.png" width="80%"> </div>

Install

Please follow the guide below to prepare the environment on Linux.

<!-- currently does not support windows and MacOS -->
  1. Clone this repository

git clone https://github.com/NVlabs/EAGLE.git
cd EAGLE

  2. Create environment and install package

conda create -n eagle python=3.10 -y
conda activate eagle
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
pip install .

  3. Install additional packages for training cases

pip install flash-attn --no-build-isolation

If you have any questions about the environment setup, please follow the instruction video.

Training Data

Pre-training

We use the same pretraining data as LLaVA v1.5. Please download the data from here.

Supervised Fine-tuning

We have compiled all the data and images used in our supervised fine-tuning together. Please download the data from here. After cloning this dataset, please run the following commands to extract all the images:

cd Eagle-1.8M
cat images.tar.part_* > images.tar.gz
tar -xvf images.tar.gz
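
If `cat` and `tar` are not at hand (for example, inside a notebook), the same two steps can be approximated with the Python standard library. This is a sketch under our own assumptions: the helper names `join_parts` and `extract_archive` are hypothetical, and the parts are assumed to sort correctly by file name, as they do with the shell glob above.

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(directory: Path, prefix: str, joined_name: str) -> Path:
    """Concatenate split files in lexicographic order, like `cat prefix*`."""
    parts = sorted(directory.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {directory}")
    joined = directory / joined_name
    with open(joined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream each part without loading it fully
    return joined


def extract_archive(archive: Path, destination: Path) -> None:
    """Extract a tar archive; mode "r:*" lets tarfile detect the compression."""
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(destination)
```

Calling `join_parts(Path("Eagle-1.8M"), "images.tar.part_", "images.tar.gz")` and then `extract_archive` on the returned path with `Path("Eagle-1.8M")` as the destination would mirror the two shell commands.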

Please note that while the images have been packaged for convenience, the original dataset licenses remain unchanged. By downloading our data, you agree to the licensing terms of each source dataset. A detailed list of the data sources used in our fine-tuning data mixture is provided below:

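| Version | Dataset Name | Sample Number | Note |
|---|---|---|---|
| Initial Version | LLaVA v1.5 | 665k | Multi-modal conversation |
| Initial Version | DocVQA | 39k | Document understanding |
| Initial Version | synDog-EN | 50k | OCR |
| Initial Version | ChartQA | 28k | Chart understanding |
| Initial Version | DVQA | 25k | Chart understanding |
| Initial Version | AI2D | 15k | Diagram understanding |
| Initial Version | ShareGPT-4V | 100k | Detailed caption generated by GPT-4V |
| Initial Version | laion-gpt4v * | 11k | Detailed caption generated by GPT-4V |
| Initial Version | LVIS-Instruct4V | 220k | Multi-modal conversation |
| Initial Version | LRV-Instruct | 150k | Multi-modal conversation |
| Initial Version | Geo170k | 120k | Math |
| Initial Version | LLaVAR | 20k | OCR |
| Initial Version | Visual7W | 70k | Visual Question Answering |
| Initial Version | Open-Hermes 2.5 | 300k | Text |
| Initial Version | Total | 1.8M | |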

* We have done manual inspection to ensure that the dataset does not contain any CSAM content.

To pretrain or fine-tune our model on the Cambrian-1 dataset, please prepare the data according to their instructions. Then convert the JSONL file into a single JSON file by running the following Python code:

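import json

source_file = "Cambrian7M_withsystemprompt.jsonl"
dst_file = "Cambrian7M_withsystemprompt.json"

# Read one JSON object per line, skipping blank lines.
annotations = []
with open(source_file, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            annotations.append(json.loads(line))

# Write all annotations as a single JSON array.
with open(dst_file, "w", encoding="utf-8") as f:
    json.dump(annotations, f)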

Checkpoint Preparation

Please provide the pretrained weights for the EVA-02 vision tower, which was pretrained on a detection task. You can download the checkpoint here and place it in the checkpoints/pretrained_models/ directory.

The weights of the other models, including Vicuna, Segment Anything Model, Pix2Struct, ConvNeXt, and CLIP, will be downloaded automatically from Hugging Face during the first run.

Training

The training process for Eagle follows a standard two-stage approach: pretraining and supervised fine-tuning. In the first stage, only the projector's weights are updated. In the second stage, all parameters are fine-tuned. The batch sizes for the pretraining and fine-tuning stages are 256 and 128, respectively. All settings and hyperparameters are identical to those in LLaVA-v1.5, except that we unfreeze the vision tower's parameters during the second stage.
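
The two stages differ only in which parts of the model receive gradient updates. The sketch below summarizes that description; the module labels `vision_tower`, `projector`, and `llm` and the function `trainable_modules` are illustrative names of ours, not identifiers taken from the Eagle code base.

```python
from typing import Dict

# Illustrative module labels (an assumption, not names from the repository).
MODULES = ("vision_tower", "projector", "llm")


def trainable_modules(stage: str) -> Dict[str, bool]:
    """Return which modules are updated in the given training stage."""
    if stage == "pretrain":
        # Stage 1: only the projector's weights are updated.
        return {name: name == "projector" for name in MODULES}
    if stage == "finetune":
        # Stage 2: all parameters are fine-tuned, including the vision tower.
        return {name: True for name in MODULES}
    raise ValueError(f"unknown stage: {stage!r}")


for stage in ("pretrain", "finetune"):
    print(stage, trainable_modules(stage))
```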

By default, we train on 32 NVIDIA A100 80GB GPUs. If you use a different number of GPUs, please adjust per_device_train_batch_size and gradient_accumulation_steps so that the overall batch size stays unchanged.
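
As a rough guide for picking these values, the sketch below assumes plain data parallelism, where the overall batch size equals the number of GPUs times per_device_train_batch_size times gradient_accumulation_steps. The helper name and the per-device batch sizes in the examples are hypothetical; only the overall batch sizes of 256 and 128 come from this README.

```python
def gradient_accumulation_steps(global_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Accumulation steps needed so that
    num_gpus * per_device_batch * steps == global_batch."""
    per_step = num_gpus * per_device_batch
    if per_step <= 0 or global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of "
            f"{num_gpus} GPUs x {per_device_batch} samples per device"
        )
    return global_batch // per_step


# Pretraining (overall batch 256) on 32 GPUs with 8 samples per device.
print(gradient_accumulation_steps(256, 32, 8))  # 1
# The same overall batch on a single 8-GPU node needs accumulation.
print(gradient_accumulation_steps(256, 8, 8))   # 4
# Fine-tuning (overall batch 128) on 8 GPUs with 4 samples per device.
print(gradient_accumulation_steps(128, 8, 4))   # 4
```

Lowering the per-device batch size and raising the accumulation steps by the same factor keeps the overall batch size fixed while reducing peak memory per GPU.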

Pre-training

If you are using a slurm cluster, please use the following command to submit a job.

srun \
    --partition $your_partition \
    --gres "gpu:8" \
    --ntasks-per-node 1 \
    -N 4 \
    --job-name $RUN_NAME \
    bash $CMD $RUN_NAME

You can specify the RUN_NAME and CMD variables to run different models according to the following table:

| Model | Language Model | Script |
|---|---|---|
| Eagle-X4 | Vicuna-7B | scripts/pretrain-eagle-x4-vicuna-7b.sh |
| Eagle-X4 | Vicuna-13B | scripts/pretrain-eagle-x4-vicuna-13b.sh |
| Eagle-X5 | Vicuna-7B | scripts/pretrain-eagle-x5-vicuna-7b.sh |
| Eagle-X5 | Vicuna-13B | scripts/pretrain-eagle-x5-vicuna-13b.sh |

Remember to set $PATH_TO_PRETRAINING_DATA in each script to the location of the downloaded pretraining data. After pretraining completes, you will find a file named mm_projector.bin in the checkpoint folder.

Supervised Fine-tuning

After pretraining is complete, a projector weight file `mm_projector.bin` will be saved in the checkpoint directory. Please set $PATH_TO_PRETRAINED_PROJECTOR to the path of this projector weight file.

You can use the same submit command as for pretraining, with one of the scripts in the following table, to launch supervised fine-tuning.

| Model | Language Model | Script |
|---|---|---|
| Eagle-X4 | Vicuna-7B | scripts/finetune-eagle-x4-vicuna-7b-1.8m.sh |
| Eagle-X4 | Vicuna-13B | scripts/finetune-eagle-x4-vicuna-13b-1.8m.sh |
| Eagle-X5 | Vicuna-7B | scripts/finetune-eagle-x5-vicuna-7b-1.8m.sh |
| Eagle-X5 | Vicuna-13B | scripts/finetune-eagle-x5-vicuna-13b-1.8m.sh |

Before submitting the job, make sure that $PATH_TO_SFT_DATA and $PATH_TO_PRETRAINED_PROJECTOR are set correctly in each script.

Notes

If you have limited GPU resources or memory, please consider the following:

Inference

Our inference code is here. You can set your own 'image_path' here and 'question' here.

import torch
from PIL import Image

from eagle.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from eagle.conversation import conv_templates
from eagle.model.builder import load_pretrained_model
from eagle.mm_utils import tokenizer_image_token, get_model_name_from_path, process_images

model_path = "NVEagle/Eagle-X5-13B-Chat"
conv_mode = "vicuna_v1"
image_path = "assets/georgia-tech.jpeg"
input_prompt = "Describe this image."

model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path,None,model_name,False,False)
if model.config.mm_use_im_start_end:
    input_prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + input_prompt
else:
    input_prompt = DEFAULT_IMAGE_TOKEN + '\n' + input_prompt

conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], input_prompt)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

image = Image.open(image_path).convert('RGB')
image_tensor = process_images([image], image_processor, model.config)[0]
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')

input_ids = input_ids.to(device='cuda', non_blocking=True)
image_tensor = image_tensor.to(dtype=torch.float16, device='cuda', non_blocking=True)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids.unsqueeze(0),
        images=image_tensor.unsqueeze(0),
        image_sizes=[image.size],
        do_sample=True,
        temperature=0.2,
        top_p=0.5,
        num_beams=1,
        max_new_tokens=256,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"Image:{image_path} \nPrompt:{input_prompt} \nOutput:{outputs}")

Evaluation

Evaluation with LMMs-Eval

We evaluate MME, MMBench, SEED, MathVista, POPE, ScienceQA, GQA, OCRBench, TextVQA, and ChartQA using LMMs-Eval. For better reproducibility, we have included the specific version we used in this repository. Please follow their guidelines and use the following commands to perform the evaluation:

bash scripts/eval_lmms_eval/eval-mme-seed-mmmu-pope-sqa-gqa-ocrbench-textvqa-chartqa.sh $REPO_ID_OR_LOCAL_PATH $MODEL_NAME $CONV_MODE
# MODEL_NAME can be any name; it is only used to distinguish different runs.
# CONV_MODE should be the name of the conversation template used during training, i.e., "vicuna_v1" for Vicuna, "llama3" for Llama3, and "yi_34b_chatml_direct" for Yi-34B.
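
When several checkpoints are evaluated in a row, the template mapping from the comment above can be kept in one place. The following is a small sketch; the helper names, the dictionary keys, and the run name are our own, while the script path, the model ID, and the template names are the ones given in this README.

```python
import shlex

# Template names come from the comment above; the lookup keys are our own labels.
CONV_MODE_BY_LLM = {
    "vicuna": "vicuna_v1",
    "llama3": "llama3",
    "yi-34b": "yi_34b_chatml_direct",
}

EVAL_SCRIPT = "scripts/eval_lmms_eval/eval-mme-seed-mmmu-pope-sqa-gqa-ocrbench-textvqa-chartqa.sh"


def conv_mode_for(llm_family: str) -> str:
    """Look up the conversation template for an LLM family (case-insensitive)."""
    key = llm_family.strip().lower()
    if key not in CONV_MODE_BY_LLM:
        raise ValueError(f"no conversation template known for {llm_family!r}")
    return CONV_MODE_BY_LLM[key]


def eval_command(model_path: str, run_name: str, llm_family: str) -> str:
    """Build the shell command line for one evaluation run."""
    args = ["bash", EVAL_SCRIPT, model_path, run_name, conv_mode_for(llm_family)]
    return shlex.join(args)


# "eagle-x5-13b-run1" is an arbitrary run name chosen for this example.
print(eval_command("NVEagle/Eagle-X5-13B-Chat", "eagle-x5-13b-run1", "Vicuna"))
```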

Gradio Demo

We set up an online demo here. You can also run this demo on your own machine by running:

python gradio_demo.py \
    --model-path ${MODEL_CKPT} \
    --conv-mode vicuna_v1

Citation

If you find this project useful, please cite our work:

@article{shi2024eagle,
    title = {Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders}, 
    author={Min Shi and Fuxiao Liu and Shihao Wang and Shijia Liao and Subhashree Radhakrishnan and De-An Huang and Hongxu Yin and Karan Sapra and Yaser Yacoob and Humphrey Shi and Bryan Catanzaro and Andrew Tao and Jan Kautz and Zhiding Yu and Guilin Liu},
    journal={arXiv:2408.15998},
    year={2024}
}

License

Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for Llama-2, Llama-3, and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Acknowledgement