<img src="figures/moai_emoji.png" style="vertical-align: -10px;" :height="50px" width="50px"> MoAI: Mixture of All Intelligence for Large Language and Vision Models [ArXiv]

📰 News

🎨 In-Progress

Official PyTorch implementation of the technical part of Mixture of All Intelligence (MoAI), which aims to improve performance on numerous zero-shot vision language tasks. The code builds on two baseline codebases: XDecoder: Generalized Decoding for Pixel, Image, and Language (accepted at CVPR 2023) and InternLM (Technical Paper). Please note that the current version combines these two implementations.

📖 Citation

@article{lee2024moai,
  title={MoAI: Mixture of All Intelligence for Large Language and Vision Models},
  author={Lee, Byung-Kwan and Park, Beomchan and Kim, Chae Won and Ro, Yong Man},
  journal={arXiv preprint arXiv:2403.07508},
  year={2024}
}

🏝️ Summary

The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (<img src="figures/moai_emoji.png" style="vertical-align: -5px;" :height="20px" width="20px"> MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence, namely (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.

🚀 Highlights

<img src="figures/figure_performance.png" width="730" height="400"> <figcaption> Figure. Comparing the scores and accuracies of numerous VL benchmarks for various open-source and closed-source LLVMs with those for MoAI. </figcaption>
<img src="figures/figure_moai_arch.png" width="855" height="400"> <figcaption> Figure. Overview of the MoAI architecture. The compressed learnable tokens and the parameters of MoAI-Compressor and MoAI-Mixer are learned. 'Vision' represents the vision encoder that embeds visual features, and the ice/fire symbols mark the modules that are frozen or learned, respectively. Note that 'Word Embed' represents the word embedding dictionary of MLM. </figcaption>
<img src="figures/figure_scale.png" width="1097" height="400"> <figcaption> Table. Zero-shot vision language performance. (a) Comparison by model size with the latest larger open-source LLVMs, LLaVA1.6-13B and -34B, and with closed-source LLVMs. (b) Results on POPE and HallusionBench, where 'Adversarial', 'Random', and 'Popular' are metrics in POPE. Note that the dot points for closed-source LLVMs represent their averaged performance. </figcaption>

Download <img src="figures/moai_emoji.png" style="vertical-align: -2px;" :height="20px" width="20px"> MoAI-7B

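| Model | Q-Bench | SQA-IMG | TextVQA | POPE | MME-P | MME-C | MM-Bench | MMB-CN | MM-Vet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InstructBLIP-7B | 56.7 | 49.2 | 60.5 | 50.1 | - | - | 36.0 | 23.7 | 25.6 |
| Qwen-VL-7B | 59.4 | 67.1 | 63.8 | - | - | - | 38.2 | 7.4 | - |
| LLaVA1.5-7B | 58.7 | 66.8 | 58.2 | 85.9 | 1511 | 294 | 64.3 | 58.3 | 30.5 |
| MoAI-7B | 70.2 | 83.5 | 67.8 | 87.1 | 1714 | 561 | 79.3 | 76.5 | 43.7 |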

Interesting Questions for Architecture Choices [source]

📂 Directory Layout

.
├── asset                           # Required package lists (Important)
├── trainer                         # Training MoAI and initializing optimizer (Not Supported Now)
├── utils                           # Miscellaneous util files (Not Important)
├── moai                            # MoAI architecture & loading MoAI (Important)
├── pipeline                        # Evaluating zero-shot vision language tasks (Important)
│
├── datasets                        # Important
│   ├── dataset_mappers             # data parsing including augmentation for loader
│   ├── evaluation                  # measuring evaluation for each dataset
│   └── registration                # register dataset
│
├── configs
│   ├── accel                       # Accelerate config files (supports DeepSpeed, DDP, multinode)
│   └── moai_eval.yaml              # Evaluating MoAI
│
├── modeling                        # Not Important
│   ├── architectures               # training the prototype of MoAI (Not Supported Now)
│   ├── utils                       # utils for modeling (Not Important)
│   └── BaseModel                   # loading and saving model (Important)
│
├── lbk_entry.py                    # main code of control tower (Important)
├── run                             # bash file for running the evaluation (Important)
│
├── install                         # install required packages (Important)
└── README.md

💡 How to Run?

First, run the following lines from the bash file 'install'.

conda create -n moai python=3.9
conda activate moai
conda clean -a && pip cache purge
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r assets/requirements/requirements.txt
pip install -r assets/requirements/requirements_custom.txt
pip install flash-attn --no-build-isolation

In addition, set the following environment variables so that they point to the dataset path.

export DETECTRON2_DATASETS=/path/to/dataset
export DATASET=/path/to/dataset
export DATASET2=/path/to/dataset
export VLDATASET=/path/to/dataset
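
Before launching anything, it can help to confirm that the four variables above are visible to Python. The following is a minimal sketch and not part of the repository; the helper name `missing_dataset_vars` is hypothetical.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("DETECTRON2_DATASETS", "DATASET", "DATASET2", "VLDATASET")


def missing_dataset_vars(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_dataset_vars()
    if missing:
        print("Unset dataset variables:", ", ".join(missing))
    else:
        print("All dataset variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the shell.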

Create a 'checkpoints' directory in moai/sgg and place the downloaded Scene Graph Generation checkpoint there. The checkpoint filename must be 'psgtr_r50_epoch_60.pth'.

Download the checkpoint labeled 'PSGTR' from Panoptic SGG, or download it from my Google Drive.
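
The directory and filename requirement above can be checked with a few lines of Python. This is only a sketch under the assumption that it is run with access to the repository root; the function name `sgg_checkpoint_ready` and its `repo_root` parameter are hypothetical.

```python
from pathlib import Path

# Location and filename stated in the text above.
SGG_CHECKPOINT = Path("moai") / "sgg" / "checkpoints" / "psgtr_r50_epoch_60.pth"


def sgg_checkpoint_ready(repo_root="."):
    """Create moai/sgg/checkpoints if needed and report whether the file is present.

    `repo_root` is an assumed parameter: the directory that contains `moai/`.
    """
    target = Path(repo_root) / SGG_CHECKPOINT
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.is_file()


if __name__ == "__main__":
    print("SGG checkpoint found:", sgg_checkpoint_ready())
```

The sketch never downloads anything; it only prepares the folder and tells you whether the file still needs to be copied in.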

In the init_detector function in mmdet/apis/inference.py, lines 95-110 should be commented out for compatibility.

# if palette != 'none':
#     model.dataset_meta['palette'] = palette
# else:
#     test_dataset_cfg = copy.deepcopy(config.test_dataloader.dataset)
#     # lazy init. We only need the metainfo.
#     test_dataset_cfg['lazy_init'] = True
#     metainfo = DATASETS.build(test_dataset_cfg).metainfo
#     cfg_palette = metainfo.get('palette', None)
#     if cfg_palette is not None:
#         model.dataset_meta['palette'] = cfg_palette
#     else:
#         if 'palette' not in model.dataset_meta:
#             warnings.warn(
#                 'palette does not exist, random is used by default. '
#                 'You can also set the palette to customize.')
#             model.dataset_meta['palette'] = 'random'

In the inference_detector function in mmdet/apis/inference.py, the code from line 179 onward should be replaced with the following lines.

# build the data pipeline
data_ = test_pipeline(data_)

data_['inputs'] = data_['inputs'].unsqueeze(0)
data_['data_samples'] = [data_['data_samples']]

# forward the model
with torch.no_grad():
    results = model.test_step(data_)[0]

In mmcv/transforms/processing.py, line 388 should be commented out for compatibility.

# results['img_shape'] = padded_img.shape[:2]
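
The files named in these compatibility notes live inside the installed packages, so the first task is finding them in the active environment. The sketch below only locates the files and does not edit them; `locate_package_file` is a hypothetical helper built on the standard-library `importlib.util.find_spec`. The line numbers quoted above may not match every installed version, so it is worth comparing the code at those lines against the snippets shown before commenting anything out.

```python
import importlib.util
from pathlib import Path


def locate_package_file(package, relative_path):
    """Return the path of `relative_path` inside an installed package, or None.

    Hypothetical helper: it only locates the file and never modifies it.
    """
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / relative_path
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Files named in the compatibility notes above.
    targets = [("mmdet", "apis/inference.py"), ("mmcv", "transforms/processing.py")]
    for package, rel in targets:
        print(package, rel, "->", locate_package_file(package, rel))
```

A result of `None` means the package is not installed in the current environment or the file is laid out differently there.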

Download the MoAI model and then run the demo script:

"""
MoAI-7B

Simple Six Steps
"""

# [1] Loading Image
from PIL import Image
from torchvision.transforms import Resize
from torchvision.transforms.functional import pil_to_tensor
image_path = "figures/moai_mystery.png"
image = Resize(size=(490, 490), antialias=False)(pil_to_tensor(Image.open(image_path)))

# [2] Instruction Prompt
prompt = "Describe this image in detail."

# [3] Loading MoAI
from moai.load_moai import prepare_moai
moai_model, moai_processor, seg_model, seg_processor, od_model, od_processor, sgg_model, ocr_model \
    = prepare_moai(moai_path='/mnt/ssd/lbk-cvpr/MoAI/final', bits=4, grad_ckpt=False, lora=False, dtype='fp16')

# [4] Pre-processing for MoAI
moai_inputs = moai_model.demo_process(image=image, 
                                    prompt=prompt, 
                                    processor=moai_processor,
                                    seg_model=seg_model,
                                    seg_processor=seg_processor,
                                    od_model=od_model,
                                    od_processor=od_processor,
                                    sgg_model=sgg_model,
                                    ocr_model=ocr_model,
                                    device='cuda:0')

# [5] Generate
import torch
with torch.inference_mode():
    generate_ids = moai_model.generate(**moai_inputs, do_sample=True, temperature=0.9, top_p=0.95, max_new_tokens=256, use_cache=True)

# [6] Decoding
answer = moai_processor.batch_decode(generate_ids, skip_special_tokens=True)[0].split('[U')[0]
print(answer)
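
Step [6] keeps only the text that precedes the first `[U` in the decoded string. A minimal sketch of that post-processing on a plain string follows; the function name `trim_answer` and the sample string are made up for illustration and are not real model output.

```python
def trim_answer(decoded, marker="[U"):
    """Keep the text before the first `marker`, as step [6] does with split('[U')[0]."""
    return decoded.split(marker)[0]


# Made-up decoded string, for illustration only.
example = "A stone statue stands on a grassy hill.[U trailing tokens"
print(trim_answer(example))
```

If the marker is absent, `split` returns the whole string as its only element, so the answer passes through unchanged.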

To validate zero-shot performance on numerous datasets, run the bash file 'run'.

GPU_DEVICE="0,1,2,3,4,5"
length=${#GPU_DEVICE}
n_gpu=$(((length+1)/2))
main_port=10000
test_batch=1 # (Required)

CUDA_VISIBLE_DEVICES=$GPU_DEVICE \
accelerate launch --config_file configs/accel/ddp_accel.yaml \
    --num_processes=$n_gpu \
    --main_process_port=$main_port \
    lbk_entry.py eval \
    --conf_files configs/moai_eval.yaml \
    --overrides \
    WANDB False \
    DATASETS.TEST mme \
    PIPELINE MMEPipeline \
    MME.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SCIENCEQA.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    POPE.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MMBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MMVET.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    AI2D.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    HALLUSIONBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MATHVISTA.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    QBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SEED.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SAVE_DIR /path/to/MoAI_DIR \
    WEIGHT True \
    RESUME_FROM /path/to/MoAI_WEIGHT
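
The script derives `n_gpu` from the character length of `GPU_DEVICE`, which equals the number of devices only when every ID is a single digit. The sketch below mirrors that arithmetic in Python next to a variant that counts comma-separated entries; all three function names are hypothetical and nothing here is part of the repository.

```python
def n_gpu_from_length(gpu_device):
    """Mirror the bash arithmetic: (string length + 1) // 2."""
    return (len(gpu_device) + 1) // 2


def n_gpu_from_ids(gpu_device):
    """Count comma-separated device IDs, which also handles IDs such as 10 or 11."""
    return len([d for d in gpu_device.split(",") if d.strip()])


def batch_size_total(gpu_device, test_batch=1):
    """Value passed to the *.TEST.BATCH_SIZE_TOTAL overrides: n_gpu * test_batch."""
    return n_gpu_from_ids(gpu_device) * test_batch


print(n_gpu_from_length("0,1,2,3,4,5"), n_gpu_from_ids("0,1,2,3,4,5"))  # 6 6
print(n_gpu_from_length("8,9,10,11"), n_gpu_from_ids("8,9,10,11"))      # 5 4
```

With two-digit device IDs the length-based count overshoots, so setting `n_gpu` by hand may be safer on machines with more than ten GPUs.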

Note that you should change the following two parts to evaluate the dataset you want. (This is very important!)

DATASETS.TEST

PIPELINE

GPT-4-Aided Evaluation for AI2D, MM-Vet, SEED

This code will be made public soon!

🏅 Download Datasets

📂 Dataset Directory (/path/to/dataset)

.
├── LLVisionQA-QBench               # Q-Bench
├── ScienceQA                       # SQA-IMG
├── TextVQA                         # TextVQA
├── POPE                            # POPE
├── MME_Benchmark_release_version   # MME
├── MMBench                         # MM-Bench
├── mm-vet                          # MM-Vet
├── MathVista                       # MathVista
├── SEED-Bench                      # SEED-IMG
├── ai2d                            # AI2D
└── HallusionBench                  # HallusionBench
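
To confirm that a dataset root matches the layout above, a small check like the following may be useful. It is a sketch, not repository code: `missing_dataset_dirs` is a hypothetical helper, and reading the root from the `DATASET` variable is an assumption based on the environment setup described earlier.

```python
import os
from pathlib import Path

# Folder names copied from the directory listing above.
EXPECTED_DIRS = [
    "LLVisionQA-QBench", "ScienceQA", "TextVQA", "POPE",
    "MME_Benchmark_release_version", "MMBench", "mm-vet",
    "MathVista", "SEED-Bench", "ai2d", "HallusionBench",
]


def missing_dataset_dirs(dataset_root):
    """Return the expected folders that are absent under `dataset_root`."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumption: DATASET points at /path/to/dataset.
    root = os.environ.get("DATASET", ".")
    print("Missing folders:", missing_dataset_dirs(root))
```

Only the folders for the benchmarks you plan to evaluate need to be present, so a non-empty result is not necessarily an error.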