Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
This repository provides the official PyTorch implementation of the following paper:
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate <br> Qidong Huang<sup>1,2</sup>, Xiaoyi Dong<sup>2,3</sup>, Pan Zhang<sup>2</sup>, Yuhang Zang <sup>2</sup>, Yuhang Cao <sup>2</sup>, Jiaqi Wang<sup>2</sup>, Dahua Lin<sup>2</sup>, Weiming Zhang<sup>1</sup>, Nenghai Yu<sup>1</sup> <br> <sup>1</sup>University of Science and Technology of China, <sup>2</sup>Shanghai AI Laboratory, <sup>3</sup>The Chinese University of Hong Kong <br>
🎯 News
[2024.10.10] 🚀 We release the paper on arXiv and HuggingFace!
[2024.10.10] 🚀 This project page has been built!
👨💻 Todo
- Release the code of MIR
- Release the training code and evaluation code of MoCa
- Release the checkpoints of MoCa
⭐️ TL;DR
1. For MIR
If you just want to use MIR as the pre-training indicator of your own model, no additional environment is required.
- Ensure that packages such as `torch`, `numpy`, and `scipy` are installed.
- Replace the model preprocessing and generation in `mir.py` with your own model's code. We provide LLaVA's code as a reference.
- Specify the input args and run the command:
python mir.py --model_path PATH/TO/MODEL --base_llm PATH/TO/LLM --text_data_path PATH/TO/TEXT/DATA --image_data_path PATH/TO/VISION/DATA --eval_num 100 --mode fast
Note that `base_llm` is not required if you train the base LLM during pre-training and include its checkpoint in the `model_path`.
You can also adjust the args to match the initialization style of your model.
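As a quick pre-flight check for the first step above, the following minimal sketch reports which of the listed packages cannot be found in the current environment. The helper name `missing_packages` is hypothetical and not part of this repository; it only uses the standard library.

```python
import importlib.util

# Packages this README lists as prerequisites for running mir.py.
REQUIRED_PACKAGES = ("torch", "numpy", "scipy")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the top-level package names that cannot be located."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All MIR prerequisites are importable.")
```

The check only confirms that the packages can be located; it does not verify versions or GPU support.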
2. For MoCa
If you just want to use MoCa on your own model, we recommend following the steps below:
- Copy the code of the MoCa module into the modeling code of your own model, and ensure that the base LLM layers are equipped with MoCa in both the initialization and forward functions.
- Make sure that the input preprocessing can compute the `modality_mask`; please refer to Line183-184, Line269-276 and Line373-382 in `llava/model/llava_arch.py`. Also make sure that the `modality_mask` is successfully passed into the model forward pass, e.g., by adding it as a formal parameter of each forward function, like Line70, Line88, Line96, Line106, Line127, Line137, Line145, Line157, Line166, Line174-175 in `llava/model/language_model/llava_llama.py`.
- Check the details needed to support `use_moca=True` (we recommend searching for `use_moca` in this repo to find the places that need to be revised):
  1. Add it to the model config (here).
  2. Add it to the training arguments (here).
  3. Unlock it during training (here).
  4. Ensure correct checkpoint saving (here1, here2, here3).
- Add `--use_moca` when running the training command to enable MoCa.
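To illustrate the `modality_mask` step, here is a minimal pure-Python sketch, not the repository's implementation. It assumes the mask is one boolean per position of the expanded input sequence, with `True` marking vision tokens, and that each image placeholder expands into a fixed number of vision tokens. The function name `build_modality_mask` and the `num_image_tokens` parameter are hypothetical; `-200` is the image placeholder index used in LLaVA's constants.

```python
IMAGE_TOKEN_INDEX = -200  # image placeholder id used by LLaVA


def build_modality_mask(input_ids, num_image_tokens,
                        image_token_index=IMAGE_TOKEN_INDEX):
    """Return one boolean per position of the expanded sequence.

    Assumption: True marks a vision token and False marks a text token.
    Each image placeholder in input_ids expands into num_image_tokens
    consecutive vision positions.
    """
    mask = []
    for token_id in input_ids:
        if token_id == image_token_index:
            mask.extend([True] * num_image_tokens)
        else:
            mask.append(False)
    return mask


if __name__ == "__main__":
    # Two text tokens, one image placeholder, one text token.
    print(build_modality_mask([1, 15, IMAGE_TOKEN_INDEX, 42], num_image_tokens=3))
```

In the actual codebase the mask is built alongside the multimodal input embeddings in `llava/model/llava_arch.py`, so the referenced lines remain the authoritative source for shapes and padding behavior.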
📜 Setup
If you want to use our codebase (modified from LLaVA) for reproduction, we recommend building a new environment through the steps below. The following steps are listed for Linux only. If you are using macOS or Windows, please refer to LLaVA.
- Clone this repository and navigate to the Modality-Integration-Rate folder
git clone https://github.com/shikiw/Modality-Integration-Rate.git
cd Modality-Integration-Rate
- Install Package
conda create -n llava python=3.10 -y
conda activate llava
python -m pip install --upgrade pip # enable PEP 660 support
python -m pip install -e .
python -m pip install -e transformers-4.37.2
- Install additional packages for training cases
python -m pip install -e ".[train]"
python -m pip install flash-attn --no-build-isolation
MIR
To reproduce the MIR implementation on this codebase, you can follow these steps:
- Specify the `text_data_path` and `image_data_path` for MIR calculation. You can also specify them as in Line55-64 of `mir.py`, which uses TextVQA val images and CNN/DM text by default, i.e.,
  - Download TextVQA_0.5.1_val.json and images and extract to `PATH/TO/VISION/DATA`.
  - Download CNN stories and extract to `PATH/TO/TEXT/DATA`.
  - Modify Line55-64 with the text data path and image data path.
- If you pre-train only the MLP, run this command:
python mir.py --model_path PATH/TO/MODEL --base_llm PATH/TO/LLM --eval_num 100 --mode fast
- If you pre-train any part of the ViT or the base LLM, run this command:
python mir.py --model_path PATH/TO/MODEL --eval_num 100 --mode fast
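The two commands above differ only in whether `--base_llm` is passed. A small sketch of that choice follows; the helper `build_mir_command` is hypothetical and simply assembles the flags already shown in this README.

```python
import shlex


def build_mir_command(model_path, base_llm=None, eval_num=100, mode="fast"):
    """Assemble the mir.py command line as a list of arguments.

    Pass base_llm when only the MLP was pre-trained; leave it as None
    when the checkpoint at model_path already contains the trained LLM.
    """
    cmd = ["python", "mir.py", "--model_path", model_path]
    if base_llm is not None:
        cmd += ["--base_llm", base_llm]
    cmd += ["--eval_num", str(eval_num), "--mode", mode]
    return cmd


if __name__ == "__main__":
    # MLP-only pre-training: the base LLM is loaded separately.
    print(shlex.join(build_mir_command("PATH/TO/MODEL", base_llm="PATH/TO/LLM")))
    # ViT or base LLM was tuned: the checkpoint is self-contained.
    print(shlex.join(build_mir_command("PATH/TO/MODEL")))
```

The returned list can be handed to `subprocess.run` directly, which avoids shell quoting problems with paths that contain spaces.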
MoCa
Our codebase supports `--use_moca` to activate the implementation of MoCa. Check out `scripts/v1_5/pre_sft_moca.sh` for more details.
Model | Size | Schedule | Average | MMStar | MME | MMB | MMB-CN | SEED-IMG | TextVQA | MM-Vet | POPE | GQA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5 | 7B | full_ft-1e | 59.1 | 30.3 | 1510.7 | 64.3 | 58.3 | 66.1 | 58.2 | 31.1 | 85.9 | 62.0 |
+MoCa | 7B | full_ft-1e | 60.6 | 36.5 | 1481.0 | 66.8 | 60.0 | 67.0 | 58.7 | 32.2 | 86.9 | 62.8 |
The pretrained and finetuned checkpoints are released.
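The README does not state how the Average column is computed. One reading that appears consistent with both rows is to divide the MME score by 20 so that it sits on roughly the same scale as the other benchmarks, then take the plain mean of the nine columns. The sketch below checks that reading against the table; the divisor is an assumption, not something confirmed by the source.

```python
BENCHMARKS = ["MMStar", "MME", "MMB", "MMB-CN", "SEED-IMG",
              "TextVQA", "MM-Vet", "POPE", "GQA"]

# Scores copied from the table above.
SCORES = {
    "LLaVA-v1.5": [30.3, 1510.7, 64.3, 58.3, 66.1, 58.2, 31.1, 85.9, 62.0],
    "+MoCa": [36.5, 1481.0, 66.8, 60.0, 67.0, 58.7, 32.2, 86.9, 62.8],
}

MME_DIVISOR = 20.0  # assumption: rescales MME to the range of the other columns


def rescaled_average(row):
    """Mean over all benchmarks after rescaling the MME entry."""
    rescaled = [score / MME_DIVISOR if name == "MME" else score
                for name, score in zip(BENCHMARKS, row)]
    return sum(rescaled) / len(rescaled)


def per_benchmark_delta(baseline, variant):
    """Difference (variant minus baseline) for each benchmark."""
    return {name: round(v - b, 1)
            for name, b, v in zip(BENCHMARKS, baseline, variant)}


if __name__ == "__main__":
    for model, row in SCORES.items():
        print(model, rescaled_average(row))
    print(per_benchmark_delta(SCORES["LLaVA-v1.5"], SCORES["+MoCa"]))
```

Under this reading, MoCa improves every benchmark in the table except MME, where the raw score drops.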
Train
This codebase is based on LLaVA and ShareGPT4V. We introduce some new features, and the launch script now supports the following inputs:
- `--tune_vision_tower` and `--tune_vit_from_layer`
- `--tune_language_model` and `--tune_llm_utill_layer`
- `--tune_entire_model`
- `--data_scale`
- `--use_moca` and `--moca_std`
Some cases for reference:
- To pre-train the model with the customized data scale (e.g., 200K):
sh scripts/v1_5/pre_data_scale.sh
- To pre-train the model (unlocking layers 13-24 of the ViT and layers 1-16 of the base LLM) and then run SFT (unlocking the entire LLM by default):
sh scripts/v1_5/pre_unlock_vit-12_llm-16_sft.sh
- To pre-train the model (unlocking layers 13-24 of the ViT and the entire base LLM) and then run SFT (unlocking the entire LLM by default):
sh scripts/v1_5/pre_unlock_vit-12_llm-all_sft.sh
- To apply MoCa in training:
sh scripts/v1_5/pre_sft_moca.sh
Evaluation
We follow the original LLaVA evaluation for most benchmarks. For MMStar, we use VLMEvalKit.
See Evaluation.md.
Acknowledgement
This repo is based on the codebases of LLaVA and ShareGPT4V. Thanks for their impressive work!
Citation
If you find this work useful for your research, please cite our paper:
@article{huang2024deciphering,
title={Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate},
author={Huang, Qidong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and Wang, Jiaqi and Lin, Dahua and Zhang, Weiming and Yu, Nenghai},
journal={arXiv preprint arXiv:2410.07167},
year={2024}
}