<p align="center"> <img src="./images/INF-LLaVA.png" width="250" style="margin-bottom: 0.2;"/> <p> <p align="center"> <a href='https://www.arxiv.org/abs/2407.16198'> <img src='https://img.shields.io/badge/Paper-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a> <a href='https://huggingface.co/collections/WeihuangLin/inf-llava-669be442004e418e71fea201' style='padding-left: 0.5rem;'> <img src='https://img.shields.io/badge/Huggingface%20Model-8A2BE2' alt='Project Page'> </a> </p>π INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
This repository contains the PyTorch code and model weights of INF-LLaVA, a novel MLLM designed for high-resolution image perception and reasoning.
INF-LLaVA has the following features to process high-resolution images:
- Dual-perspective Cropping Module (DCM): integrates both global and local perspectives when cropping high-resolution images into sub-images, enhancing the model's ability to capture both detailed and contextual information (a conceptual sketch follows this list).
- Dual-perspective Enhancement Module (DEM): an effective and efficient module for fusing dual-perspective features, producing dual-enhanced features that significantly improve performance.
- Strong Performance: INF-LLaVA outperforms existing models on multiple benchmarks, demonstrating the effectiveness of our approach. Check out our Model Zoo below.
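As a rough illustration of the cropping idea, the snippet below is a minimal conceptual sketch, not the repository's DCM implementation: the function name `dual_perspective_crop` and the `grid` parameter are invented for this example, and the exact global-perspective scheme used in the paper may differ. Here, local crops are contiguous full-resolution tiles, while global crops are interleaved samplings that each cover the whole scene at reduced resolution.

```python
import torch

def dual_perspective_crop(image: torch.Tensor, grid: int = 2):
    """Illustrative dual-perspective cropping (NOT the official implementation).

    image: (C, H, W) tensor whose H and W are divisible by `grid`.
    Returns (local_crops, global_crops), each of shape (grid*grid, C, H//grid, W//grid).
    """
    c, h, w = image.shape
    ph, pw = h // grid, w // grid

    # Local perspective: contiguous tiles, each keeping full-resolution detail
    # of one region of the image.
    local = (image
             .unfold(1, ph, ph)          # (C, grid, W, ph)
             .unfold(2, pw, pw)          # (C, grid, grid, ph, pw)
             .permute(1, 2, 0, 3, 4)
             .reshape(-1, c, ph, pw))

    # Global perspective: strided (interleaved) sampling, so every sub-image is a
    # low-resolution view that still spans the whole scene.
    global_ = torch.stack(
        [image[:, i::grid, j::grid] for i in range(grid) for j in range(grid)]
    )

    return local, global_


# Usage: a 1024x1024 image becomes 4 local and 4 global 512x512 sub-images.
img = torch.rand(3, 1024, 1024)
local_crops, global_crops = dual_perspective_crop(img)
print(local_crops.shape, global_crops.shape)  # torch.Size([4, 3, 512, 512]) each
```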
News !!
- 🔥 [2024-07-19] Released the INF-LLaVA checkpoints on Hugging Face.
- 🔥 [2024-07-16] Released the code of INF-LLaVA.
To-Do Lists
- Release an INF-LLaVA model based on Llama 3.1.
- Release strong INF-LLaVA models.
- Release the INF-LLaVA training code.
Table of Contents
- Install
- Train
- Evaluate
- Model Zoo
- License
- Citation
- Acknowledgement
Install
- Clone this repository and navigate to the INF-LLaVA folder
git clone https://github.com/WeihuangLin/INF-LLaVA.git
cd INF-LLaVA
- Install Package
conda create -n inf-llava python=3.10 -y
conda activate inf-llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
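After installation, a quick sanity check can confirm that the environment imports cleanly. This is a minimal, optional sketch; `flash_attn` will only be present if you installed the training extras above.

```python
# Optional sanity check for the freshly created environment.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # installed only with the training extras
    print("FlashAttention:", flash_attn.__version__)
except ImportError:
    print("FlashAttention not installed (fine for inference-only use).")
```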
Train
- Pre-train
cd INF-LLaVA
bash INF-LLava_pretrain.sh
Note: replace `data_path` and `image_folder` in `INF-LLava_pretrain.sh` with your own dataset paths.
- Finetune
cd INF-LLaVA
bash INF-LLava_finetune.sh
Note: replace `data_path` and `image_folder` in `INF-LLava_finetune.sh` with your own dataset paths.
You can download our pretrained weights from the Model Zoo below.
Evaluate
We follow lmms-eval to conduct evaluations; please refer to lmms-eval for details. We provide the same scripts for running the evaluation.
Model Zoo
| Version | Checkpoint |
|---|---|
| $INF-LLaVA$ | 🤗 WeihuangLin/INF-LLaVA-sft |
| $INF^*-LLaVA$ | 🤗 WeihuangLin/INF_star-LLaVA-sft |
$INF^*-LLaVA$ denotes the variant trained on a more diverse dataset.
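To fetch a checkpoint locally, you can use the Hugging Face Hub client. The snippet below is a minimal sketch; the target directory is just an illustrative choice.

```python
from huggingface_hub import snapshot_download

# Download the released INF-LLaVA checkpoint from the Hugging Face Hub.
# "./checkpoints/INF-LLaVA-sft" is only an example target directory.
local_path = snapshot_download(
    repo_id="WeihuangLin/INF-LLaVA-sft",
    local_dir="./checkpoints/INF-LLaVA-sft",
)
print("Checkpoint downloaded to:", local_path)
```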
License
This project is released under the Apache 2.0 license.
Citation
If you find this project useful in your research, please consider citing:
@misc{ma2024infllava,
title={INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model},
author={Yiwei Ma and Zhibin Wang and Xiaoshuai Sun and Weihuang Lin and Qiang Zhou and Jiayi Ji and Rongrong Ji},
journal={arXiv preprint arXiv:2407.16198},
year={2024}
}
Acknowledgement
We are thankful to LLaVA, lmms-eval, and Llama 3 for releasing their models and code as open-source contributions.
If you face any issues or have any questions, please feel free to open an issue.