<p align="center"> <img src="./images/INF-LLaVA.png" width="250" style="margin-bottom: 0.2;"/> </p> <p align="center"> <a href='https://www.arxiv.org/abs/2407.16198'> <img src='https://img.shields.io/badge/Paper-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a> <a href='https://huggingface.co/collections/WeihuangLin/inf-llava-669be442004e418e71fea201' style='padding-left: 0.5rem;'> <img src='https://img.shields.io/badge/Huggingface%20Model-8A2BE2' alt='Hugging Face Model'> </a> </p>

# πŸŒ‹ INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

This repository contains the PyTorch code and model weights of INF-LLaVA, a novel multimodal large language model (MLLM) designed for high-resolution image perception and reasoning.

INF-LLaVA uses dual-perspective perception to process high-resolution images.

## News

## To-Do Lists

## Table of Contents

- Install
- Train
- Evaluate
- Model Zoo
- License
- Citation
- Acknowledgement

## Install

1. Clone this repository and navigate to the INF-LLaVA folder:

```bash
git clone https://github.com/WeihuangLin/INF-LLaVA.git
cd INF-LLaVA
```

2. Install the package:

```bash
conda create -n inf-llava python=3.10 -y
conda activate inf-llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training:

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
```
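After installation, you can run a quick sanity check (a minimal sketch; it only assumes the steps above completed) to confirm the core dependencies import:

```bash
# Verify that PyTorch sees a GPU and that flash-attn built correctly.
python -c "import torch; print('torch', torch.__version__, '| CUDA:', torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn', flash_attn.__version__)"
```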

## Train

1. Pre-train:

```bash
cd INF-LLaVA
bash INF-LLava_pretrain.sh
```

Note: replace `data_path` and `image_folder` in `INF-LLava_pretrain.sh` with your own paths; the sketch below shows one way to locate the lines to edit.
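The flag names inside the script are assumed to follow LLaVA-style training scripts, and the paths below are placeholders, not shipped defaults:

```bash
# Locate the arguments that need your own paths (flag names assumed LLaVA-style).
grep -n -E "data_path|image_folder" INF-LLava_pretrain.sh

# Example substitution targets (placeholders -- point these at your own data):
#   --data_path    /path/to/pretrain_annotations.json
#   --image_folder /path/to/pretrain_images
```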

2. Finetune:

```bash
cd INF-LLaVA
bash INF-LLava_finetune.sh
```

Note: replace `data_path` and `image_folder` in `INF-LLava_finetune.sh` with your own paths; a quick path check is sketched below.
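Before launching, it can help to verify that the substituted paths actually exist; a minimal check with placeholder paths:

```bash
# Placeholder paths -- use the values you put into INF-LLava_finetune.sh.
DATA_PATH=/path/to/finetune_annotations.json
IMAGE_FOLDER=/path/to/finetune_images

# Fail early if either path is wrong.
[ -f "$DATA_PATH" ] || echo "missing annotation file: $DATA_PATH"
[ -d "$IMAGE_FOLDER" ] || echo "missing image folder: $IMAGE_FOLDER"
```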

You can download our pretrained weights from the [Model Zoo](#model-zoo) below.

## Evaluate

We follow [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to conduct evaluations; please refer to the lmms-eval documentation for setup. We provide the same scripts to complete the testing.
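For reference, a typical lmms-eval invocation looks like the sketch below; the `--model` name, task, and checkpoint path are illustrative assumptions, so check the lmms-eval documentation for the options that match this repo:

```bash
# Illustrative lmms-eval run (flags follow lmms-eval's documented CLI;
# model name, task, and checkpoint path are placeholders).
python -m lmms_eval \
    --model llava \
    --model_args pretrained=/path/to/INF-LLaVA-sft \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```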

## Model Zoo

| Version | Checkpoint |
| :-- | :-- |
| $INF-LLaVA$ | πŸ€— [WeihuangLin/INF-LLaVA-sft](https://huggingface.co/WeihuangLin/INF-LLaVA-sft) |
| $INF^*-LLaVA$ | πŸ€— [WeihuangLin/INF_star-LLaVA-sft](https://huggingface.co/WeihuangLin/INF_star-LLaVA-sft) |

$INF^*-LLaVA$ denotes the model trained on a more diverse dataset.
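One way to fetch a checkpoint locally is with the Hugging Face CLI (a sketch; assumes `huggingface_hub` is installed and uses a placeholder target directory):

```bash
# Download the SFT checkpoint into a local directory of your choice.
huggingface-cli download WeihuangLin/INF-LLaVA-sft --local-dir ./checkpoints/INF-LLaVA-sft
```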

## 🎫 License

This project is released under the Apache 2.0 license.

πŸ–ŠοΈ Citation

If you find this project useful in your research, please consider citing:



```bibtex
@misc{ma2024infllava,
      title={INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model},
      author={Yiwei Ma and Zhibin Wang and Xiaoshuai Sun and Weihuang Lin and Qiang Zhou and Jiayi Ji and Rongrong Ji},
      journal={arXiv preprint arXiv:2407.16198},
      year={2024}
}
```

πŸ™ Acknowledgement

We are thankful to [LLaVA](https://github.com/haotian-liu/LLaVA), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), and Llama 3 for releasing their models and code as open-source contributions.

If you face any issues or have any questions, please feel free to open an issue.