<div align="center"> <h2><a href="https://github.com/alibaba/conv-llava">ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models</a></h2>Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao
Jun Song, Shiji Song, Gao Huang, Bo Zheng
</div> <p align="center"> <a href="http://arxiv.org/abs/2405.15738"> <img src="https://img.shields.io/badge/arXiv-2405.15738-b31b1b.svg?logo=arXiv"> </a> <a href="https://huggingface.co/collections/ConvLLaVA/convllava-66519ef0ccdee62544bd19bf"> <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Models-ffd21e"> </a> <a href="https://huggingface.co/papers/2405.15738"> <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Paper-ffd21e"> </a> <a href="https://modelscope.cn/organization/ConvLLaVA?tab=model"> <img src="https://img.shields.io/badge/🤖%20ModelScope-Models-5f4cf2.svg"> </a> <a href="https://wisemodel.cn/organization/ConvLLaVA"> <img src="https://img.shields.io/badge/WiseModel-Models-571282.svg"> </a> <a href="https://github.com/alibaba/conv-llava/blob/main/asset/WeChat.png"> <img src="https://img.shields.io/badge/WeChat-Group-5ef27f.svg"> </a> <a href="https://github.com/alibaba/conv-llava/stargazers"> <img alt="GitHub stars" src="https://img.shields.io/github/stars/alibaba/conv-llava?color=ccf" /> </a> </p><span>[ English | <a href="README_zh.md">中文</a> ]</span>
## Abstract
High-resolution Large Multimodal Models (LMMs) face the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity but still generate excessive visual tokens. The redundancy in visual tokens is the key problem, as it leads to substantially more compute. To mitigate this, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of the LMM in place of the Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively avoiding the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations.
- Since the low-resolution pretrained ConvNeXt underperforms when applied directly to high-resolution inputs, we update it to bridge the gap.
- Furthermore, since ConvNeXt's original compression ratio is insufficient for much higher resolution inputs, we train a successive stage to further compress the visual tokens, effectively reducing redundancy.
These optimizations enable ConvLLaVA to support inputs of 1536x1536 resolution while generating only 576 visual tokens, accommodating images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks.
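As a rough sanity check on these numbers (not code from the repository): ConvNeXt natively downsamples by 32×, and the additional compression stage brings the overall ratio to 64×, which is what 1536×1536 → 576 tokens implies. A minimal sketch of the arithmetic:

```python
# Back-of-the-envelope visual token counts for ConvLLaVA's reported configurations.
# Assumes an overall 64x spatial downsampling (ConvNeXt's 32x plus one extra stage).
def visual_tokens(resolution: int, downsample: int = 64) -> int:
    side = resolution // downsample  # tokens per spatial dimension
    return side * side

for res in (768, 1024, 1536):
    print(f"{res}x{res} -> {visual_tokens(res)} visual tokens")
# 768 -> 144, 1024 -> 256, 1536 -> 576, matching the configurations in the Model Zoo below.
```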
<div align="center"> <img src="asset/method.png" width="600" /> </div> <div align="center"> <figcaption>Comparison between LLaVA and ConvLLaVA.</figcaption> </div>

## Release :loudspeaker:
- 2024/05/25: Checkpoints are released.
- 2024/04/17: Our code is released.
If you are interested in Large Multimodal Models or have great ideas, please feel free to email me: Chunjiang Ge.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
## Contents
- Abstract
- Release :loudspeaker:
- Contents
- TODO
- Install
- Model Zoo
- Dataset
- Train
- Evaluation
- Citation
- Acknowledgement
## TODO
- Add LMMs-eval supports.
- Add VLMEvalKit supports.
- Add xtuner supports.
- Release weights.
- Release inference code.
## Install
- Clone this repository and navigate to the ConvLLaVA folder

```bash
git clone https://github.com/alibaba/conv-llava
cd conv-llava
```
- Install the package

```bash
conda create -n convllava python=3.11 -y
conda activate convllava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
- Install additional packages for training

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
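After installation, an optional sanity check (illustrative only; assumes a CUDA-capable machine and that the commands above completed successfully) can confirm that the core dependencies import correctly:

```python
# Optional post-install sanity check (not part of the repository).
import torch
import flash_attn  # built by `pip install flash-attn --no-build-isolation`

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())
```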
## Model Zoo
The performance on mainstream benchmarks is shown below:
| Method | Resolution | Visual Tokens | LLM | MME | MMB | SEED | RealWorldQA | MMMU | MMVet | TextVQA | DocVQA | POPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvLLaVA | 768 | 144 | 7B | 1541 | 68 | 68.8 | 55.9 | 36.3 | 44.8 | 59.1 | 44.8 | 87.3 |
| ConvLLaVA | 1024 | 256 | 7B | 1553 | 68.8 | 69.3 | 58.8 | 35.1 | 44.4 | 62.5 | 48.5 | 87.7 |
| ConvLLaVA | 1536 | 576 | 7B | 1575 | 68.7 | 70.2 | 59.9 | 35.8 | 45.9 | 65.8 | 59 | 87.3 |

| Method | Resolution | Visual Tokens | LLM | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCOg val | RefCOCOg test | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvLLaVA | 768 | 144 | 7B | 84.5 | 89.0 | 79.2 | 77.7 | 84.9 | 69.7 | 79.8 | 79.7 | 80.6 |
| ConvLLaVA | 1024 | 256 | 7B | 85.5 | 89.6 | 78.8 | 79.3 | 86.1 | 70.3 | 80.6 | 81.2 | 81.4 |
| ConvLLaVA | 1536 | 576 | 7B | 86.5 | 90.6 | 80.5 | 80.0 | 86.8 | 71.5 | 82.0 | 82.4 | 82.3 |

Please check out our Model Zoo for all public ConvLLaVA checkpoints, along with instructions on how to use the weights.
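If you prefer to fetch a checkpoint programmatically rather than through the web UI, something like the following works with `huggingface_hub`; the repo id below is a placeholder, so substitute the exact name from the ConvLLaVA collection:

```python
# Illustrative download sketch; the repo_id is hypothetical -- use the exact
# checkpoint name listed in the ConvLLaVA Hugging Face collection.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ConvLLaVA/ConvLLaVA-1536")  # placeholder id
print("Checkpoint files downloaded to:", local_dir)
```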
## Dataset
The data we use is introduced in Data.md.
## Train
We use the following hyperparameters for training ConvLLaVA.
Hyperparameters | Stage 1 | Stage 2 | Stage 3 |
---|---|---|---|
Learning Rate | 3e-4 | 2e-5 | 2e-5 |
Batch Size | 256 | 256 | 128 |
Epochs | 1 | 1 | 1 |
Warmup Ratio | 0.03 | 0.03 | 0.03 |
Weight Decay | 0 | 0 | 0 |
Optimizer | AdamW | AdamW | AdamW |
The training scripts are in the `scripts` directory.
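For orientation only, here is how the Stage 3 row of the table above might map onto Hugging Face `TrainingArguments`; the per-device batch size, GPU count, scheduler, and output path are assumptions, and the actual scripts in `scripts/` are authoritative:

```python
# Rough sketch of the Stage 3 hyperparameters as TrainingArguments.
# Not taken from the repository: batch-size split, scheduler, and paths are assumptions.
from transformers import TrainingArguments

stage3_args = TrainingArguments(
    output_dir="./checkpoints/convllava-stage3",  # hypothetical output path
    learning_rate=2e-5,                           # from the table above
    num_train_epochs=1,
    warmup_ratio=0.03,
    weight_decay=0.0,
    optim="adamw_torch",                          # AdamW
    per_device_train_batch_size=16,               # 16 x 8 GPUs = global batch size 128 (assumed GPU count)
    lr_scheduler_type="cosine",                   # assumption; common in LLaVA-style recipes
)
```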
## Evaluation
We now support VLMEvalKit and lmms-eval for evaluating our model. See Evaluation.md for more details.
## Citation
If you find ConvLLaVA useful for your research and applications, please cite it using this BibTeX:
```bibtex
@misc{ge2024convllava,
      title={ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models},
      author={Chunjiang Ge and Sijie Cheng and Ziming Wang and Jiale Yuan and Yuan Gao and Jun Song and Shiji Song and Gao Huang and Bo Zheng},
      year={2024},
      eprint={2405.15738},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```