
<!-- PROJECT LOGO --> <br /> <div align="center"> <a href="https://github.com/OPPOMKLab/u-LLaVA"> <img src="./images/logo.png" alt="Logo" width="80" height="80"> </a> <h3 align="center">u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model</h3> <p align="center"> Multi-modal, multi-task LLM <br /> <a href="https://github.com/OPPOMKLab/u-LLaVA/blob/main/README.md"><strong> Documentation</strong></a> | <a href="https://github.com/OPPOMKLab/u-LLaVA/blob/main/README_zh.md"><strong> 中文文档 </strong></a> <br /> <br /> <a href="https://arxiv.org/abs/2311.05348">Paper</a> · <a href="https://github.com/OPPOMKLab/u-LLaVA/issues">Report Bug</a> · <a href="https://github.com/OPPOMKLab/u-LLaVA/issues">Request Feature</a> </p> </div>

🎉 News

<!-- TABLE OF CONTENTS --> <details> <summary>Table of Contents</summary> <ol> <li> <a href="#about-the-project">About The Project</a> <ul> <li><a href="#features">Features</a></li> </ul> </li> <li><a href="#Results">Results</a></li> <li> <a href="#getting-started">Getting Started</a> <ul> <li><a href="#requirements">Requirements</a></li> <li><a href="#datasets">Datasets</a></li> <li><a href="#training">Training</a></li> <li><a href="#evaluation">Evaluation</a></li> </ul> </li> <li><a href="#license">License</a></li> <li><a href="#citation">Citation</a></li> <li><a href="#acknowledgments">Acknowledgments</a></li> </ol> </details> <!-- ABOUT THE PROJECT -->

About The Project

Structure:

<div align="center"> <img src=./images/llm.png width=70%> </div>

Examples

<div align="center"> <img src=./images/exp1.png width=70%> </div> <div align="center"> <img src=./images/exp2.png width=70%> </div> <div align="center"> <img src=./images/exp3.png width=70%> </div> <p align="right">(<a href="#readme-top">back to top</a>)</p>

Demo is coming soon.

<!-- Features -->

Features

Code

Task

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- Models -->

Model Release

| Models | Images/Videos |
| :----: | :-----------: |
| u-LLaVA | uLLaVA Stage 2 |
<!-- RESULTS -->

Results

RES

<div align="center"> <img src=./images/res.png width=60%> </div>

REC

<div align="center"> <img src=./images/rec.png width=60%> </div>

Salient Segmentation

<div align="center"> <img src=./images/salient.png width=40%> </div>

General MLLM

| Fine-tune | ScienceQA | MM-Bench | Seed-Bench |
| :-------: | :-------: | :------: | :--------: |
| u-LLaVA-7B | 87.74 | soon | soon |

Video QA

| Zero-shot | Accuracy (Type 3) |
| :-------: | :---------------: |
| Activity-QA | 51.70% |
<!-- GETTING STARTED -->

Getting Started

<!-- Requirements -->

Requirements

Run the following commands in a terminal:

pip install -r ./shells/requirements.txt
cd ./models/GroundingDINO && ./install.sh && cd ../..

Why are these needed?

  1. Install the requirements: pip install -r requirements.txt
  2. Build the CUDA ops for GroundingDINO: cd ./models/GroundingDINO && ./install.sh && cd ../.. If this step is skipped, you may see the warning: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
<!-- Datasets -->

Datasets

Annotation download links: ullava modified annotations, LLaVA pretrain annotations, and LLaVA finetuning annotations

Image storage (download link can be found in the table):

image_root
├─ade20k
│  ├─annotations
│  └─images
├─coco2014
│  ├─test2014
│  ├─train2014
│  └─val2014
├─coco2017
│  ├─annotations
│  ├─train2017
│  └─val2017
├─cocostuff
│  ├─train2017
│  └─val2017
├─LLaVA-CC3M-Pretrain-595K
│  └─images
├─saiapr_tc-12
│  ├─00
│  └─01
└─vlpart
    ├─paco
    │  └─annotations
    └─pascal-part
        ├─Annotations_Part
        ├─examples
        └─VOCdevkit

where ade20k and cocostuff are extracted from ADEChallengeData2016.zip and stuffthingmaps_trainval2017.zip, respectively.

Stage I: Pre-training

| Dataset | Images/Videos | Annotations |
| :-----: | :-----------: | :---------: |
| LLaVA CC3M | LLaVA-CC3M-Pretrain-595K/image.zip | chat.json |
| TGIF | TGIF - Quark Drive | tgif.json |

Note: We have renamed the TGIF dataset and removed invalid samples to facilitate training, but please follow the original LICENSE.
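
The Stage I datasets are declared with the same schema as the Dataset config example shown further below. A minimal sketch for the LLaVA CC3M entry, assuming the dataset key name (llava_cc3m) and the vis_processor value carry over unchanged; the authoritative entry lives in configs/train/ullava_core_stage1.yaml:

```yaml
dataset:
  llava_cc3m:                  # assumed key name; check configs/train/ullava_core_stage1.yaml for the real one
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/chat.json'
      image_dir: '/path_to_image_root/LLaVA-CC3M-Pretrain-595K/images'
      portion: 1.0
    vis_processor: 'clip_image'
```

A TGIF entry would be declared in the same way, presumably with a video data_type and tgif.json as its annotation file.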

Stage II: Fine-tuning

| Dataset | Images | Annotations |
| :-----: | :----: | :---------: |
| LLaVA Instruction 150K | coco2017 | llava_instruct_150k.json |
| RefCOCO | coco2014 | refcoco_train.json |
| RefCOCOg | coco2014 | refcocog_train.json |
| RefCOCO+ | coco2014 | refcoco+_train.json |
| RefCLEF | saiapr_tc-12 | refclef_train.json |
| ADE20K | ade20k | ade20k.json |
| COCO Stuff | cocostuff | cocostuff.json |
| VOC2010 | voc2010 | pascal_part.json |
| PACO LVIS | paco | paco_lvis.json |
| Salient 15K | msra | ullava_salinet_15k.json |

Note: Please download the MSRA-10K and MSRA-B images from their official sites; thanks to the authors for sharing.

Dataset config example

dataset:
  llava:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/llava_instruct_150k.json'
      image_dir: '/path_to_image_root/coco2017/train2017'
      portion: 1.0
    vis_processor: 'clip_image'

  refcoco+:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/refcoco+_train.json'
      image_dir: '/path_to_image_root/coco2014'
      template_root: './datasets/templates/SEG.json'
      portion: 1.0
    vis_processor: 'clip_image'

Note:

  1. We re-organized most of the dataset annotations for easier training, but users must still follow the rules and licenses of the original datasets.
<!-- Training -->

Training

Stage I: Pre-training

  1. Prepare the open-source foundation models
| Foundation model | Version | Path |
| :--------------: | :-----: | :--: |
| Vicuna 7B HF | V1.1 | vicuna_7b_v1.1 |
| LLaMA2 7B HF | - | meta-llama/Llama-2-7b-hf |
| SAM | ViT-H | sam_vit_h_4b8939.pth |
| GroundingDINO | swint_ogc | groundingdino_swint_ogc.pth |

Note:

- LLaMA2 is trained with bf16; convergence errors may occur when running Stage I training with fp16.

- The default tokenizer.legacy of Llama-2 is False, which may raise tokenization mismatch errors with some conversation templates.

- Errata: The base LLM used in the paper is Vicuna-v1.1, not LLaMA2. Sorry about the mistake.

  2. Prepare datasets
  3. Set the config in
configs/train/ullava_core_stage1.yaml

Note: set all dataset paths and the output path according to your experiments.

  4. Train Stage I with multiple GPUs

./shells/pretrain.sh

or python train_ullava_core.py --cfg_path './configs/train/ullava_core_stage1.yaml' for 1 GPU.

Stage I training on 4 A100 80G GPUs with bf16 takes about 6 hours per epoch. Afterwards you can find the trained model in the output_dir, for example './exp/ullava_core_7b'.

Stage II: Fine-tuning

After Stage I training has finished, we can move on to the next step: fine-tuning.
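
The fine-tuning config needs to know where the Stage I checkpoint lives. A purely illustrative fragment, assuming a hypothetical key name; the real schema is defined in configs/train/ullava_stage2_lora.yaml and configs/train/ullava_stage2.yaml:

```yaml
# Illustrative only: the key name below is a hypothetical placeholder,
# not the actual schema of the Stage II configs.
model:
  ullava_core_path: './exp/ullava_core_7b'   # the Stage I output_dir from the previous step
```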

  1. Prepare datasets
  2. Set the config in
configs/train/ullava_stage2_lora.yaml (for LoRA)
configs/train/ullava_stage2.yaml (for non-LoRA)
  3. Train Stage II with multiple GPUs
./shells/finetune.sh

or python train_ullava.py --cfg_path './configs/train/ullava_stage2_lora.yaml' for 1 GPU.

Common Questions

Q1: Which conv_type is used in training?

A1: Stage I: 'conv_simple'. Stage II: 'conv_sep2'.

Q2: When is LoRA used?

A2: Stage I: LoRA is not used. Stage II: it depends on your devices.
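
For reference, a hypothetical sketch of how these settings might appear in a training config; the placement and surrounding structure are assumptions, so check the files under configs/train/ for where conv_type is actually set:

```yaml
# Hypothetical placement of the conversation-template setting.
conv_type: 'conv_simple'   # Stage I
# conv_type: 'conv_sep2'   # Stage II
```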

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- Evaluation -->

Evaluation

Batch evaluation

  1. Set the config
configs/eval/eval_res.yaml (for the RES task)
configs/eval/eval_rec.yaml (for the REC task)
configs/eval/eval_salient.yaml (for the salient segmentation task)
  2. Run
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_res.yaml' (for RES)
python evaluation/eval_ullava_grounding.py --cfg_path './configs/eval/eval_rec.yaml' (for REC)
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_salient.yaml' (for salient segmentation)
<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- Qualitative Evaluation -->

Qualitative inference

Modify the argument parser in evaluation/inference_ullava_core.py and evaluation/inference_ullava.py for Stage I and Stage II, respectively.

python evaluation/inference_ullava_core.py
python evaluation/inference_ullava.py
<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- LICENSE -->

License

Distributed under the Apache License. See LICENSE for more information.

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- Citation -->

Citation

@inproceedings{xu2024ullava,
  title={u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model},
  author={Xu, Jinjin and Xu, Liwu and Yang, Yuzhe and Li, Xiang and Wang, Fanyi and Xie, Yanchun and Huang, Yi-Jie and Li, Yaqian},
  booktitle={Proceedings of the 27th European Conference on Artificial Intelligence},
  year={2024}
}
<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- TODO -->

TODO

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- ACKNOWLEDGMENTS -->

Acknowledgments

We sincerely thank the open-source community for their contributions. This work is sponsored by the Shanghai Pujiang Program (23PJ1421800).

<p align="right">(<a href="#readme-top">back to top</a>)</p>

See the open issues for a full list of proposed features (and known issues).

<p align="right">(<a href="#readme-top">back to top</a>)</p>