
πŸŒ‹ LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

[Project Page] [arXiv] [Demo] [Model Zoo]


πŸ”₯ News

[2024/1/14] Our training code is released.

[2023/12/6] Our paper is available on arXiv.

Contents

Install

  1. Clone this repository and navigate to the LLaVA-Grounding folder:
git clone https://github.com/UX-Decoder/LLaVA-Grounding.git
cd LLaVA-Grounding
  2. Install the required packages:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
  4. Install the packages required by OpenSeeD and Semantic-SAM (a hedged install sketch follows this list).
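
A minimal sketch for step 4, assuming the standard OpenSeeD and Semantic-SAM repositories and that each ships a requirements.txt; please check their READMEs for any extra build steps (e.g. compiling custom CUDA ops).

# Assumption: both projects provide a requirements.txt; consult their READMEs for additional steps.
git clone https://github.com/IDEA-Research/OpenSeeD.git
pip install -r OpenSeeD/requirements.txt

git clone https://github.com/UX-Decoder/Semantic-SAM.git
pip install -r Semantic-SAM/requirements.txt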

LLaVA-Grounding Weights

Please check out our Model Zoo for all public LLaVA-Grounding checkpoints, and the instructions on how to use the weights.
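
As a hedged sketch, a checkpoint can be pulled from the Hugging Face Hub into the directory the demo command below expects; the repository id `ORG/llava-grounding` is a placeholder, so substitute the id listed in the Model Zoo.

# The repo id ORG/llava-grounding is a placeholder -- replace it with the entry from the Model Zoo.
pip install -U "huggingface_hub[cli]"
huggingface-cli download ORG/llava-grounding --local-dir checkpoints/llava_grounding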

Demo

After downloading the model weights, simply run the following commands to launch the demo on your own machine.

CUDA_VISIBLE_DEVICES=0 python gradio_demo/LLaVA_G_Demo.py --path_vision_cfg path_to_vision_cfg --path_inter_cfg path_to_inter_cfg --model_path path_to_ckpt_dir

# for example, after downloading weights into checkpoints/llava_grounding
CUDA_VISIBLE_DEVICES=0 python gradio_demo/LLaVA_G_Demo.py --path_vision_cfg configs/openseed/openseed_swint_lang_joint_2st_visual_prompt.yaml --path_inter_cfg configs/semsam/visual_prompt_encoder.yaml --model_path checkpoints/llava_grounding

Please refer to our Online Demo for more detailed usage guidance.
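
The command above launches a Gradio app locally. Assuming the script keeps Gradio's default server settings (a hedged assumption; the console output prints the exact address), the UI is served on port 7860:

# Assumes Gradio's default port 7860; check the script's console output for the actual URL.
xdg-open http://127.0.0.1:7860   # Linux; on macOS use `open`, or paste the URL into a browser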

Training data

data
β”œβ”€β”€ flickr30k_entities
β”‚   β”œβ”€β”€ train/
β”‚   β”œβ”€β”€ val/
β”‚   └── annotations
β”‚       β”œβ”€β”€ final_flickr_separateGT_train.json
β”‚       └── final_flickr_separateGT_val.json
β”œβ”€β”€ coco
β”‚   β”œβ”€β”€ train2014/
β”‚   β”œβ”€β”€ train2017/
β”‚   β”œβ”€β”€ panoptic_train2017/
β”‚   β”œβ”€β”€ panoptic_semseg_train2017/
β”‚   └── annotations
β”‚       β”œβ”€β”€ instances_train2017.json
β”‚       β”œβ”€β”€ instances_train2017_gvc.json
β”‚       β”œβ”€β”€ grounded_visual_chat_data.json
β”‚       β”œβ”€β”€ instances_train2014_filter.json
β”‚       β”œβ”€β”€ panoptic_train2017_filter.json
β”‚       └── grounding_train2017.json
└── llava
    └── annotations
        β”œβ”€β”€ cap600k_brackets_all.json
        β”œβ”€β”€ llava_instruct_150k.json
        └── llava_instruct_150k_visual_prompt.json

Flickr30k

Please refer to MDETR's pre-processed flickr30k data.

COCO

Please download the COCO train2014 and train2017 images, together with the panoptic segmentation and semantic segmentation data. The other annotations can be downloaded here.
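
A hedged download sketch using the standard cocodataset.org URLs (the panoptic zip nests per-split PNG archives, and panoptic_semseg_train2017/ usually has to be generated from the panoptic data, so please double-check against the repository's data instructions):

mkdir -p data/coco && cd data/coco

# Images
wget http://images.cocodataset.org/zips/train2014.zip && unzip -q train2014.zip
wget http://images.cocodataset.org/zips/train2017.zip && unzip -q train2017.zip

# Instance annotations (provides annotations/instances_train2017.json)
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip && unzip -q annotations_trainval2017.zip

# Panoptic annotations; the zip nests a per-split PNG archive, so unzip that as well
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip && unzip -q panoptic_annotations_trainval2017.zip
unzip -q annotations/panoptic_train2017.zip   # creates panoptic_train2017/

cd ../..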

LLaVA

The processed annotations can be downloaded here.
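
Before training, a quick sanity check over the layout above can save a failed run; the file paths are taken verbatim from the tree, so extend the list as needed:

# Verify that key annotation files from the layout above exist before launching training.
for f in \
  data/flickr30k_entities/annotations/final_flickr_separateGT_train.json \
  data/coco/annotations/instances_train2017_gvc.json \
  data/coco/annotations/grounded_visual_chat_data.json \
  data/coco/annotations/grounding_train2017.json \
  data/llava/annotations/llava_instruct_150k_visual_prompt.json
do
  [ -f "$f" ] && echo "OK      $f" || echo "MISSING $f"
done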

Training

Stage 1

bash scripts/pretrain_joint.py

Stage 2

bash scripts/finetune.py

Stage 3

bash scripts/finetune_visual_prompt.py
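
As a usage sketch, the three stages can be chained on a fixed GPU set by restricting visibility with CUDA_VISIBLE_DEVICES, as in the demo command; the script names are exactly those above, and the GPU list should be adjusted to your machine.

# Run the stages sequentially on GPUs 0-3; abort if any stage fails.
set -e
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash scripts/pretrain_joint.py
bash scripts/finetune.py
bash scripts/finetune_visual_prompt.py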

Citation

If you find LLaVA-Grounding useful for your research and applications, please cite using this BibTeX:


@misc{zhang2023llavagrounding,
      title={LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models},
      author={Hao Zhang and Hongyang Li and Feng Li and Tianhe Ren and Xueyan Zou and Shilong Liu and Shijia Huang and Jianfeng Gao and Lei Zhang and Chunyuan Li and Jianwei Yang},
      year={2023},
      publisher={arXiv}
}

@misc{liu2023llava,
      title={Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={arXiv:2304.08485},
      year={2023}
}