LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
[Project Page] [arXiv] [Demo] [Model Zoo]
:fire: News
[2024/1/14] Our training code is released.
[2023/12/6] Our paper is available on arXiv.
Contents
- Install
- LLaVA-Grounding Weights
- Demo
- Training data
- Training
- Citation
Install
- Clone this repository and navigate to the LLaVA-Grounding folder:
git clone https://github.com/UX-Decoder/LLaVA-Grounding.git
cd LLaVA-Grounding
- Install required packages:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- Install packages necessary for OpenSeeD and Semantic-SAM.
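The exact commands depend on how you obtain OpenSeeD and Semantic-SAM; below is a minimal sketch, assuming you clone both official repositories next to this one and that each provides a requirements.txt at its root (the sibling paths are illustrative, not part of this repository):

# hypothetical layout: OpenSeeD and Semantic-SAM cloned as siblings of LLaVA-Grounding
git clone https://github.com/IDEA-Research/OpenSeeD.git ../OpenSeeD
git clone https://github.com/UX-Decoder/Semantic-SAM.git ../Semantic-SAM
pip install -r ../OpenSeeD/requirements.txt       # assumption: requirements.txt exists at the repo root
pip install -r ../Semantic-SAM/requirements.txt   # assumption: requirements.txt exists at the repo root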
LLaVA-Grounding Weights
Please check out our Model Zoo for all public LLaVA-Grounding checkpoints, and the instructions on how to use the weights.
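If the checkpoint you pick from the Model Zoo is hosted on the Hugging Face Hub, one way to fetch it is with huggingface-cli; the repository id below is a placeholder, so substitute the id listed in the Model Zoo:

pip install -U "huggingface_hub[cli]"
# <model-zoo-repo-id> is a placeholder for the repository id given in the Model Zoo
huggingface-cli download <model-zoo-repo-id> --local-dir checkpoints/llava_grounding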
Demo
After downloading the model weights, simply run the following commands to launch the demo on your own machine.
CUDA_VISIBLE_DEVICES=0 python gradio_demo/LLaVA_G_Demo.py --path_vision_cfg path_to_vision_cfg --path_inter_cfg path_to_inter_cfg --model_path path_to_ckpt_dir
# for example, after downloading weights into checkpoints/llava_grounding
CUDA_VISIBLE_DEVICES=0 python gradio_demo/LLaVA_G_Demo.py --path_vision_cfg configs/openseed/openseed_swint_lang_joint_2st_visual_prompt.yaml --path_inter_cfg configs/semsam/visual_prompt_encoder.yaml --model_path checkpoints/llava_grounding
Please refer to our Online Demo for more detailed usage guidance.
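The demo is a standard Gradio app, so it binds to http://127.0.0.1:7860 by default; if you run it on a remote machine you can override the host and port with Gradio's generic environment variables before launching (these are Gradio defaults, not options specific to this demo):

# optional: expose the Gradio demo on all interfaces and a custom port, then launch as above
export GRADIO_SERVER_NAME=0.0.0.0
export GRADIO_SERVER_PORT=7860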
Training data
data
├── flickr30k_entities
│   ├── train/
│   ├── val/
│   └── annotations
│       ├── final_flickr_separateGT_train.json
│       └── final_flickr_separateGT_val.json
├── coco
│   ├── train2014/
│   ├── train2017/
│   ├── panoptic_train2017/
│   ├── panoptic_semseg_train2017/
│   └── annotations
│       ├── instances_train2017.json
│       ├── instances_train2017_gvc.json
│       ├── grounded_visual_chat_data.json
│       ├── instances_train2014_filter.json
│       ├── panoptic_train2017_filter.json
│       └── grounding_train2017.json
└── llava
    └── annotations
        ├── cap600k_brackets_all.json
        ├── llava_instruct_150k.json
        └── llava_instruct_150k_visual_prompt.json
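If you assemble the data manually, one way to create this skeleton up front is with bash brace expansion (the directory names simply mirror the tree above; nothing here downloads any data):

mkdir -p data/flickr30k_entities/{train,val,annotations}
mkdir -p data/coco/{train2014,train2017,panoptic_train2017,panoptic_semseg_train2017,annotations}
mkdir -p data/llava/annotations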
Flickr30k
Please refer to MDETR's pre-processed flickr30k data.
COCO
Please download the COCO train2014 and train2017 images, along with the panoptic and semantic segmentation data. The other annotations can be downloaded here.
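For the images, instance annotations, and panoptic annotations, the standard cocodataset.org download links can be used as sketched below; this does not cover panoptic_semseg_train2017/ (usually derived from the panoptic annotations by a preparation script) or the custom GVC/filtered/grounding JSON files, which come from the link above:

cd data/coco
wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
wget http://images.cocodataset.org/zips/train2017.zip && unzip train2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip && unzip annotations_trainval2017.zip
# the panoptic PNGs are nested inside the panoptic annotations archive (annotations/panoptic_train2017.zip)
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip && unzip panoptic_annotations_trainval2017.zip
cd ../..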
LLaVA
The processed annotations can be downloaded here.
Training
Stage 1
bash scripts/pretrain_joint.sh
Stage 2
bash scripts/finetune.sh
Stage 3
bash scripts/finetune_visual_prompt.sh
Citation
If you find LLaVA-Grounding useful for your research and applications, please cite using this BibTeX:
@misc{zhang2023llavagrounding,
      title={LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models},
      author={Hao Zhang and Hongyang Li and Feng Li and Tianhe Ren and Xueyan Zou and Shilong Liu and Shijia Huang and Jianfeng Gao and Lei Zhang and Chunyuan Li and Jianwei Yang},
      year={2023},
      booktitle={arXiv}
}

@misc{liu2023llava,
      title={Visual Instruction Tuning},
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={arXiv:2304.08485},
      year={2023}
}