Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Shiyu Xuan, Qingpei Guo, Ming Yang, Shiliang Zhang
CVPR 2024
Pink Weights
- Base: Pink_Base
- Base_Object365: Pink_Object365
- Base_RefCOCO: Pink_Refcoco
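If the checkpoint repositories above are hosted on the Hugging Face Hub, they can also be fetched programmatically with the huggingface_hub package. A minimal sketch; the repo id below is a placeholder for illustration, so substitute the actual id behind the links above:
```python
# Sketch: download a Pink checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

# NOTE: "SY-Xuan/Pink_base" is a hypothetical repo id used for illustration;
# replace it with the repository id behind the links above.
checkpoint_dir = snapshot_download(
    repo_id="SY-Xuan/Pink_base",
    local_dir="./checkpoints/pink_base",
)
print("Checkpoint files downloaded to", checkpoint_dir)
```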
Data Download
Pretraining Dataset
The pretraining dataset used in this release is the same one used in LLaVA, which is a subset of the CC-3M dataset. Please see here for a detailed description of the dataset structure and how to download the images.
Instruction Tuning Dataset
The following datasets need to be downloaded manually:
- COCO: train2017
- VisualGenome: part1, part2, objects, relationships, region descriptions
- Object365: Object365
- A-OKVQA: A-OKVQA
- LLaVA-158K: LLaVA-158K
We also provide the converted dataset used for instruction tuning:
https://huggingface.co/datasets/SY-Xuan/Pink_sft/
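The converted data can be pulled the same way as the checkpoints. A minimal sketch, assuming you want the entire dataset repository on local disk (the local_dir path is only an example):
```python
# Sketch: fetch the converted instruction-tuning data (SY-Xuan/Pink_sft).
from huggingface_hub import snapshot_download

sft_dir = snapshot_download(
    repo_id="SY-Xuan/Pink_sft",
    repo_type="dataset",
    local_dir="./data/pink_sft",  # example path, adjust to your setup
)
print("Instruction-tuning data downloaded to", sft_dir)
```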
LLaMA2 Weight Download
Our model is based on Llama-2-7b-chat-hf. You need to download the weights manually.
- Llama-2-7b-chat-hf: Llama-2-7b-chat-hf
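The Llama-2 weights are gated on the Hugging Face Hub, so you have to accept Meta's license for the model and authenticate before downloading. A minimal sketch, assuming the weights are pulled from the official meta-llama/Llama-2-7b-chat-hf repository with a valid access token:
```python
# Sketch: download the gated Llama-2-7b-chat-hf weights after accepting Meta's license.
from huggingface_hub import login, snapshot_download

login()  # prompts for an access token; alternatively pass login(token="hf_...")

llama_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="./checkpoints/Llama-2-7b-chat-hf",  # example path
)
print("Llama-2 weights downloaded to", llama_dir)
```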
Install
- Install Package
```shell
conda create -n pink python=3.10 -y
conda activate pink
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
Training
Stage 1
bash scripts/stage1.sh
Stage 2
bash scripts/stage2.sh
Stage 2 with Object365
bash scripts/stage2_with_object365.sh
Self-consistent Bootstrapping
We first convert the *.json annotations of Object365. Please refer to dataset_generation/object365_detection.py.
Bootstrapping
bash scripts/object365_generate.sh
Self-consistency Filtering
Please refer to pink/eval/object365_filter.py.
Evaluation
Please refer to inference.ipynb and scripts/eval_refcoco.sh.
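For RefCOCO-style referential comprehension, the standard metric is accuracy at IoU ≥ 0.5 between the predicted and ground-truth boxes. The sketch below shows only that computation; it is independent of scripts/eval_refcoco.sh, and the (x1, y1, x2, y2) box format is an assumption:
```python
# Sketch: Acc@0.5 for referring expression comprehension.
# Boxes are assumed to be (x1, y1, x2, y2) in the same coordinate system.
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def iou(a: Box, b: Box) -> float:
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_05(preds: List[Box], gts: List[Box]) -> float:
    # Fraction of predictions whose IoU with the ground truth is at least 0.5.
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(preds, gts))
    return hits / len(gts) if gts else 0.0

# Example: the first prediction passes the 0.5 threshold, the second does not.
print(acc_at_05([(10, 10, 50, 50), (0, 0, 5, 5)],
                [(12, 12, 48, 52), (30, 30, 60, 60)]))  # -> 0.5
```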
Demo
To launch a Gradio web demo, use the following command.
python demo.py --checkpoint-path /path/to/pink --llama-path /path/to/llama2
Citation
If you find Pink useful for your research and applications, please cite using this BibTeX:
```bibtex
@InProceedings{Xuan_2024_CVPR,
    author    = {Xuan, Shiyu and Guo, Qingpei and Yang, Ming and Zhang, Shiliang},
    title     = {Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13838-13848}
}
```
Acknowledgement
This code inherits some code from LLaVA and Shikra. Thanks for these outstanding implementations.
Contact me
If you have any questions about this code or paper, feel free to contact me at shiyu_xuan@stu.pku.edu.cn.
Related Projects
LocLLM: We leverage an LLM for human keypoint localization. LocLLM shows remarkable performance on standard 2D/3D keypoint localization benchmarks. Moreover, incorporating language clues into the localization gives LocLLM superior flexibility and generalization in cross-dataset keypoint localization, and even enables it to detect novel types of keypoints unseen during training.
Ant-Multi-Modal-Framework: This repository contains code for multi-modality learning from the Multimodal Cognition group of Ant Group that has been integrated into AntMMF.