# Welcome to Griffon

This is the official repository for the Griffon series (v1, v2, and G). Griffon is the first high-resolution (over 1K) LVLM capable of performing fine-grained visual perception tasks, such as object detection and counting. In its latest version, Griffon integrates vision-language and vision-centric tasks within a unified end-to-end framework. You can interact with Griffon and request it to complete various tasks. The model is continuously evolving towards greater intelligence to handle more complex scenarios. Feel free to follow Griffon and reach out to us by raising an issue.
- Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models (Latest)
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
- Griffon: Spelling out All Object Locations at Any Granularity with Large Language Model
## Release
- 2024.11.26 🔥 We are glad to release the inference code and model of Griffon-G in 🤗 Griffon-G. Training code will be released later.
- 2024.07.01 🔥 Griffon has been accepted to ECCV 2024. Data is released in 🤗 HuggingFace.
- 2024.03.11 🔥 We are excited to announce the arrival of Griffon v2. Griffon v2 brings fine-grained perception performance to new heights with high-resolution expert-level detection and counting, and supports visual-language co-referring. Take a look at our demo first. The paper is available as a preprint on 📕 arXiv.
- 2023.12.06 🔥 Release of the Griffon v1 inference code and model in 🤗 HuggingFace.
- 2023.11.29 🔥 The Griffon v1 paper has been released on 📕 arXiv.
## What can Griffon do now?
Griffon-G demonstrates advanced performance across multimodal benchmarks, general VQAs, and text-rich VQAs, achieving new state-of-the-art results in REC and object detection. More quantitative evaluation results can be found in our paper.
## Get Started
### 1. Clone & Install

```bash
git clone git@github.com:jefferyZhan/Griffon.git
cd Griffon
pip install -e .
```
Tips: If you encounter any errors while installing the packages, you can download the corresponding source files (*.whl), which we have verified, and install them locally.
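For example, a locally downloaded wheel can be installed directly with pip (the filename below is only illustrative, not a specific dependency of Griffon):

```bash
# Install a pre-downloaded wheel file instead of fetching it from PyPI
# (the package name and version here are placeholders)
pip install ./downloads/example_package-1.0.0-py3-none-any.whl
```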
### 2. Download the Griffon and CLIP models to the `checkpoints` folder.
| Model | Links |
|---|---|
| Griffon-G-9B | 🤗 HuggingFace |
| Griffon-G-27B | 🤗 HuggingFace |
| clip-vit-large-patch14 | 🤗 HuggingFace |
| clip-vit-large-patch14-336_to_1022 | 🤗 HuggingFace |
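As one possible way to fetch the checkpoints, the Hugging Face CLI can download a repository into the `checkpoints` folder. The Griffon-G repository id below is a placeholder; use the repositories linked in the table above.

```bash
# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
# <GRIFFON_G_REPO_ID> is a placeholder; substitute the repository linked in the table.
huggingface-cli download <GRIFFON_G_REPO_ID> --local-dir checkpoints/Griffon-G-9B

# The base CLIP vision tower (public OpenAI checkpoint)
huggingface-cli download openai/clip-vit-large-patch14 --local-dir checkpoints/clip-vit-large-patch14
```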
### 3. Inference
```bash
# 3.1 Modify the instruction in run_inference.sh.

# 3.2.1 WITHOUT a visual prompt
bash run_inference.sh [CUDA_ID] [CHECKPOINTS_PATH] [IMAGE_PATH]

# 3.2.2 WITH a visual prompt for COUNTING: pass the query image and the prompt image
#       separated by a comma, and include the <region> placeholder in the instruction
bash run_inference.sh [CUDA_ID] [CHECKPOINTS_PATH] [IMAGE_PATH,PROMPT_PATH]
```
Notice: Pay attention to singular and plural forms when naming objects in the instruction.
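A concrete invocation might look like the following sketch; the GPU id, checkpoint path, and image paths are illustrative placeholders, not files shipped with the repository:

```bash
# Detection without a visual prompt (paths are illustrative)
bash run_inference.sh 0 checkpoints/Griffon-G-9B demo/street.jpg

# Counting with a visual prompt: query image and prompt crop separated by a comma;
# the instruction in run_inference.sh should contain the <region> placeholder
bash run_inference.sh 0 checkpoints/Griffon-G-9B demo/crowd.jpg,demo/person_crop.jpg
```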
## Acknowledgement
- LLaVA provides the base code and pre-trained models.
- Shikra provides insight into how to organize datasets and some base processed annotations.
- Llama provides the large language model.
- Gemma2 provides the large language model.
- volgachen provides the basic environment setting config.
## Citation
If you find Griffon useful for your research and applications, please cite using this BibTeX:
```bibtex
@inproceedings{zhan2025griffonv1,
  title={Griffon: Spelling out all object locations at any granularity with large language models},
  author={Zhan, Yufei and Zhu, Yousong and Chen, Zhiyang and Yang, Fan and Tang, Ming and Wang, Jinqiao},
  booktitle={European Conference on Computer Vision},
  pages={405--422},
  year={2025},
  organization={Springer}
}

@misc{zhan2024griffonv2,
  title={Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring},
  author={Yufei Zhan and Yousong Zhu and Hongyin Zhao and Fan Yang and Ming Tang and Jinqiao Wang},
  year={2024},
  eprint={2403.09333},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@article{zhan2024griffon-G,
  title={Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models},
  author={Zhan, Yufei and Zhao, Hongyin and Zhu, Yousong and Yang, Fan and Tang, Ming and Wang, Jinqiao},
  journal={arXiv preprint arXiv:2410.16163},
  year={2024}
}
```
## License
The data and checkpoints are licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Gemma2, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.