# LEOPARD <img src="figures/leopard.png" alt="" width="28" height="28">: A Vision Language Model for Text-Rich Multi-Image Tasks

<center><img src="figures/intro.png" alt="Overview of text-rich multi-image tasks handled by Leopard"></center>

This is the repository for Leopard, a multimodal large language model (MLLM) specifically designed for complex vision-language tasks involving multiple text-rich images. Such tasks are common in real-world applications, such as presentation slides, scanned documents, and webpage snapshots, where understanding the inter-relationships and logical flow across multiple images is crucial.

The code, data, and model checkpoints will be released in one month. Stay tuned!

<p align="center"> <a href="https://arxiv.org/abs/2410.01744"> <img style="height:22pt" src="https://img.shields.io/badge/-Paper-black?style=flat&logo=arxiv"></a> <a href="https://github.com/tencent-ailab/Leopard"> <img style="height:22pt" src="https://img.shields.io/badge/-Code-green?style=flat&logo=github"></a> <a href="https://huggingface.co/datasets/wyu1/Leopard-Instruct"> <img style="height:22pt" src="https://img.shields.io/badge/-🤗%20Dataset-red?style=flat"></a> <a href="https://huggingface.co/wyu1/Leopard-LLaVA"><img style="height:22pt" src="https://img.shields.io/badge/-🤗%20Models-red?style=flat"></a> </p>

## Updates

## Key Features

<center><img src="figures/model.png" alt="Leopard model architecture"></center>

## Evaluation

For evaluation, please refer to the Evaluations section.

## Model Zoo

We provide the checkpoints of Leopard-LLaVA and Leopard-Idefics2 on Hugging Face.
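Since the official inference code has not been released yet, the snippet below is only a minimal sketch for fetching the checkpoint files with the `huggingface_hub` library. The repository ID `wyu1/Leopard-LLaVA` is taken from the model badge above; the local directory name is arbitrary.

```python
def download_leopard_llava(local_dir: str = "leopard-llava") -> str:
    """Fetch the Leopard-LLaVA checkpoint files from the Hugging Face Hub.

    Requires `pip install huggingface_hub` and downloads several GB of
    model weights, so the import is kept inside the function.
    """
    from huggingface_hub import snapshot_download

    # snapshot_download returns the path to the local copy of the repo.
    return snapshot_download(repo_id="wyu1/Leopard-LLaVA", local_dir=local_dir)
```

Once the official code is out, the downloaded directory can be passed to the released loading utilities.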