TextHawk: 🥇 LVLM with 16x Compression Ratio

<a href="https://arxiv.org/abs/2410.05261" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/TextHawk2-arXiv/2410.05261-red?logo=arxiv"/></a> <a href="https://arxiv.org/abs/2404.09204" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/TextHawk-arXiv/2404.09204-red?logo=arxiv"/></a> <a href="https://zhuanlan.zhihu.com/p/939288220" target="_blank"><img alt="ZhiHu" src="https://img.shields.io/badge/TextHawk2-ZhiHu-1E90FF?logo=zhihu&logoColor=02B5FD"/></a>

Base Models

TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

GUI Agents

UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents

Introduction

The TextHawk series is a family of Large Vision-Language Models (LVLMs) designed for highly efficient fine-grained perception. Notably, TextHawk is the first LVLM to achieve a 16x token compression ratio. This is made possible by four key components:

- a ReSampling and ReArrangement (ReSA) module that reduces redundancy in the visual tokens and lowers computational cost;
- Scalable Positional Embeddings (SPEs) that encode the positions of local features while scaling to varying image sizes;
- a Query Proposal Network (QPN) that initializes the resampling queries dynamically across sub-images;
- a Multi-Level Cross-Attention (MLCA) mechanism that captures hierarchical structure and semantic relations for fine-grained perception.

[Figure: TextHawk architecture]
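
As a rough sketch of how a 16x ratio can be realized, the toy module below uses cross-attention to pool every group of 16 visual patch tokens into a single token before they reach the language model. This is a minimal illustration written against PyTorch, not the released ReSA implementation; the class name `Resampler16x` and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn


class Resampler16x(nn.Module):
    """Toy cross-attention resampler: every 16 patch tokens -> 1 token.

    Illustrative sketch only; TextHawk's actual ReSA module also rearranges
    tokens and is trained jointly with the rest of the model.
    """

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # One learnable query shared across all groups of 16 visual tokens.
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim), num_patches divisible by 16.
        b, n, d = patches.shape
        groups = patches.reshape(b * n // 16, 16, d)        # split into groups of 16
        queries = self.query.expand(groups.shape[0], -1, -1)
        pooled, _ = self.attn(queries, groups, groups)      # (groups, 1, dim)
        return self.proj(pooled).reshape(b, n // 16, d)     # 16x fewer tokens


if __name__ == "__main__":
    feats = torch.randn(2, 784, 1024)    # e.g. a 28x28 patch grid per image
    print(Resampler16x()(feats).shape)   # torch.Size([2, 49, 1024])
```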

Building on the same architecture, TextHawk2 enhances performance by leveraging greater data diversity and reinforcing the visual encoder. This iteration achieves state-of-the-art results across multiple benchmarks, excelling in tasks related to general multimodal understanding, Optical Character Recognition (OCR), and visual grounding.

For instance, TextHawk2 achieves 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy@0.5 on RefCOCOg-test.

[Figure: token compression]

The TextHawk series can pack several times more words from a small image, where each character measures under 8 pixels, into just a few tokens, and still recover them accurately. It is reminiscent of the futuristic gadgets in the Doraemon anime.
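
For a back-of-the-envelope sense of what 16x means in token counts, the snippet below works through one example; the image resolution and ViT patch size are assumptions for illustration, not values taken from the papers.

```python
# Illustrative arithmetic only; every number below is an assumption.
image_size = 448                      # assumed input resolution per (sub-)image
patch_size = 16                       # assumed ViT patch size
compression = 16                      # compression ratio reported for the TextHawk series

patch_tokens = (image_size // patch_size) ** 2   # 28 x 28 = 784 visual patch tokens
llm_tokens = patch_tokens // compression         # 784 // 16 = 49 tokens reach the LLM

print(f"{patch_tokens} patch tokens -> {llm_tokens} LLM tokens")
```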

[Figure: examples]

DocGemini

We create DocGemini, a new instruction-tuning dataset for document-oriented tasks, by enriching multimodal document data with Gemini Pro. Each data sample contains:

- a brief summary of the document topics;
- short QA pairs;
- insights behind each answer.

DocGemini consists of 30K images and 195K QA pairs with insights.
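
To make the sample structure concrete, here is one possible way to lay out a record and flatten it into chat turns for instruction tuning; the field names, `DocGeminiSample`, and `to_conversation` are hypothetical and do not reflect the released DocGemini format.

```python
from dataclasses import dataclass, field


@dataclass
class DocGeminiSample:
    """Hypothetical record layout for one DocGemini sample (field names are ours)."""
    image: str                                                        # path to the document image
    summary: str                                                      # brief summary of the document topics
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)     # (question, answer)
    insights: list[str] = field(default_factory=list)                 # reasoning behind each answer


def to_conversation(sample: DocGeminiSample) -> list[dict]:
    """Flatten a sample into chat turns for instruction tuning (illustrative only)."""
    turns = [{"role": "system", "content": sample.summary}]
    for (question, answer), insight in zip(sample.qa_pairs, sample.insights):
        turns.append({"role": "user", "content": question})
        turns.append({"role": "assistant", "content": f"{answer} ({insight})"})
    return turns


# Example usage with a made-up sample.
sample = DocGeminiSample(
    image="docvqa/00001.png",
    summary="An invoice listing items, quantities, and totals.",
    qa_pairs=[("What is the total amount due?", "$128.40")],
    insights=["The total appears in the bottom-right cell of the table."],
)
print(to_conversation(sample))
```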

| Dataset | QA | Conversation |
| :-- | :-: | :-: |
| DocVQA | link | link |
| ChartQA | link | link |
| InfoVQA | link | link |

Note: Alternatively, you can produce data on your own using the scripts we provide.

Benchmarks

[Figure: OCR benchmarks]

[Figure: grounding benchmarks]

[Figure: comparison with proprietary models]

<details>
<summary>TextHawk</summary>

| Model | ViT<br>(Params.) | MME<br>perception | MMB<br>dev | SEED<br>image | GQA | DocVQA | ChartQA | InfoVQA | TabFact | WTQ | RefCOCO<br>val | RefCOCO<br>test-A | RefCOCO<br>test-B |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| $\text{Donut}$ | $\text{Swin-B}$<br>(0.1B) | - | - | - | - | 67.5 | 41.8 | 11.6 | 54.6 | 18.8 | - | - | - |
| $\text{Pix2Struct}$ | - | - | - | - | - | 76.6 | 58.6 | 40.0 | - | - | - | - | - |
| $\text{InternLM-XC}$ | $\text{EVA-G}$<br>(1B) | 1528.4 | 74.8 | 66.1 | - | - | - | - | - | - | - | - | - |
| $\text{LLaVA-1.5-7B}$ | $\text{CLIP-L}$<br>(0.3B) | 1510.7 | 65.2 | - | 62.0 | - | - | - | - | - | - | - | - |
| $\text{Shikra-7B}$ | $\text{CLIP-L}$<br>(0.3B) | - | 58.8 | - | - | - | - | - | - | - | 87.0 | <ins>91.1</ins> | 81.8 |
| $\text{Qwen-VL-Chat}$ | $\text{CLIP-G}$<br>(2B) | 1487.6 | 60.6 | 65.4 | 57.5 | 62.6 | 66.3 | - | - | - | 88.6 | 92.3 | 84.5 |
| $\text{Monkey}$ | $\text{CLIP-G}$<br>(2B) | - | 59.3 | - | 60.7 | 66.5 | 65.1 | 36.1 | - | 25.3 | - | - | - |
| $\text{UReader}$ | $\text{CLIP-L}$<br>(0.3B) | - | - | - | - | 65.4 | 59.3 | 42.2 | 67.6 | 29.4 | - | - | - |
| $\text{TextMonkey}$ | $\text{CLIP-G}$<br>(2B) | - | - | - | - | 73.0 | 66.9 | - | - | 31.9 | - | - | - |
| $\textbf{TextHawk}^*$ | $\text{SigLIP-SO}$<br>(0.4B) | <ins>1520.9</ins> | 73.0 | 69.2 | 64.7 | <ins>73.6</ins> | 64.0 | <ins>47.3</ins> | <ins>70.7</ins> | <ins>33.5</ins> | <ins>87.3</ins> | 90.9 | <ins>83.3</ins> |
| $\textbf{TextHawk}$ | $\text{SigLIP-SO}$<br>(0.4B) | 1500.0 | <ins>74.6</ins> | 69.2 | <ins>64.6</ins> | 76.4 | <ins>66.6</ins> | 50.6 | 71.1 | 34.7 | 87.2 | 90.8 | 82.5 |

Note: $\textbf{TextHawk}^*$ is fine-tuned without DocGemini.

</details>

Visualization

[Figure: markdown conversion example]

[Figure: referring expression grounding example]

BibTeX

@article{yu24texthawk2,
  author       = {Ya{-}Qi Yu and Minghui Liao and Jiwen Zhang and Jihao Wu},
  title        = {TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens},
  journal      = {CoRR},
  volume       = {abs/2410.05261},
  year         = {2024}
}
@article{yu24texthawk,
  author       = {Ya{-}Qi Yu and Minghui Liao and Jihao Wu and Yongxin Liao and Xiaoyu Zheng and Wei Zeng},
  title        = {TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2404.09204},
  year         = {2024}
}
@article{zhang24uihawk,
  title        = {{UI-Hawk}: Unleashing the Screen Stream Understanding for GUI Agents},
  author       = {Jiwen Zhang and Ya{-}Qi Yu and Minghui Liao and Wentao Li and Jihao Wu and Zhongyu Wei},
  journal      = {Preprints},
  volume       = {manuscript/202408.2137},
  year         = {2024}
}