<div align=center>

DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

</div>

This repository reimplements two key modules of DocKylin: Adaptive Pixel Slimming (APS) and Dynamic Token Slimming (DTS). Company policy prevents us from open-sourcing the original DocKylin code, so this reimplementation may differ slightly from it.

<p align="center"> <img src="img/model.png" width='500'> </p>

Adaptive Pixel Slimming (APS)

python aps.py --im_path 'demo/' --resize --visualize
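For readers who want a feel for what pixel slimming means before running the script, here is a minimal sketch. It is not the code in `aps.py`. It assumes that "redundant" rows and columns are those with no strong intensity change, and it uses simple forward differences in place of a proper gradient operator. The function name `slim_pixels`, the list-of-lists input format, and the `threshold` value are all hypothetical.

```python
def slim_pixels(gray, threshold=10):
    """Drop rows and columns that contain no strong intensity change.

    gray: list of rows, each a list of 0-255 intensities (hypothetical format).
    threshold: minimum absolute neighbour difference that counts as content
               (assumed value, not taken from the repository).
    """
    height, width = len(gray), len(gray[0])
    keep_row = [False] * height
    keep_col = [False] * width
    for y in range(height):
        for x in range(width):
            # Forward differences stand in for a real gradient operator.
            dx = abs(gray[y][x + 1] - gray[y][x]) if x + 1 < width else 0
            dy = abs(gray[y + 1][x] - gray[y][x]) if y + 1 < height else 0
            if max(dx, dy) > threshold:
                keep_row[y] = True
                keep_col[x] = True
    # Keep only the pixels that lie on a retained row and a retained column.
    return [[gray[y][x] for x in range(width) if keep_col[x]]
            for y in range(height) if keep_row[y]]


# A 5x6 white page with a single dark pixel.
page = [[255] * 6 for _ in range(5)]
page[2][3] = 0
slimmed = slim_pixels(page)
```

In this toy case the blank margins are discarded and only the small region around the dark pixel survives. A fully blank page yields an empty result, which a real implementation would need to handle.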

Some results when applying APS to existing MLLMs

| Methods | Supported Resolution | DocVQA | InfoVQA | SROIE | FUNSD |
| --- | --- | --- | --- | --- | --- |
| LLaVA1.5 | 224x224 | 8.5 | 14.7 | 1.7 | 0.2 |
| LLaVA1.5+APS | 224x224 | 10.7 (+27.4%) | 14.7 (+0%) | 3.7 (+118%) | 0.9 (+360%) |
| QwenVL | 448x448 | 48.1 | 23.9 | 34.5 | 20.6 |
| QwenVL+APS | 448x448 | 51.2 (+6.4%) | 24.7 (+4.1%) | 40.0 (+15.9%) | 24.3 (+17.9%) |
| Monkey | 896x896 | 50.1 | 25.8 | 41.9 | 24.1 |
| Monkey+APS | 896x896 | 56.3 (+12.4%) | 27.5 (+6.6%) | 47.0 (+12.2%) | 27.3 (+13.3%) |
| InternVL2 | 448x448x(1~12) | 76.2 | 49.5 | 54.7 | 41.7 |
| InternVL2+APS | 448x448x(1~12) | 76.1 | 48.2 | 54.2 | 40.6 |
| InternVL2+APS+Resize | 448x448x(1~12) | 77.3 (+1.4%) | 49.4 (-0.2%) | 55.2 (+0.9%) | 43.4 (+4.1%) |
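The percentages in parentheses appear to be relative changes over the corresponding baseline row. That reading is an assumption, since the table does not define them. The helper below (`relative_gain` is a hypothetical name) shows the arithmetic under that assumption.

```python
def relative_gain(baseline, improved):
    """Percentage change from baseline to improved, rounded to one decimal."""
    return round((improved - baseline) / baseline * 100, 1)


# QwenVL on DocVQA: 48.1 without APS, 51.2 with APS.
qwen_docvqa = relative_gain(48.1, 51.2)
# Monkey on DocVQA: 50.1 without APS, 56.3 with APS.
monkey_docvqa = relative_gain(50.1, 56.3)
# InternVL2 on InfoVQA: 49.5 baseline, 49.4 with APS+Resize.
internvl_infovqa = relative_gain(49.5, 49.4)
```

These three calls give 6.4, 12.4, and -0.2, matching the values reported for those cells.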

Dynamic Token Slimming (DTS)

DTS must be applied to a trained image encoder and linear projection layer, so no demo is provided here. For customized usage, please refer to the code and its comments.

Citation

If you use our code or data, please consider citing our paper.

@inproceedings{zhang2024dockylin, 
Author = {Zhang, Jiaxin and Yang, Wentao and Lai, Songxuan and Xie, Zecheng and Jin, Lianwen}, 
Booktitle = {Proceedings of the AAAI conference on artificial intelligence}, 
Title = {Dockylin: A large multimodal model for visual document understanding with efficient visual slimming}, 
Year = {2025}}   

Parts of the code are based on TextMonkey and TPS. Thanks to all the authors for their great work.