Vary-tiny-600k

<p align="center"> <img src="asset/vary-600k.jpg" style="width: 200px" align=center> </p> <p align="center"> <a href="https://pan.baidu.com/s/18Rh53JxvbYYl9BPHoFvWcQ">Vary-600k</a> </p>

Background

Release

Contents

Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only.

Install

1. Clone this repository and navigate to the LAVIS-main folder:

   ```shell
   git clone https://github.com/Ucas-HaoranWei/Vary-tiny-600k.git
   cd Vary-tiny-600k/LAVIS-main
   ```

2. Install the package:

   ```shell
   pip install -e .
   ```

3. Prepare the pretrained weights and data:

   - Download the OPT-125M weights here and the SAM-b weights here.
   - Download Vary-600k here (extraction code: "vary").
   - Prepare the dirs as follows (the layout was shown as an image; see the sketch after this list):

     [Image: directory layout]
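The directory layout above was an image in the original README and is not reproduced here. Below is a minimal, hypothetical sketch of one way to place the downloads; every path and filename in it is an assumption, and the authoritative locations are the ones referenced in `lavis/projects/varytiny/train/pretrain.yaml` and the model configs.

```shell
# Hypothetical placement only; match the paths that
# lavis/projects/varytiny/train/pretrain.yaml actually expects.
cd Vary-tiny-600k/LAVIS-main
mkdir -p weights data
mv /path/to/opt-125m             weights/opt-125m   # OPT-125M weights
mv /path/to/sam_vit_b_01ec64.pth weights/           # SAM-b checkpoint
mv /path/to/vary-600k            data/vary-600k     # Vary-600k images + annotations
```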

Train

```shell
python -m torch.distributed.run --nproc_per_node=8 --master_port=29501 train.py --cfg-path lavis/projects/varytiny/train/pretrain.yaml
```

or, for multiple machines:

```shell
python -m torch.distributed.run --master_addr xxx --master_port xxx --node_rank xxx --nnodes xxx --nproc_per_node xxx train.py --cfg-path lavis/projects/varytiny/train/pretrain.yaml
```
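For concreteness, here is a hypothetical two-machine launch (2 nodes x 8 GPUs, matching the 2x8 H800 setup mentioned below); the address, port, and GPU counts are placeholders you must replace with your own:

```shell
# Node 0 (master; 10.0.0.1 is a made-up address):
python -m torch.distributed.run --master_addr 10.0.0.1 --master_port 29501 \
    --node_rank 0 --nnodes 2 --nproc_per_node 8 \
    train.py --cfg-path lavis/projects/varytiny/train/pretrain.yaml

# Node 1 (same command, different rank):
python -m torch.distributed.run --master_addr 10.0.0.1 --master_port 29501 \
    --node_rank 1 --nnodes 2 --nproc_per_node 8 \
    train.py --cfg-path lavis/projects/varytiny/train/pretrain.yaml
```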

If your training goes smoothly, your loss at the end of each epoch will be similar to the following (2×8 H800 GPUs):

[Image: end-of-epoch training loss]

Demo

1. Change the "pretrained" and "finetuned" paths to your checkpoints in `LAVIS-main/lavis/configs/models/varytiny/varytiny_inference.yaml`, such as:

   [Image: checkpoint paths in varytiny_inference.yaml]

2. Run the test script:

   ```shell
   python tests/models/test_varytiny.py --image-file xxx.jpg
   ```

3. We also provide the weights of Vary-tiny trained from scratch on Vary-600k: Vary-tiny-600k.pth (extraction code: "Vary"). You can download them and run the inference directly (a sketch follows this list).
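Putting the demo steps together, a minimal end-to-end sketch using the released checkpoint; the checkpoint location and image path below are illustrative, not prescribed by the repo:

```shell
# 1) Place the released checkpoint somewhere local (path is illustrative):
mkdir -p checkpoints
mv /path/to/Vary-tiny-600k.pth checkpoints/

# 2) Edit lavis/configs/models/varytiny/varytiny_inference.yaml by hand so that
#    "pretrained" and "finetuned" point at checkpoints/Vary-tiny-600k.pth.

# 3) Run inference on your own image:
python tests/models/test_varytiny.py --image-file /path/to/your_image.jpg
```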

Vary-600k

Acknowledgement

Citation

If you find our work useful in your research, please consider citing Vary:

```bibtex
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}

@article{wei2024small,
  title={Small Language Model Meets with Reinforced Vision Vocabulary},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yu, En and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2401.12503},
  year={2024}
}
```