<h3><a href="">Small Language Model Meets with Reinforced Vision Vocabulary</a></h3> <a href="https://varytoy.github.io/"><img src="https://img.shields.io/badge/Project-Page-Green"></a> <a href="https://arxiv.org/abs/2401.12503"><img src="https://img.shields.io/badge/Paper-PDF-orange"></a> <a href="https://vary.xiaomy.net/"><img src="https://img.shields.io/badge/demo-blue"></a> <a href="https://zhuanlan.zhihu.com/p/679447793"><img src="https://img.shields.io/badge/zhihu-yellow"></a>

<a href="https://trendshift.io/repositories/7311" target="_blank"><img src="https://trendshift.io/api/badge/repositories/7311" alt="Ucas-HaoranWei%2FVary-toy | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

Haoran Wei*, Lingyu Kong*, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang

<p align="center"> <img src="assets/vary-toy-logo.jpg" style="width: 200px" align=center> </p> <p align="center"> <a href="">The Young's First ``Large'' Vision Language Model</a> </p>

Release

Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. Their use is also restricted to purposes that comply with the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.

Contents

Note

If you have already built the original Vary, please rebuild this repo!

Install

  1. Clone this repository and navigate to the Vary folder
git clone https://github.com/Ucas-HaoranWei/Vary-toy.git
cd /path/to/vary-toy
  2. Install Package
conda create -n vary python=3.10 -y
conda activate vary
pip install -e .
  3. Install Flash-Attention
pip install ninja
pip install flash-attn --no-build-isolation
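
After the three install steps, a quick sanity check can confirm that the environment looks as expected. The sketch below is only an illustration, not part of the repository: the helper names are hypothetical, and the module names `torch`, `flash_attn`, and `vary` are assumptions (the usual import names for PyTorch and Flash-Attention, and the package name inferred from the demo script path).

```python
import importlib.util
import sys

# Assumed import names: "torch" (PyTorch), "flash_attn" (Flash-Attention),
# and "vary" (inferred from the path vary/demo/run_qwen_vary.py).
REQUIRED_MODULES = ("torch", "flash_attn", "vary")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(expected=(3, 10)):
    """True if the interpreter matches the version requested in the conda step."""
    return sys.version_info[:2] == tuple(expected)


if __name__ == "__main__":
    print("Python 3.10:", python_version_ok())
    print("Missing modules:", missing_modules() or "none")
```

Run it inside the activated `vary` environment; an empty "missing" list only shows that the packages are importable, not that Flash-Attention was compiled for your GPU.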

Vary Weights

Demo

  1. Update the CLIP-ViT path in the code (/cache/vit-large-patch14/) to your local path.

python vary/demo/run_qwen_vary.py  --model-name  /vary/model/path/ --image-file /an/image/file.png
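
To locate every place where the placeholder CLIP-ViT path from step 1 appears, something like the following could be used before running the demo. This is a minimal sketch under stated assumptions: `find_hardcoded_path` is a hypothetical helper, and the `vary` directory name is inferred from the demo script path rather than confirmed.

```python
from pathlib import Path

# Placeholder path quoted in the demo step above.
DEFAULT_VIT_PATH = "/cache/vit-large-patch14/"


def find_hardcoded_path(root, needle=DEFAULT_VIT_PATH):
    """Yield (file, line_number, line) for each .py line under `root` containing `needle`."""
    for py_file in sorted(Path(root).rglob("*.py")):
        text = py_file.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield py_file, line_number, line.strip()


if __name__ == "__main__":
    # "vary" is an assumed package directory, inferred from the demo script path.
    for py_file, line_number, line in find_hardcoded_path("vary"):
        print(f"{py_file}:{line_number}: {line}")
```

The script only reports matches; editing each reported line by hand keeps the change easy to review.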

Train

deepspeed   Vary/train/train_qwen_vary.py  --deepspeed /Vary/zero_config/zero2.json \
            --model_name_or_path /Vary-toy/path/ \
            --vision_tower /vit-large-patch14/path/ \
            --freeze_vision_tower True \
            --freeze_lm_model False \
            --vision_select_layer  -2 \
            --use_im_start_end True \
            --bf16 True \
            --per_device_eval_batch_size 4 \
            --gradient_accumulation_steps 1 \
            --evaluation_strategy "no" \
            --save_strategy "steps" \
            --save_steps 5000 \
            --save_total_limit 1 \
            --weight_decay 0. \
            --warmup_ratio 0.03 \
            --lr_scheduler_type "cosine" \
            --logging_steps 1 \
            --tf32 True \
            --model_max_length 4096 \
            --gradient_checkpointing True \
            --dataloader_num_workers 4 \
            --report_to none \
            --per_device_train_batch_size 4 \
            --num_train_epochs 1 \
            --learning_rate 5e-5 \
            --datasets  data_name1+data_name2+data_name3 \
            --output_dir /path/to/output/
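
To see how the batch-size and warmup flags above interact, the arithmetic can be sketched as follows. The GPU count (8) and dataset size (1,000,000 samples) are hypothetical values chosen for illustration, the helper names are made up, and the step count is an approximation that ignores uneven last batches.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples consumed per optimizer step across all GPUs."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps(num_samples, global_batch, num_epochs=1):
    """Approximate number of optimizer steps for the whole run."""
    return math.ceil(num_samples / global_batch) * num_epochs


def warmup_steps(total_steps, warmup_ratio):
    """Steps spent ramping up the learning rate before the cosine decay."""
    return math.ceil(total_steps * warmup_ratio)


if __name__ == "__main__":
    # Flag values from the command above; num_gpus and num_samples are assumptions.
    global_batch = effective_batch_size(per_device_batch=4, grad_accum_steps=1, num_gpus=8)
    total = optimizer_steps(num_samples=1_000_000, global_batch=global_batch, num_epochs=1)
    print("global batch size:", global_batch)           # 32
    print("optimizer steps:", total)                    # 31250
    print("warmup steps:", warmup_steps(total, 0.03))   # 938
```

With `--save_steps 5000` and `--save_total_limit 1`, a run of this hypothetical length would write a checkpoint every 5000 steps but keep only the most recent one on disk.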

We encourage you to extract the new vision vocabulary weights for your own base language model!

Contact

If you have any questions about the code or the paper, please email weihaoran18@mails.ucas.ac.cn.

Discussion

Vary-toy is not a toy: we have built two models on top of it, Vary-document (for document/PDF processing) and Vary-plot (for chart analysis). You can see their performance at Vary-family.

Citation

If you find our work useful in your research, please consider citing Vary:

@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}

@article{wei2024small,
  title={Small Language Model Meets with Reinforced Vision Vocabulary},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yu, En and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2401.12503},
  year={2024}
}