Home

Awesome

Libra: Building Decoupled Vision System on Large Language Models

This repository provides a simple implementation of Libra in PyTorch, including pretraining, finetuning, and inference.

Please refer to the ICML 2024 paper:

Libra: Building Decoupled Vision System on Large Language Models

Yifan Xu, Xiaoshan Yang, Yaguang Song, Changsheng Xu

Preparation

ENVIRONMENT. Install the required dependencies:

pip install -r requirements.txt

DATA. The code supports data in the webdatasets, coco, LLaVA-instruction formats, specifically as:

DATASETS/
├── laion/
│   ├── 00000.tar
│   ├── 00001.tar
│   ├── ...
│   └── 07776.tar
├── instruction/
│   ├── llava_v1_5_mix665k.json
│   ├── data/
│   |   ├── coco/
│   |   ├── gqa/
│   |   ├── ...
│   └── └── vg
└── coco/
    ├── annotations/
    │   ├── coco_karpathy_train.json
    |   └── ...
    ├── train2017/
    ├── val2017/
    ├── train2014/
    └── ...

CHECKPOINTS. If you want to train Libra from scratch, several praparations are needed. Otherwise you can just skip this step.

  1. Prepare the huggingface version of the llama-2-7b-chat-hf model. Please refer to here. Then rename the folder name to llama-2-7b-chat-hf-libra.
  2. Merge the vision tokenizer weight into the pretrained llama path. The pretrained vision tokenizer weight can be found here.
  3. Download the pretrained CLIP model in huggingface and merge it into the pretrained model paths. The CLIP model can be downloaded here.

If you want to run the official Libra models, you need to download libra-11b-chat or libra-11b-base.

The final checkpoint path should be like:

CHECKPOINTS/
├── libra-11b-base/
│   ├── ...
│   └── openai-clip-vit-large-patch14-336/
│       └── ...    
├── libra-11b-chat/
│   ├── ...
│   └── openai-clip-vit-large-patch14-336/
│       └── ...    
└── llama-2-7b-chat-hf-libra/
    |
    │   # original llama files
    |
    ├── config.json
    ├── pytorch_model-00001-of-00002.bin
    ├── ...
    ├── tokenizer.model
    │   
    │   # newly added vision tokenizer
    │   
    ├── vision_tokenizer_config.yaml
    ├── vqgan.ckpt
    │
    │   # CLIP model
    │
    └── openai-clip-vit-large-patch14-336/
        └── ...    

Inference

We provide a simple jupyter demo here.

Pretraining

We use the LAION dataset for pretraining. Please refer to the config file for detailed usage. The training command is:

torchrun --nnodes=5 --nproc_per_node=8 train.py --cfg-path libra/configs/libra_pretrain.yaml

Instruction Tuning

The code supports finetuning data in the LLaVA instruction format. Please refer to LLaVA to organize the data. Or you can use customized data, as long as its annonation is similar to llava_v1_5_mix665k.json.

torchrun --nnodes=1 --nproc_per_node=8 train.py --cfg-path libra/configs/libra_instruction.yaml

Model Weights

We provide the pretrained base model (Libra-Base) and the model after instruction tuning (Libra-Chat).

ModelUrl
Libra-BaseHuggingFace
Libra-ChatHuggingFace

Citation

If you find our work helpful, please consider citing:

@InProceedings{xu2024libra,
  title = {Libra: Building Decoupled Vision System on Large Language Models},
  author = {Xu, Yifan and Yang, Xiaoshan and Song, Yaguang and Xu, Changsheng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {55371--55388},
  year = {2024},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
}

Acknowledgments

We'd like to thank Menghao Hu from Pengcheng Laboratory for data management and Chaoyou Fu from Tencent for early discussion. The code was built upon LAVIS, Huggingface Trainer, and deepspeed. Thanks for their great works.