Awesome
Libra: Building Decoupled Vision System on Large Language Models
This repository provides a simple implementation of Libra in PyTorch, including pretraining, finetuning, and inference.
Please refer to the ICML 2024 paper:
Libra: Building Decoupled Vision System on Large Language Models
Yifan Xu, Xiaoshan Yang, Yaguang Song, Changsheng Xu
Preparation
ENVIRONMENT. Install the required dependencies:
pip install -r requirements.txt
DATA. The code supports data in the webdatasets, coco, LLaVA-instruction formats, specifically as:
DATASETS/
├── laion/
│ ├── 00000.tar
│ ├── 00001.tar
│ ├── ...
│ └── 07776.tar
├── instruction/
│ ├── llava_v1_5_mix665k.json
│ ├── data/
│ | ├── coco/
│ | ├── gqa/
│ | ├── ...
│ └── └── vg
└── coco/
├── annotations/
│ ├── coco_karpathy_train.json
| └── ...
├── train2017/
├── val2017/
├── train2014/
└── ...
CHECKPOINTS. If you want to train Libra from scratch, several praparations are needed. Otherwise you can just skip this step.
- Prepare the huggingface version of the
llama-2-7b-chat-hf
model. Please refer to here. Then rename the folder name tollama-2-7b-chat-hf-libra
. - Merge the vision tokenizer weight into the pretrained llama path. The pretrained vision tokenizer weight can be found here.
- Download the pretrained CLIP model in huggingface and merge it into the pretrained model paths. The CLIP model can be downloaded here.
If you want to run the official Libra models, you need to download libra-11b-chat
or libra-11b-base
.
The final checkpoint path should be like:
CHECKPOINTS/
├── libra-11b-base/
│ ├── ...
│ └── openai-clip-vit-large-patch14-336/
│ └── ...
├── libra-11b-chat/
│ ├── ...
│ └── openai-clip-vit-large-patch14-336/
│ └── ...
└── llama-2-7b-chat-hf-libra/
|
│ # original llama files
|
├── config.json
├── pytorch_model-00001-of-00002.bin
├── ...
├── tokenizer.model
│
│ # newly added vision tokenizer
│
├── vision_tokenizer_config.yaml
├── vqgan.ckpt
│
│ # CLIP model
│
└── openai-clip-vit-large-patch14-336/
└── ...
Inference
We provide a simple jupyter demo here.
Pretraining
We use the LAION dataset for pretraining. Please refer to the config file for detailed usage. The training command is:
torchrun --nnodes=5 --nproc_per_node=8 train.py --cfg-path libra/configs/libra_pretrain.yaml
Instruction Tuning
The code supports finetuning data in the LLaVA instruction format. Please refer to LLaVA to organize the data.
Or you can use customized data, as long as its annonation is similar to llava_v1_5_mix665k.json
.
torchrun --nnodes=1 --nproc_per_node=8 train.py --cfg-path libra/configs/libra_instruction.yaml
Model Weights
We provide the pretrained base model (Libra-Base) and the model after instruction tuning (Libra-Chat).
Model | Url |
---|---|
Libra-Base | HuggingFace |
Libra-Chat | HuggingFace |
Citation
If you find our work helpful, please consider citing:
@InProceedings{xu2024libra,
title = {Libra: Building Decoupled Vision System on Large Language Models},
author = {Xu, Yifan and Yang, Xiaoshan and Song, Yaguang and Xu, Changsheng},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {55371--55388},
year = {2024},
volume = {235},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR},
}
Acknowledgments
We'd like to thank Menghao Hu from Pengcheng Laboratory for data management and Chaoyou Fu from Tencent for early discussion. The code was built upon LAVIS, Huggingface Trainer, and deepspeed. Thanks for their great works.