Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics

Haoqin Tu*, Bingchen Zhao*, Chen Wei, Cihang Xie (*Equal Contribution)

Our paper is online now: https://arxiv.org/abs/2309.07120

<p align="center"> <img src="teaser.png" width="1080"> </p>

Installation

Please follow LLaVA for setting up the training environment.
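
For reference, a typical LLaVA-style setup looks roughly like the sketch below; the environment name, Python version, and extra packages are assumptions, so defer to the LLaVA instructions linked above if anything differs.

```bash
# Rough sketch of a LLaVA-style environment setup (assumed versions and layout).
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e .               # run from the root of this repository (assumed layout)
pip install deepspeed wandb    # required by the training commands further below
```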

Model Weights

Below we list all the model and vision-text projector weights used in the paper.

| Model | Pretrain Weights | Instruction-Tuned Weights |
| --- | --- | --- |
| LLaMA-7B | ckpt | Finetune ckpt |
| Vicuna-7B | ckpt | Finetune ckpt |
| LLaMA-3B | ckpt | Finetune ckpt / LoRA ckpt |
| Alpaca-3B | ckpt | Finetune ckpt / LoRA ckpt |
| LLaMA2-7B | ckpt | Finetune ckpt / LoRA ckpt |
| LLaMA2-chat-7B | ckpt | Finetune ckpt / LoRA ckpt |
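
If the checkpoints are hosted on the Hugging Face Hub, one way to fetch them is with the Hugging Face CLI; the repository id below is a placeholder, so substitute the one behind the corresponding ckpt link in the table.

```bash
# Illustrative download of one checkpoint via the Hugging Face CLI.
# <hf-repo-id> is a placeholder for the repository behind a "ckpt" link above.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <hf-repo-id> --local-dir ./checkpoints/MM-LLaMA2-7B-ft
```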

Evaluations

For NLP & Multi-Modal data and evaluations, please see instructions here.

Model Training

We follow the two-stage training paradigm of LLaVA: (1) feature alignment, which uses approximately 600K filtered CC3M image-text pairs to connect a frozen pretrained vision encoder to a frozen LLM; and (2) visual instruction tuning, which uses 80K filtered GPT-generated visual instruction data (see here) to teach the model to follow multimodal instructions.

Feature Alignment Training

Please download the subset of the CC3M dataset we use in the paper here. You can check the pretraining script below.

<details> <summary>Pretrain: LLaMA2-7B.</summary>
deepspeed llava/train/train.py --deepspeed scripts/zero3.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --version v0 \
    --data_path /path/to/cc3m_595k.json \
    --image_folder /path/to/cc3m_595k_images \
    --vision_tower openai/clip-vit-large-patch14 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints/MM-LLaMA2-7B-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
</details>
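
After feature alignment finishes, the visual instruction tuning stage reads the projector weights from the pretraining output directory; a quick sanity check is shown below (the path follows the `--pretrain_mm_mlp_adapter` argument used later, and the filename may differ across LLaVA versions).

```bash
# Sanity check: the visual instruction tuning command below expects the
# projector weights at this path; adjust if your LLaVA version saves the
# adapter under a different name.
ls -lh ./checkpoints/MM-LLaMA2-7B-pretrain/mm_projector.bin
```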

Visual Instruction Tuning

  1. Data preparation: Please download llava_instruct_80k.json and the COCO train2017 images here.
  2. Training: You can download our pretrained projector here, then check the finetuning script (an example command is shown below) or the LoRA tuning script (a hedged sketch follows the finetuning example).
<details> <summary>Visual Instruction Tuning: MM-LLaMA2-7B-ft.</summary>
deepspeed llava/train/train.py --deepspeed scripts/zero2.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --version llava_llama_2 \
    --data_path /path/to/llava_instruct_80k.json \
    --image_folder /path/to/coco/train2017/ \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/MM-LLaMA2-7B-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints/MM-LLaMA2-7B-ft \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
</details>
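
For the LoRA tuning script mentioned in step 2, the sketch below assumes this codebase keeps LLaVA-style LoRA arguments (`--lora_enable`, `--lora_r`, `--lora_alpha`) and otherwise mirrors the full finetuning command above; the rank, alpha, and learning rate are illustrative rather than the paper's settings, so the released LoRA tuning script remains authoritative.

<details> <summary>Visual Instruction Tuning (LoRA sketch): MM-LLaMA2-7B-lora.</summary>

```bash
# Hedged LoRA variant of the finetuning command above; assumes LLaVA-style
# LoRA flags, with illustrative rank/alpha/learning-rate values.
deepspeed llava/train/train.py --deepspeed scripts/zero2.json \
    --lora_enable True --lora_r 64 --lora_alpha 128 \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --version llava_llama_2 \
    --data_path /path/to/llava_instruct_80k.json \
    --image_folder /path/to/coco/train2017/ \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/MM-LLaMA2-7B-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints/MM-LLaMA2-7B-lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
```
</details>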

Usage and License Notices

The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.

Citation

If you find this repo useful for your research and applications, please cite it using this BibTeX:

@article{tu2023sight,
  title={Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics},
  author={Tu, Haoqin and Zhao, Bingchen and Wei, Chen and Xie, Cihang},
  journal={arXiv preprint arXiv:2309.07120},
  year={2023}
}

Acknowledgement

This work is partially supported by a gift from Open Philanthropy. We thank the Center for AI Safety for supporting our computing needs. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

Related Projects