Fine-tuning Llama3.2-Vision

This repository contains a script for training Llama3.2-Vision using only HuggingFace and Liger-Kernel.

Other projects

[Phi3-Vision Finetuning]<br> [Qwen2-VL Finetuning]<br> [Molmo Finetuning]

Update

Supported Features

Installation

Install the required packages using environment.yaml.

Using environment.yaml

conda env create -f environment.yaml
conda activate llama

Note: Llama3.2-Vision does not support flash-attention2 for now.

Dataset Preparation

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder.

When using a multi-image dataset, every image token should be <image>, and the image file names should be given as a list. Please see the examples below and format your data accordingly.

<details> <summary>Example for single image dataset</summary>
[
  {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      },
      {
        "from": "human",
        "value": "What feature can be seen on the back of the bus?"
      },
      {
        "from": "gpt",
        "value": "The back of the bus features an advertisement."
      },
      {
        "from": "human",
        "value": "Is the bus driving down the street or pulled off to the side?"
      },
      {
        "from": "gpt",
        "value": "The bus is driving down the street, which is crowded with people and other vehicles."
      }
    ]
  }
  ...
]
</details> <details> <summary>Example for multi image dataset</summary>
[
  {
    "id": "000000033471",
    "image": ["000000033471.jpg", "000000033472.jpg"],
    "conversations": [
      {
        "from": "human",
        "value": "<image>\n<image>\nIs the perspective of the camera different?"
      },
      {
        "from": "gpt",
        "value": "Yes, the perspective of the camera is different."
      }
    ]
  }
  ...
]
</details> <details> <summary>Example for video dataset</summary>
[
  {
    "id": "sample1",
    "video": "sample1.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nWhat is going on in this video?"
      },
      {
        "from": "gpt",
        "value": "A man is walking down the road."
      }
    ]
  }
  ...
]

Note: Llama3.2-Vision treats a video as a sequence of images.

</details>
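Before training, it can help to sanity-check a dataset against the rules above. The following is a minimal sketch, not part of this repository: `validate_llava_dataset` is a hypothetical helper that assumes the JSON layout shown in the examples. It checks that every referenced image exists under the image folder and, as an assumption drawn from the examples, that the number of `<image>` tokens equals the number of image files.

```python
from pathlib import Path


def validate_llava_dataset(records, image_folder):
    """Return a list of human-readable problems found in LLaVA-style records.

    Hypothetical helper, not part of this repository. It assumes the layout
    shown in the examples: an "image" field holding a file name or a list of
    file names, and "conversations" holding "from"/"value" turns.
    """
    problems = []
    folder = Path(image_folder)
    for record in records:
        rec_id = record.get("id", "<no id>")
        images = record.get("image", [])
        if isinstance(images, str):
            images = [images]  # single-image entries use a plain string

        # Every referenced file should exist under the image folder.
        for name in images:
            if not (folder / name).is_file():
                problems.append(f"{rec_id}: missing image file {name}")

        # Assumption drawn from the examples: one <image> token per file.
        n_tokens = sum(
            turn.get("value", "").count("<image>")
            for turn in record.get("conversations", [])
        )
        if images and n_tokens != len(images):
            problems.append(
                f"{rec_id}: {n_tokens} <image> tokens for {len(images)} images"
            )
    return problems
```

To use it, load the dataset with `json.load` and pass the resulting list together with the same folder you give to --image_folder. An empty result means no problems were detected by these two checks.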

Training

To run the training script, use one of the following commands:

Full Finetuning

bash scripts/finetune.sh

Full Finetuning with 8-bit

bash scripts/finetune_8bit.sh

This script finetunes the model with 8bit-adamw and fp8 model dtype. Use it if you run out of VRAM.

Finetune with LoRA

If you want to train only the language model with LoRA and perform full training for the vision model:

bash scripts/finetune_lora.sh

If you want to train both the language model and the vision model with LoRA:

bash scripts/finetune_lora_vision.sh

IMPORTANT: If you want to tune embed_token with LoRA, you need to tune lm_head together with it.

<details> <summary>Training arguments</summary>

Note: The learning rate of vision_model should be 5x to 10x smaller than that of the language_model.

</details>
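How the training script applies separate learning rates is not shown in this section. As a rough illustration of the idea only, the sketch below splits parameters into two optimizer groups. `build_param_groups` is a hypothetical helper; it assumes vision parameters can be recognized by `vision_model` appearing in their names.

```python
def build_param_groups(named_params, base_lr, vision_lr_ratio=0.1):
    """Split parameters into language and vision groups with separate LRs.

    Hypothetical helper, not part of this repository. It assumes vision
    parameters have "vision_model" in their name and that the result goes to
    an optimizer accepting per-group "lr" values (e.g. torch.optim.AdamW).
    """
    vision, language = [], []
    for name, param in named_params:
        (vision if "vision_model" in name else language).append(param)
    return [
        {"params": language, "lr": base_lr},
        # A ratio of 0.1 to 0.2 gives the 5x to 10x reduction noted above.
        {"params": vision, "lr": base_lr * vision_lr_ratio},
    ]
```

With PyTorch, the returned list could be passed directly to an optimizer, for instance `torch.optim.AdamW(build_param_groups(model.named_parameters(), base_lr))`. Filtering out frozen parameters is left out here for brevity.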

Train with video dataset

You can train the model using a video dataset. However, Llama3.2-Vision processes videos as a sequence of images, so you’ll need to select specific frames and treat them as multiple images for training. The same approach works with LoRA if you set the LoRA configs.

bash scripts/finetune_video.sh

If you run out of VRAM, you can use zero3_offload instead of zero3. However, using zero3 is preferred.
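To make the frame-selection step concrete, here is a minimal sketch of one common choice, uniform sampling. Both functions are hypothetical and the repository's own frame-selection logic may differ. The second function assumes a video prompt is turned into the multi-image format by emitting one `<image>` token per sampled frame.

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick `num_frames` evenly spaced frame indices from a video.

    Hypothetical helper illustrating uniform sampling; not taken from this
    repository.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        return list(range(total_frames))
    # Use the midpoint of each of `num_frames` equal segments, computed with
    # integer arithmetic to avoid floating-point rounding at boundaries.
    return [
        (2 * i + 1) * total_frames // (2 * num_frames)
        for i in range(num_frames)
    ]


def video_prompt_to_images(prompt, num_frames):
    """Replace the first <video> token with one <image> token per frame."""
    return prompt.replace("<video>", "\n".join(["<image>"] * num_frames), 1)
```

The indices can then be used with whichever video reader you prefer to extract the frames, and the rewritten prompt lines up with the multi-image dataset format described in Dataset Preparation.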

Merge LoRA Weights

bash scripts/merge_lora.sh

Note: Remember to replace the paths in finetune.sh or finetune_lora.sh with your specific paths. (Also in merge_lora.sh when using LoRA.)

Issue for libcudnn error

Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8

If you encounter this error, running unset LD_LIBRARY_PATH may resolve it. See this issue for details.

TODO

Known Issues

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

Citation

If you find this repository useful in your project, please consider giving a :star: and citing:

@misc{Llama3.2-Vision-Finetuning,
  author = {Yuwon Lee},
  title = {Llama3.2-Vision-Finetune},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/2U1/Llama3.2-Vision-Ft}
}

Acknowledgement

This project is based on