Home

Awesome

<div align="right"> English | <a title="简体中文" href="./readme/README_zhcn.md">简体中文</a></a> </div>

Training Phi3-V with PEFT

This repository contains a script for training the Phi3-V model with Parameter-Efficient Fine-Tuning (PEFT) techniques using various configurations and options.

Table of Contents

Supported Features

Installation

Install the required packages using either requirements.txt or environment.yml.

Using requirements.txt

pip install -r requirements.txt

Using environment.yml

conda env create -f environment.yml
conda activate phi3v

Model Download

Before training, download the Phi3-V model from HuggingFace. It is recommended to use the huggingface-cli to do this.

  1. Install the HuggingFace CLI:
pip install -U "huggingface_hub[cli]"
  1. Download the model:
huggingface-cli download microsoft/Phi-3-vision-128k-instruct --local-dir Phi-3-vision-128k-instruct --resume-download
<!-- 3. Replace the modeling files: Replace the modeling files under `Phi-3-vision-128k-instruct` with the ones under `overwrites/Phi-3-vision-128k-instruct`. -->

Usage

To run the training script, use the following command:

bash scripts/train.sh

Note: Remember to replace the paths in train.sh with your specific paths.

Arguments

Dataset Preparation

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder.

<details> <summary>Example Dataset</summary>
[
  {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      },
      {
        "from": "human",
        "value": "What feature can be seen on the back of the bus?"
      },
      {
        "from": "gpt",
        "value": "The back of the bus features an advertisement."
      },
      {
        "from": "human",
        "value": "Is the bus driving down the street or pulled off to the side?"
      },
      {
        "from": "gpt",
        "value": "The bus is driving down the street, which is crowded with people and other vehicles."
      }
    ]
  }
  ...
]
</details>

TODO

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

This project borrowed code from LLaVA and Microsoft Phi-3-vision-128k-instruct. Thanks to both projects for their contributions.

Citation

If you use this codebase in your work, please cite this project:

@misc{phi3vfinetuning2023,
  author = {Gai Zhenbiao & Shao Zhenwei},
  title = {Phi3V-Finetuning},
  year = {2023},
  publisher = {GitHub},
  url = {https://github.com/GaiZhenbiao/Phi3V-Finetuning},
  note = {GitHub repository},
}