<!-- markdownlint-disable first-line-h1 --> <!-- markdownlint-disable html --> <div align="center"> <img src="assets/logo.jpg" width="390"/> <div> </div> <div align="center"> <b><font size="5">project website</font></b> <sup> <a href="https://space.bilibili.com/3493095748405551?spm_id_from=333.337.search-card.all.click"> <i><font size="4">HOT</font></i> </a> </sup> <b><font size="5">PKU-Alignment Team</font></b> <sup> <a href="https://space.bilibili.com/3493095748405551?spm_id_from=333.337.search-card.all.click"> <i><font size="4">welcome</font></i> </a> </sup> </div> <div> </div>📘Documentation | 🆕Update News | 🛠️Quick Start | 🚀Algorithms | 👀Evaluation | 🤔Reporting Issues
</div> <div align="center">Our 100K Instruction-Following Datasets
</div>
Align-Anything aims to align large models of any modality (any-to-any models), including LLMs, VLMs, and others, with human intentions and values. More details about the definition and milestones of alignment for large models can be found in AI Alignment. Overall, this framework has the following characteristics:
- Highly Modular Framework. Its versatility stems from the abstraction of different algorithm types and well-designed APIs, allowing users to easily modify and customize the code for different tasks.
- Support for Various Model Fine-Tuning. This framework includes fine-tuning capabilities for models such as LLaMA3.1, LLaVA, Gemma, Qwen, Baichuan, and others (see Model Zoo).
- Support Fine-Tuning across Any Modality. It supports alignment fine-tuning for models of different modalities, including LLMs, VLMs, and other modalities (see Development Roadmap).
- Support Different Alignment Methods. The framework supports different alignment algorithms, including SFT, DPO, PPO, and others.
| | <details><summary>prompt</summary>Small white toilet sitting in a small corner next to a wall.</details> | <details><summary>prompt</summary>A close up of a neatly made bed with two night stands</details> | <details><summary>prompt</summary>A pizza is sitting on a plate at a restaurant.</details> | <details><summary>prompt</summary>A girl in a dress next to a piece of luggage and flowers.</details> |
---|---|---|---|---|
Before Alignment (Chameleon-7B) | <img src="https://github.com/Gaiejj/align-anything-images/blob/main/chameleon/before/1.png?raw=true" alt="Image 8" style="max-width: 100%; height: auto;"> | <img src="https://github.com/Gaiejj/align-anything-images/blob/main/chameleon/before/2.png?raw=true" alt="Image 8" style="max-width: 100%; height: auto;"> | <img src="https://github.com/Gaiejj/align-anything-images/blob/main/chameleon/before/3.png?raw=true" alt="Image 8" style="max-width: 100%; height: auto;"> | <img src="https://github.com/Gaiejj/align-anything-images/blob/main/chameleon/before/4.png?raw=true" alt="Image 8" style="max-width: 100%; height: auto;"> |
After Alignment (Chameleon 7B Plus) | <img src="https://github.com/Gaiejj/align-anything-images/blob/main/chameleon/after/1.png?raw=true" alt="Image 8" style="max-width: 100%; height: auto;"> | <img src="https://github.com/Gaiejj/align-anything-images/blob/main/chameleon/after/2.png?raw=true" alt="Image 8" style="max-width: 100%; height: auto;"> | <img src="https://github.com/Gaiejj/align-anything-images/blob/main/chameleon/after/3.png?raw=true" alt="Image 8" style="max-width: 100%; height: auto;"> | <img src="https://github.com/Gaiejj/align-anything-images/blob/main/chameleon/after/4.png?raw=true" alt="Image 8" style="max-width: 100%; height: auto;"> |
Alignment fine-tuning can significantly enhance the instruction-following capabilities of large multimodal models. After fine-tuning, Chameleon 7B Plus generates images that are more relevant to the prompt.
Algorithms
We support basic alignment algorithms for different modalities, each of which may involve additional algorithms. For instance, in the text modality, we have also implemented SimPO, KTO, and others.
Modality | SFT | RM | DPO | PPO |
---|---|---|---|---|
Text -> Text (t2t) | ✔️ | ✔️ | ✔️ | ✔️ |
Text+Image -> Text (ti2t) | ✔️ | ✔️ | ✔️ | ✔️ |
Text+Image -> Text+Image (ti2ti) | ✔️ | ✔️ | ✔️ | ✔️ |
Text+Audio -> Text (ta2t) | ✔️ | ✔️ | ✔️ | ✔️ |
Text+Video -> Text (tv2t) | ✔️ | ✔️ | ✔️ | ✔️ |
Text -> Image (t2i) | ✔️ | ⚒️ | ✔️ | ⚒️ |
Text -> Video (t2v) | ✔️ | ⚒️ | ✔️ | ⚒️ |
Text -> Audio (t2a) | ✔️ | ⚒️ | ✔️ | ⚒️ |
Evaluation
We support evaluation datasets for Text -> Text, Text+Image -> Text and Text -> Image.
Modality | Supported Benchmarks |
---|---|
t2t | ARC, BBH, Belebele, CMMLU, GSM8K, HumanEval, MMLU, MMLU-Pro, MT-Bench, PAWS-X, RACE, TruthfulQA |
ti2t | A-OKVQA, LLaVA-Bench(COCO), LLaVA-Bench(wild), MathVista, MM-SafetyBench, MMBench, MME, MMMU, MMStar, MMVet, POPE, ScienceQA, SPA-VL, TextVQA, VizWizVQA |
tv2t | MVBench, Video-MME |
ta2t | AIR-Bench |
t2i | ImageReward, HPSv2, COCO-30k(FID) |
t2v | ChronoMagic-Bench |
t2a | AudioCaps(FAD) |
- ⚒️ : coming soon.
News
- 2024-10-10: We support SFT for Any -> Any modality models Emu3.
- 2024-09-24: We support SFT, DPO, RM and PPO for Text + Video -> Text modality models.
- 2024-09-13: We support SFT, DPO, RM and PPO for Text + Audio -> Text modality models.
- 2024-08-17: We support DPO and PPO for Text+Image -> Text+Image modality models.
- 2024-08-15: We support a new function in the evaluation module: the models_pk script in here, which enables comparing the performance of two models across different benchmarks.
- 2024-08-06: We restructure the framework to support any modality evaluation and the supported benchmark list is here.
- 2024-08-06: We support Text+Image -> Text+Image modality for the SFT trainer and Chameleon models.
- 2024-07-23: We support Text -> Image, Text -> Audio, and Text -> Video modalities for the SFT trainer and DPO trainer.
- 2024-07-22: We support the Chameleon model for the SFT trainer and DPO trainer!
- 2024-07-17: We open-source the Align-Anything-Instruction-100K dataset for text modality. This dataset is available in both English and Chinese versions, each sourced from different data sets and meticulously refined for quality by GPT-4.
- 2024-07-14: We open-source the align-anything framework.
Installation
# clone the repository
git clone git@github.com:PKU-Alignment/align-anything.git
cd align-anything
# create virtual env
conda create -n align-anything python==3.11
conda activate align-anything
[Optional] We recommend installing CUDA in the conda environment and setting the CUDA_HOME environment variable.
# We tested on the H800 computing cluster, and this version of CUDA works well.
# You can adjust this version according to the actual situation of the computing cluster.
conda install nvidia/label/cuda-12.2.0::cuda
export CUDA_HOME=$CONDA_PREFIX
If CUDA is installed in a different location, for example with nvcc at /usr/local/cuda/bin/nvcc, you can set the environment variable as follows:
export CUDA_HOME="/usr/local/cuda"
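If you are unsure where CUDA lives, the toolkit root can usually be derived from the location of nvcc, since the compiler sits in the bin directory directly under it. The following is a minimal sketch under that assumption; the helper name `cuda_home_from_nvcc` is hypothetical and not part of align-anything.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of align-anything): derive the CUDA root
# from the path of the nvcc binary,
# e.g. /usr/local/cuda/bin/nvcc -> /usr/local/cuda.
cuda_home_from_nvcc() {
    local nvcc_path="$1"
    # Drop the trailing "/bin/nvcc" by taking the parent directory twice.
    dirname "$(dirname "${nvcc_path}")"
}

# Export CUDA_HOME only when nvcc is actually found on the PATH.
if command -v nvcc >/dev/null 2>&1; then
    CUDA_HOME="$(cuda_home_from_nvcc "$(command -v nvcc)")"
    export CUDA_HOME
fi
```

If nvcc is reached through a symlink, the derived path follows the symlink's location rather than the real installation, so check the result before relying on it.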
Finally, install align-anything by:
pip install -e .
Wandb Logger
We support wandb logging. By default, it is set to offline. To view wandb logs online, set the WANDB_API_KEY environment variable before starting training:
export WANDB_API_KEY="..." # your W&B API key here
<!-- ## Install from Dockerfile
1. build docker image
```bash
FROM nvcr.io/nvidia/pytorch:24.02-py3
RUN echo "export PS1='[\[\e[1;33m\]\u\[\e[0m\]:\[\e[1;35m\]\w\[\e[0m\]]\$ '" >> ~/.bashrc
WORKDIR /root/align-anything
COPY . .
RUN python -m pip install --upgrade pip \
&& pip install -e .
```
then,
```bash
docker build --tag align-anything .
```
2. run the container
```bash
docker run -it --rm \
--gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--mount type=bind,source=<host's mode path>,target=<docker's mode path> \
align-anything
``` -->
Quick Start
Training Scripts
All training scripts are located in ./scripts, and parameters that require user input have been left empty. For example, the DPO script for the Text + Image -> Text modality is as follows:
MODEL_NAME_OR_PATH="" # model path
TRAIN_DATASETS="" # dataset path
TRAIN_TEMPLATE="" # dataset template
TRAIN_SPLIT="" # split the dataset
OUTPUT_DIR="" # output dir
source ./setup.sh # source the setup script
export CUDA_HOME=$CONDA_PREFIX # replace it with your CUDA path
deepspeed \
--master_port ${MASTER_PORT} \
--module align_anything.trainers.text_image_to_text.dpo \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--train_template ${TRAIN_TEMPLATE} \
--train_split ${TRAIN_SPLIT} \
--output_dir ${OUTPUT_DIR}
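Because the required parameters above are left empty, launching the script unmodified passes empty arguments to the trainer. A minimal sketch of a pre-flight check is shown here; `require_vars` is a hypothetical helper, not part of align-anything, and it assumes the variable names used in the script above.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of align-anything).
# Prints an error for every named variable that is unset or empty and
# returns non-zero if at least one of them is missing.
require_vars() {
    local name value missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        # The names are literals written in the script, so eval is safe here.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "error: ${name} is empty; fill it in before training" >&2
            missing=1
        fi
    done
    return "${missing}"
}
```

Placed just before the deepspeed command, a call such as `require_vars MODEL_NAME_OR_PATH TRAIN_DATASETS OUTPUT_DIR || exit 1` stops the launch early instead of failing inside the trainer.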
We can run DPO with LLaVA-v1.5-7B (HF format) and the Align-Anything-400K dataset using the following script:
MODEL_NAME_OR_PATH="llava-hf/llava-1.5-7b-hf" # model path
TRAIN_DATASETS="PKU-Alignment/align-anything-400k" # dataset path
TRAIN_TEMPLATE="AA_TI2T" # dataset template
TRAIN_NAME="text-image-to-text" # dataset name
TRAIN_SPLIT="train" # split the dataset
OUTPUT_DIR="../output/dpo" # output dir
export WANDB_API_KEY="YOUR_WANDB_KEY" # wandb logging
source ./setup.sh # source the setup script
export CUDA_HOME=$CONDA_PREFIX # replace it with your CUDA path
deepspeed \
--master_port ${MASTER_PORT} \
--module align_anything.trainers.text_image_to_text.dpo \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--train_template ${TRAIN_TEMPLATE} \
--train_name ${TRAIN_NAME} \
--train_split ${TRAIN_SPLIT} \
--output_dir ${OUTPUT_DIR}
Evaluation
All evaluation scripts can be found in ./scripts. The ./scripts/evaluate.sh script runs model evaluation on the benchmarks, and parameters that require user input have been left empty. The script is as follows:
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "${SCRIPT_DIR}/../align_anything/evaluation" || exit 1
BENCHMARKS=("") # evaluation benchmarks
OUTPUT_DIR="" # output dir
GENERATION_BACKEND="" # generation backend
MODEL_ID="" # model's unique id
MODEL_NAME_OR_PATH="" # model path
CHAT_TEMPLATE="" # model template
for BENCHMARK in "${BENCHMARKS[@]}"; do
python __main__.py \
--benchmark ${BENCHMARK} \
--output_dir ${OUTPUT_DIR} \
--generation_backend ${GENERATION_BACKEND} \
--model_id ${MODEL_ID} \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--chat_template ${CHAT_TEMPLATE}
done
For example, you can evaluate LLaVA-v1.5-7B (HF format) on the POPE and MM-SafetyBench benchmarks using the following script:
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "${SCRIPT_DIR}/../align_anything/evaluation" || exit 1
BENCHMARKS=("POPE" "MM-SafetyBench") # evaluation benchmarks
OUTPUT_DIR="../output/evaluation" # output dir
GENERATION_BACKEND="vLLM" # generation backend
MODEL_ID="llava-1.5-7b-hf" # model's unique id
MODEL_NAME_OR_PATH="llava-hf/llava-1.5-7b-hf" # model path
CHAT_TEMPLATE="Llava" # model template
for BENCHMARK in "${BENCHMARKS[@]}"; do
python __main__.py \
--benchmark ${BENCHMARK} \
--output_dir ${OUTPUT_DIR} \
--generation_backend ${GENERATION_BACKEND} \
--model_id ${MODEL_ID} \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--chat_template ${CHAT_TEMPLATE}
done
You can modify the configuration files for the benchmarks in this directory to suit specific evaluation tasks and models, and adjust inference parameters for vLLM or DeepSpeed based on your generation backend. For more details about the evaluation pipeline, refer to here.
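The loops in the evaluation scripts above move on to the next benchmark even when one run fails, so a failure can go unnoticed in a long log. The following sketch collects the failed benchmarks and reports them at the end. `run_benchmarks` is a hypothetical wrapper, not part of align-anything; it takes the name of a command or shell function that evaluates a single benchmark, followed by the benchmark names.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of align-anything).
# Usage: run_benchmarks <runner> <benchmark>...
# Calls "<runner> <benchmark>" once per benchmark, keeps going after a
# failure, and returns non-zero if any run failed.
run_benchmarks() {
    local runner="$1" benchmark failed=""
    shift
    for benchmark in "$@"; do
        if ! "${runner}" "${benchmark}"; then
            # Remember the failure but continue with the next benchmark.
            failed="${failed} ${benchmark}"
        fi
    done
    if [ -n "${failed}" ]; then
        echo "failed benchmarks:${failed}" >&2
        return 1
    fi
}
```

To use it with the script above, wrap the `python __main__.py ...` call in a function that takes the benchmark name as its first argument, then call `run_benchmarks that_function "${BENCHMARKS[@]}"`.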
Inference
Interactive Client
python3 -m align_anything.serve.cli --model_name_or_path your_model_name_or_path
<img src="assets/cli_demo.gif" alt="cli_demo" style="width:600px;">
Interactive Arena
python3 -m align_anything.serve.arena \
--red_corner_model_name_or_path your_red_model_name_or_path \
--blue_corner_model_name_or_path your_blue_model_name_or_path
<img src="assets/arena_demo.gif" alt="arena_demo" style="width:600px;">
Report Issues
If you have any questions while using align-anything, don't hesitate to ask on the GitHub issue page. We will reply within 2-3 working days.
Citation
Please cite the repo if you use the data or code in this repo.
@misc{align_anything,
author = {PKU-Alignment Team},
title = {Align Anything: training all modality models to follow instructions with unified language feedback},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/PKU-Alignment/align-anything}},
}
License
align-anything is released under Apache License 2.0.