Home

Awesome

airoboros: using large language models to fine-tune large language models

This is my take on implementing the Self-Instruct paper. The approach is quite heavily modified, and does not use any human-generated seeds.

This updated implementation supports either the /v1/completions endpoint or /v1/chat/completions, which is particularly useful in that it supports gpt-4 and gpt-3.5-turbo (which is 1/10 the cost of text-davinci-003).

Huge thank you to the folks over at a16z for sponsoring the costs associated with building models and associated tools!

Install

via pip:

pip install --no-build-isolation airoboros

from source (keeping the source):

git clone https://github.com/jondurbin/airoboros
pip install -e --no-build-isolation ./airoboros

Key differences from self-instruct/alpaca

Goal of this project

Problem and proposed solution:

Progress:

LMoE

<img src="https://github.com/jondurbin/airoboros/blob/main/assets/lmoe.jpeg" alt="LMoE" width="300" class="center"/>

LMoE is the simplest architecture I can think of for a mixture of experts. It doesn't use a switch transformer, doesn't require slicing and merging layers with additional fine-tuning, etc. It just dynamically loads the best PEFT/LoRA adapter model based on the incoming request.

By using this method, we can theoretically crowdsource generation of dozens (or hundreds/thousands?) of very task-specific adapters and have an extremely powerful ensemble of models with very limited resources on top of a single base model (llama-2 7b/13b/70b).

Tuning the experts

The self-instruct code contained within this project uses many different "instructors" to generate training data to accomplish specific tasks. The output includes the instructor/category that generated the data. We can use this to automatically segment the training data to fine-tune specific "experts".

See scripts/segment_experts.py for an example of how the training data can be segmented, with a sampling of each other expert in the event of misrouting.

See scripts/tune_expert.py for an example of creating the adapter models (with positional args for expert name, model size, etc.)

NOTE: this assumes use of my fork of qlora https://github.com/jondurbin/qlora

Routing requests to the expert

The "best" routing mechanism would probably be to train a classifier based on the instructions for each category, with the category/expert being the label, but that prohibits dynamic loading of new experts.

Instead, this supports 3 options:

Running the API server

First, download the base llama-2 model for whichever model size you want, e.g.: llama-2-7b-hf

Next, download the LMoE package that corresponds to that base model, e.g.: airoboros-lmoe-7b-2.1

NOTE: 13b also available, 70b in progress

Here's an example command to start the server:

python -m airoboros.lmoe.api \
  --base-model ./llama-2-7b-hf \
  --lmoe ./airoboros-lmoe-7b-2.1 \
  --router-max-samples 1000 \
  --router-k 25 \
  --port 8000 \
  --host 127.0.0.1

to use the agent-based router, add --agent-router to the arguments

This uses flash attention via bettertransformers (in optimum). You may need to install torch nightly if you see an error like 'no kernel available', e.g.:

pip install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

Once started, you can infer using the same API scheme you'd query OpenAI API with, e.g.:

curl -H 'content-type: application/json' http://127.0.0.1:8000/v1/chat/completions -d '
{
  "model": "llama-2-7b-hf",
  "temperature": 0.7,
  "max_tokens": 2048,
  "messages": [
    {
      "role": "system",
      "content": "A chat."
    },
    {
      "role": "user",
      "content": "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
    }
  ]
}'

I've also added an vllm-based server, but the results aren't quite as good (not sure why yet). To use it, make sure you install vllm and fschat, or pip install airoboros[vllm]

python -m airoboros.lmoe.vllm \
  --model ./llama-2-7b-hf \
  --lmoe-path ../airoboros-lmoe-7b-2.1 \
  --router-max-samples 100 \
  --router-k 25 \
  --port 8000 \
  --host 127.0.0.1

Generating instructions

NEW - 2023-07-18

To better accommodate the plethora of options, the configuration has been moved to a YAML config file.

Please create a copy of example-config.yaml and configure as desired.

Once you have the desired configuration, run:

airoboros generate-instructions --config-path /path/to/config.yaml

Generating topics

NEW - 2023-07-18

Again, this is now all YAML configuration based! Please create a customized version of the YAML config file, then run:

airoboros generate-topics --config-path /path/to/config.yaml

You can override the topic_prompt string in the configuration to use a different topic generation prompt.

Support the work

https://bmc.link/jondurbin

ETH 0xce914eAFC2fe52FdceE59565Dd92c06f776fcb11

BTC bc1qdwuth4vlg8x37ggntlxu5cjfwgmdy5zaa7pswf

Models (research use only):

gpt-4 versions

llama-2 base model

2.1 dataset

2.0/m2.0

Previous generation (1.4.1 dataset)

original llama base model

Latest version (2.0 / m2.0 datasets)

Previous generation (1.4.1 dataset)

mpt-30b base model

gpt-3.5-turbo versions

Datasets