
<p align="center"> <a href="https://github.com/predibase/lorax"> <img src="docs/images/lorax_guy.png" alt="LoRAX Logo" style="width:200px;" /> </a> </p> <div align="center">

LoRAX: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs


</div>

LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.


🌳 Features

<p align="center"> <img src="https://github.com/predibase/lorax/assets/29719151/f88aa16c-66de-45ad-ad40-01a7874ed8a9" /> </p>

🏠 Models

Serving a fine-tuned model with LoRAX consists of two components: a base model that is shared across all adapters, and a set of task-specific adapter weights that are loaded dynamically per request.

LoRAX supports a number of large language models as the base model, including Llama (and variants such as CodeLlama), Mistral (and variants such as Zephyr), and Qwen. See Supported Architectures for a complete list of supported base models.

Base models can be loaded in fp16 or quantized with bitsandbytes, GPT-Q, or AWQ.

Supported adapters include LoRA adapters trained using the PEFT and Ludwig libraries. Any of the linear layers in the model can be adapted via LoRA and loaded in LoRAX.
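
As an illustration, here is a minimal sketch of training such an adapter with PEFT; the model name and target_modules are illustrative choices for a Mistral-style model, not requirements:

# A minimal PEFT sketch (illustrative, not LoRAX code): create a LoRA
# adapter over a base model's linear layers that LoRAX could later serve.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # any linear layers can be adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

Once trained and pushed to the HuggingFace Hub, the adapter can be referenced in LoRAX requests by its adapter_id.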

πŸƒβ€β™‚οΈ Getting Started

We recommend starting with our pre-built Docker image to avoid compiling custom CUDA kernels and other dependencies.

Requirements

The minimum system requirements needed to run LoRAX include:

- Nvidia GPU (Ampere generation or above)
- CUDA 11.8 compatible device drivers and above
- Linux OS

Launch LoRAX Server

Prerequisites

Install the nvidia-container-toolkit, then launch the LoRAX server:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:main --model-id $model

For a full tutorial including token streaming and the Python client, see Getting Started - Docker.

Prompt via REST API

Prompt base LLM:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64
        }
    }' \
    -H 'Content-Type: application/json'

Prompt a LoRA adapter:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
        }
    }' \
    -H 'Content-Type: application/json'

See Reference - REST API for full details.

Prompt via Python Client

Install:

pip install lorax-client

Run:

from lorax import Client

client = Client("http://127.0.0.1:8080")

# Prompt the base LLM
prompt = "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]"
print(client.generate(prompt, max_new_tokens=64).generated_text)

# Prompt a LoRA adapter
adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)
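
Continuing the example above, the client can also stream tokens as they are generated; this sketch assumes lorax-client keeps the generate_stream interface of the text-generation-inference client it derives from:

# Stream tokens from the adapter as they are generated
text = ""
for response in client.generate_stream(prompt, max_new_tokens=64, adapter_id=adapter_id):
    if not response.token.special:  # skip special tokens like EOS
        text += response.token.text
print(text)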

See Reference - Python Client for full details.

For other ways to run LoRAX, see Getting Started - Kubernetes, Getting Started - SkyPilot, and Getting Started - Local.

Chat via OpenAI API

LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI-compatible API. Just specify any adapter as the model parameter.

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
print("Response:", resp.choices[0].message.content)

See OpenAI Compatible API for details.

Next steps

Many other interesting Mistral-7B fine-tuned adapters are worth trying out. You can find more LoRA adapters on the HuggingFace Hub, or try fine-tuning your own with PEFT or Ludwig.

🙇 Acknowledgements

LoRAX is built on top of HuggingFace's text-generation-inference, forked from v0.9.4 (Apache 2.0).

We'd also like to acknowledge Punica for their work on the SGMV kernel, which is used to speed up multi-adapter inference under heavy load.

🗺️ Roadmap

Our roadmap is tracked here.