ialacol (l-o-c-a-l-a-i)
🚧 ialacol is being rewritten from Python to Rust/WebAssembly; see details in https://github.com/chenhunghan/ialacol/pull/93
Introduction
ialacol (pronounced "localai") is a lightweight drop-in replacement for the OpenAI API.
It is an OpenAI API-compatible wrapper around ctransformers, supporting GGML/GPTQ models with optional CUDA/Metal acceleration.
ialacol is inspired by other similar projects like LocalAI, privateGPT, local.ai, llama-cpp-python, closedai, and mlc-llm, with a specific focus on Kubernetes deployment.
Features
- Compatibility with OpenAI APIs, and therefore compatible with langchain (see the sketch after this list).
- Lightweight, easy deployment on Kubernetes clusters with a 1-click Helm installation.
- Streaming first! For better UX.
- Optional CUDA acceleration.
- Compatible with the GitHub Copilot VSCode extension, see the Copilot section below
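As a quick illustration of the langchain compatibility, here is a minimal sketch (not part of ialacol itself) that points langchain's legacy ChatOpenAI wrapper at a local ialacol instance. It assumes a pre-0.1 langchain and pre-1.0 openai package, and that ialacol is serving llama-2-7b-chat.ggmlv3.q4_0.bin on port 8000 as in the quick start below.
# Minimal sketch: pointing langchain's legacy ChatOpenAI wrapper at ialacol.
# Assumes `pip install "langchain<0.1" "openai<1"` and ialacol on port 8000.
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

chat = ChatOpenAI(
    openai_api_base="http://localhost:8000/v1",    # local ialacol endpoint
    openai_api_key="sk-fake",                      # placeholder key, as used elsewhere in this README
    model_name="llama-2-7b-chat.ggmlv3.q4_0.bin",  # any model ialacol is serving
)

print(chat([HumanMessage(content="How are you?")]).content)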
Supported Models
See the Recipes section below for deployment instructions.
- LLaMA 2 variants, including OpenLLaMA, Mistral, openchat_3.5 and zephyr.
- StarCoder variants
- WizardCoder
- StarChat variants
- MPT-7B
- MPT-30B
- Falcon
And all LLMs supported by ctransformers.
UI
ialacol does not have a UI; however, it is compatible with any web UI that supports the OpenAI API, for example chat-ui (after PR #541 was merged).
Assuming ialacol is running on port 8000, you can configure chat-ui to use zephyr-7b-beta.Q4_K_M.gguf served by ialacol:
MODELS=`[
{
"name": "zephyr-7b-beta.Q4_K_M.gguf",
"displayName": "Zephyr 7B β",
"preprompt": "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate.</s>\n",
"userMessageToken": "<|user|>\n",
"userMessageEndToken": "</s>\n",
"assistantMessageToken": "<|assistant|>\n",
"assistantMessageEndToken": "\n",
"parameters": {
"temperature": 0.1,
"top_p": 0.95,
"repetition_penalty": 1.2,
"top_k": 50,
"max_new_tokens": 4096,
"truncate": 999999
},
"endpoints" : [{
"type": "openai",
"baseURL": "http://localhost:8000/v1",
"completion": "chat_completions"
}]
}
]`
Similarly, to use openchat_3.5.Q4_K_M.gguf served by ialacol:
MODELS=`[
{
"name": "openchat_3.5.Q4_K_M.gguf",
"displayName": "OpenChat 3.5",
"preprompt": "",
"userMessageToken": "GPT4 User: ",
"userMessageEndToken": "<|end_of_turn|>",
"assistantMessageToken": "GPT4 Assistant: ",
"assistantMessageEndToken": "<|end_of_turn|>",
"parameters": {
"temperature": 0.1,
"top_p": 0.95,
"repetition_penalty": 1.2,
"top_k": 50,
"max_new_tokens": 4096,
"truncate": 999999,
"stop": ["<|end_of_turn|>"]
},
"endpoints" : [{
"type": "openai",
"baseURL": "http://localhost:8000/v1",
"completion": "chat_completions"
}]
}
]`
Blogs
- Use Code Llama (and other open LLMs) as Drop-In Replacement for Copilot Code Completion
- Containerized AI before Apocalypse 🐳🤖
- Deploy Llama 2 AI on Kubernetes, Now
- Cloud Native Workflow for Private MPT-30B AI Apps
- Offline AI 🤖 on Github Actions 🙅‍♂️💰
Quick Start
Kubernetes
ialacol treats Kubernetes as a first-class deployment target, which means you can automate and configure everything via the Helm chart rather than setting things up by hand.
To quickly get started with ialacol on Kubernetes, follow the steps below:
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama-2-7b-chat ialacol/ialacol
By default, this deploys Meta's Llama 2 Chat model quantized by TheBloke.
Port-forward
kubectl port-forward svc/llama-2-7b-chat 8000:8000
Chat with the default model llama-2-7b-chat.ggmlv3.q4_0.bin using curl:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
http://localhost:8000/v1/chat/completions
Alternatively, use OpenAI's client library (see more examples in the examples/openai folder):
openai -k "sk-fake" \
-b http://localhost:8000/v1 -vvvvv \
api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin \
-g user "Hello world!"
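For reference, the same request can also be made from Python. The following is a minimal sketch, assuming the pre-1.0 openai package and the default model from the quick start above:
# Minimal sketch: calling ialacol's OpenAI-compatible endpoint from Python.
# Assumes `pip install "openai<1"` and the port-forward from the step above.
import openai

openai.api_key = "sk-fake"                    # placeholder, as used elsewhere in this README
openai.api_base = "http://localhost:8000/v1"  # port-forwarded ialacol service

response = openai.ChatCompletion.create(
    model="llama-2-7b-chat.ggmlv3.q4_0.bin",
    messages=[{"role": "user", "content": "How are you?"}],
    stream=False,
)
print(response["choices"][0]["message"]["content"])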
Configuration
All configuration is done via environment variables.
Parameter | Description | Default | Example |
---|---|---|---|
DEFAULT_MODEL_HG_REPO_ID | The Hugging Face repo id of the model to download | None | TheBloke/orca_mini_3B-GGML |
DEFAULT_MODEL_HG_REPO_REVISION | The Hugging Face repo revision | main | gptq-4bit-32g-actorder_True |
DEFAULT_MODEL_FILE | The file name to download from the repo, optional for GPTQ models | None | orca-mini-3b.ggmlv3.q4_0.bin |
MODE_TYPE | Model type to override the automatic model type detection | None | gptq, gpt_bigcode, llama, mpt, replit, falcon, gpt_neox, gptj |
LOGGING_LEVEL | Logging level | INFO | DEBUG |
TOP_K | top-k for sampling | 40 | Integers |
TOP_P | top-p for sampling | 1.0 | Floats |
REPETITION_PENALTY | Repetition penalty for sampling | 1.1 | Floats |
LAST_N_TOKENS | The number of last tokens used for the repetition penalty | 1.1 | Integers |
SEED | The seed for sampling | -1 | Integers |
BATCH_SIZE | The batch size for evaluating tokens, only for GGUF/GGML models | 8 | Integers |
THREADS | Number of threads, overriding the auto-detected value (CPU count / 2); set to 1 for GPTQ models | Auto | Integers |
MAX_TOKENS | The maximum number of tokens to generate | 512 | Integers |
STOP | The token to stop the generation | None | `< |
CONTEXT_LENGTH | Overrides the auto-detected context length | 512 | Integers |
GPU_LAYERS | The number of layers to offload to the GPU | 0 | Integers |
TRUNCATE_PROMPT_LENGTH | Truncate the prompt from the beginning if set | 0 | Integers |
Sampling parameters, including TOP_K, TOP_P, REPETITION_PENALTY, LAST_N_TOKENS, SEED, MAX_TOKENS and STOP, can be overridden per request via the request body. For example,
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
http://localhost:8000/v1/chat/completions
will use temperature=2, top_p=1 and top_k=0 for this request.
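The same per-request override works from Python as well. The following is an illustrative sketch, assuming the pre-1.0 openai package; it additionally sets stream=True to show the streaming-first behaviour:
# Minimal sketch: per-request sampling overrides with streaming output.
# Assumes `pip install "openai<1"` and ialacol reachable on http://localhost:8000.
import openai

openai.api_key = "sk-fake"
openai.api_base = "http://localhost:8000/v1"

stream = openai.ChatCompletion.create(
    model="llama-2-7b-chat.ggmlv3.q4_0.bin",
    messages=[{"role": "user", "content": "Tell me a story."}],
    temperature=2,  # higher temperature -> more "creative" output
    top_p=1.0,
    top_k=0,        # overrides the TOP_K default for this request only
    stream=True,    # tokens are streamed back as they are generated
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)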
Run in Container
Image from Github Registry
There is an image hosted on ghcr.io (with CUDA11, CUDA12, Metal and GPTQ variants).
docker run --rm -it -p 8000:8000 \
-e DEFAULT_MODEL_HG_REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
-e DEFAULT_MODEL_FILE="llama-2-7b-chat.ggmlv3.q4_0.bin" \
ghcr.io/chenhunghan/ialacol:latest
From Source
For developers/contributors
Python
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML" DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin" LOGGING_LEVEL="DEBUG" THREAD=4 uvicorn main:app --reload --host 0.0.0.0 --port 9999
Docker
Build image
docker build --file ./Dockerfile -t ialacol .
Run container
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/orca_mini_3B-GGML"
export DEFAULT_MODEL_FILE="orca-mini-3b.ggmlv3.q4_0.bin"
docker run --rm -it -p 8000:8000 \
-e DEFAULT_MODEL_HG_REPO_ID=$DEFAULT_MODEL_HG_REPO_ID \
-e DEFAULT_MODEL_FILE=$DEFAULT_MODEL_FILE ialacol
GPU Acceleration
To enable GPU/CUDA acceleration, you need to use a container image built for GPU and add the GPU_LAYERS environment variable. GPU_LAYERS is determined by the size of your GPU memory; see the PR/discussion in llama.cpp to find the best value.
CUDA 11
Set deployment.image=ghcr.io/chenhunghan/ialacol-cuda11:latest and deployment.env.GPU_LAYERS to the number of layers to offload to the GPU.
CUDA 12
Set deployment.image=ghcr.io/chenhunghan/ialacol-cuda12:latest and deployment.env.GPU_LAYERS to the number of layers to offload to the GPU.
Only llama, falcon, mpt and gpt_bigcode (StarCoder/StarChat) support CUDA.
Llama with CUDA12
helm install llama2-7b-chat-cuda12 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda12.yaml
Deploys the Llama 2 7B Chat model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.
StarCoderPlus with CUDA12
helm install starcoderplus-guanaco-cuda12 ialacol/ialacol -f examples/values/starcoderplus-guanaco-cuda12.yaml
Deploys the Starcoderplus-Guanaco-GPT4-15B-V1.0 model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.
CUDA Driver Issues
If you see CUDA driver version is insufficient for CUDA runtime version when making a request, you are likely using an NVIDIA driver that is not compatible with the CUDA version. Upgrade the driver manually on the node (see here if you are using CUDA11 + AMI), or try a different CUDA version.
Metal
To enable Metal support, use the ialacol-metal image built for Metal: set deployment.image=ghcr.io/chenhunghan/ialacol-metal:latest.
For example
helm install llama2-7b-chat-metal ialacol/ialacol -f examples/values/llama2-7b-chat-metal.yaml
GPTQ
To use GPTQ, you must set deployment.image=ghcr.io/chenhunghan/ialacol-gptq:latest and deployment.env.MODEL_TYPE=gptq.
For example
helm install llama2-7b-chat-gptq ialacol/ialacol -f examples/values/llama2-7b-chat-gptq.yaml
kubectl port-forward svc/llama2-7b-chat-gptq 8000:8000
openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m gptq_model-4bit-128g.safetensors -g user "Hello world!"
Tips
Copilot
ialacol can be used as a backend for the GitHub Copilot client, since Copilot's API is almost identical to the OpenAI completion API.
However, a few things need to be kept in mind:
- The Copilot client sends a lengthy prompt to include all the related context for code completion (see copilot-explorer), which puts a heavy load on the server. If you are trying to run ialacol locally, opt in to the TRUNCATE_PROMPT_LENGTH environment variable to truncate the prompt from the beginning and reduce the workload.
- Copilot sends requests in parallel; to increase throughput, you probably need a batcher such as text-inference-batcher.
Start two instances of ialacol:
gh repo clone chenhunghan/ialacol && cd ialacol && python3 -m venv .venv && source .venv/bin/activate && python3 -m pip install -r requirements.txt
LOGGING_LEVEL="DEBUG"
THREAD=2
DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML"
DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"
TRUNCATE_PROMPT_LENGTH=100 # optional
uvicorn main:app --host 0.0.0.0 --port 9998
uvicorn main:app --host 0.0.0.0 --port 9999
Start tib, pointing to upstream ialacol instances.
gh repo clone ialacol/text-inference-batcher && cd text-inference-batcher && npm install
UPSTREAMS="http://localhost:9998,http://localhost:9999" npm start
Configure the VSCode GitHub Copilot extension to use tib:
"github.copilot.advanced": {
"debug.overrideEngine": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin",
"debug.testOverrideProxyUrl": "http://localhost:8000",
"debug.overrideProxyUrl": "http://localhost:8000"
}
Creative vs. Conservative
LLMs are known to be sensitive to sampling parameters: a higher temperature leads to more "randomness", hence the LLM becomes more "creative"; top_p and top_k also contribute to the "randomness".
If you want to make the LLM more creative:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
http://localhost:8000/v1/chat/completions
If you want to make the LLM more consistent and generate the same result for the same input:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "0.1", "top_p": "0.1", "top_k": "40" }' \
http://localhost:8000/v1/chat/completions
Roadmap
- Support starcoder model type via ctransformers
- Mimic the rest of the OpenAI API, including GET /models and POST /completions
- GPU acceleration (CUDA/Metal)
- Support POST /embeddings backed by Hugging Face Apache-2.0 embedding models such as Sentence Transformers and hkunlp/instructor
- Support Apache-2.0 fastchat-t5-3b
- Support more Apache-2.0 models such as codet5p and others listed here
Star History
Recipes
Llama-2
Deploy Meta's Llama 2 Chat model quantized by TheBloke.
7B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml
13B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml
70B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml
OpenLM Research's OpenLLaMA Models
Deploy OpenLLaMA 7B model quantized by rustformers.
ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml
VMware's OpenLLaMA 13B Open Instruct
Deploy OpenLLaMA 13B Open Instruct model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml
Mosaic's MPT Models
Deploy MosaicML's MPT-7B model quantized by rustformers. ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml
Deploy MosaicML's MPT-30B Chat model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml
Falcon Models
Deploy Uncensored Falcon 7B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml
Deploy Uncensored Falcon 40B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml
StarCoder Models (starcoder, starchat, starcoderplus, WizardCoder)
Deploy the starchat-beta model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml
Deploy the WizardCoder model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml
Pythia Models
Deploy the lightweight pythia-70m model, with only 70 million parameters (~40MB), quantized by rustformers.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml
RedPajama Models
Deploy the RedPajama 3B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml
StableLM Models
Deploy the StableLM 7B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml
Development
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
pip freeze > requirements.txt