Text Inference Batcher

text-inference-batcher is a high-performance router optimized for maximum throughput in text inference workloads.

Quick Start

Run in Container

There is an image hosted on ghcr.io.

export UPSTREAMS="http://localhost:8080,http://localhost:8081" # List of OpenAI-compatible upstreams separated by comma
docker run --rm -it -p 8000:8000 -e UPSTREAMS=$UPSTREAMS ghcr.io/ialacol/text-inference-batcher-nodejs:latest # node.js version
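
Once the container is running, any OpenAI-compatible client can talk to the gateway on port 8000. For example, a quick smoke test with fetch in Node.js 18+; the model name and the placeholder API key are assumptions, so substitute a model actually served by one of your upstreams:

// Send a chat completion request through the batcher, which routes it to a free upstream.
// The model name below is a placeholder; use one served by your upstreams.
const res = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "content-type": "application/json", authorization: "Bearer sk-" },
  body: JSON.stringify({
    model: "llama-2-13b-chat.ggmlv3.q4_0.bin",
    messages: [{ role: "user", content: "Hello world!" }],
  }),
});
console.log(await res.json());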

Kubernetes

text-inference-batcher offers first class support for Kubernetes.

Quickly deploy three inference backends using ialacol in the llm namespace.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
# the classic llama-2 13B
helm install llama-2 ialacol/ialacol \
  --set deployment.env.DEFAULT_MODEL_HG_REPO_ID="TheBloke/Llama-2-13B-chat-GGML" \
  --set deployment.env.DEFAULT_MODEL_FILE="llama-2-13b-chat.ggmlv3.q4_0.bin" \
  -n llm
# orca mini fine-tuned llama-2 https://huggingface.co/psmathur/orca_mini_v3_13b
helm install orca-mini ialacol/ialacol \
  --set deployment.env.DEFAULT_MODEL_HG_REPO_ID="TheBloke/orca_mini_v3_13B-GGML" \
  --set deployment.env.DEFAULT_MODEL_HG_REPO_ID="orca_mini_v3_13b.ggmlv3.q4_0.bin" \
  -n llm
# just another fine-tuned variant
helm install stable-platypus2 ialacol/ialacol \
  --set deployment.env.DEFAULT_MODEL_HG_REPO_ID="TheBloke/Stable-Platypus2-13B-GGML" \
  --set deployment.env.DEFAULT_MODEL_HG_REPO_ID="stable-platypus2-13b.ggmlv3.q4_0.bin" \
  -n llm

Add text-inference-batcher, pointing it at the upstreams.

helm repo add text-inference-batcher https://ialacol.github.io/text-inference-batcher
helm repo update
helm install tib text-inference-batcher/text-inference-batcher-nodejs \
  --set deployment.env.UPSTREAMS="http://llama-2:8000,http://orca-mini:8000,http://stable-platypus2:8000" \
  -n llm

Port forward text-inference-batcher for testing.

kubectl port-forward svc/tib 8000:8000 -n llm

Single gateway for all your inference backends

openai -k "sk-" -b http://localhost:8000/v1 -vv api chat_completions.create -m llama-2-13b-chat.ggmlv3.q4_0.bin -g user "Hello world!"
openai -k "sk-" -b http://localhost:8000/v1 -vv api chat_completions.create -m orca_mini_v3_13b.ggmlv3.q4_0.bin -g user "Hello world!"
openai -k "sk-" -b http://localhost:8000/v1 -vv api chat_completions.create -m stable-platypus2-13b.ggmlv3.q4_0.bin -g user "Hello world!"

Features

Rationale

Continuous batching is a simple yet powerful technique to improve the throughput of text inference endpoints (ref). Maximizing "throughput" essentially means serving the maximum number of clients simultaneously. Batching involves queuing incoming requests and distributing them to a group of inference servers when they become available.

While there are existing projects that implement batching for inference, such as Triton, Hugging Face's text-generation-inference, and vLLM's AsyncLLMEngine, there is currently no language-agnostic solution available.

text-inference-batcher aims to make batching more accessible and language-agnostic by building on a generic web standard: the HTTP interface. It brings simple yet powerful batching algorithms to any inference server with an OpenAI-compatible API. The inference server, which handles the heavy lifting, can be written in any language and deployed on any infrastructure, as long as it exposes OpenAI-compatible endpoints to text-inference-batcher.
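
As a rough sketch of that contract, an upstream only needs to accept the standard chat completion request at POST /v1/chat/completions and return the standard response. The simplified TypeScript shapes below are illustrative; field names follow the OpenAI API, and optional fields such as streaming are omitted:

// Simplified sketch of the OpenAI-compatible chat completion contract an upstream exposes.
interface ChatCompletionRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  temperature?: number;
  stream?: boolean;
}

interface ChatCompletionResponse {
  id: string;
  object: "chat.completion";
  created: number;
  model: string;
  choices: {
    index: number;
    message: { role: "assistant"; content: string };
    finish_reason: string;
  }[];
}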

In addition to providing high throughput, text-inference-batcher acts as a router and load balancer in front of all the inference servers, tracking which upstreams are healthy and which are free.

text-inference-batcher itself is written in TypeScript with an edge-first design. It can be deployed on Node.js, Cloudflare Workers, Fastly Compute@Edge, Deno, Bun, Lagon, and AWS Lambda.

Configuration

The following environment variables are available:

| Variable | Description | Default | Example |
| -------- | ----------- | ------- | ------- |
| UPSTREAMS | A list of upstreams, separated by commas | null | http://llama-2:8000,http://falcon:8000 |
| MAX_CONNECT_PER_UPSTREAM | The max number of connections per upstream | 1 | 666 |
| WAIT_FOR | The duration to wait for an upstream to become ready, in ms | 5000 (5 seconds) | 30000 (30 seconds) |
| TIMEOUT | The timeout of a connection to an upstream, in ms | 600000 (10 minutes) | 60000 (1 minute) |
| DEBUG | Verbose logging | false | true |
| TIB_PORT | Listening port | 8000 | 8889 |
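
For illustration, these variables map onto a configuration object roughly like the sketch below, using the defaults from the table; this is an illustrative sketch, not the project's actual source:

// Hypothetical config loader mirroring the documented variables and defaults.
interface Config {
  upstreams: string[];
  maxConnectPerUpstream: number;
  waitFor: number; // ms to wait for an upstream to become ready
  timeout: number; // ms before a connection to an upstream times out
  debug: boolean;
  port: number;
}

function loadConfig(env: NodeJS.ProcessEnv = process.env): Config {
  return {
    upstreams: (env.UPSTREAMS ?? "").split(",").map((u) => u.trim()).filter(Boolean),
    maxConnectPerUpstream: Number(env.MAX_CONNECT_PER_UPSTREAM ?? 1),
    waitFor: Number(env.WAIT_FOR ?? 5000),
    timeout: Number(env.TIMEOUT ?? 600000),
    debug: env.DEBUG === "true",
    port: Number(env.TIB_PORT ?? 8000),
  };
}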

Terminology

Downstream

We use the same definition of downstream as Envoy and NGINX: a downstream host connects to text-inference-batcher, sends requests, and receives responses. For example, a Python app using the OpenAI Python library to send requests to text-inference-batcher is a downstream.

Upstream

We use the same definition of upstream as Envoy and NGINX: an upstream host receives connections and requests from text-inference-batcher and returns responses. An OpenAI-compatible API server, for example ialacol, is an upstream.

Batching Algorithm

In short, text-inference-batcher is asynchronous by default. It finds a free and healthy inference server to process requests or queues the request when all inference servers are busy. The queue is consumed when a free inference server becomes available.
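
To make the idea concrete, here is a minimal TypeScript sketch of the queue-and-dispatch loop; it is an illustration with made-up upstream URLs, not the project's actual implementation:

// Requests queue up until a free, healthy upstream is available, then are dispatched to it.
type Upstream = { url: string; busy: boolean; healthy: boolean };

type Job = {
  body: unknown; // OpenAI-compatible request payload
  resolve: (res: Response) => void;
  reject: (err: unknown) => void;
};

const upstreams: Upstream[] = [
  { url: "http://llama-2:8000", busy: false, healthy: true },
  { url: "http://orca-mini:8000", busy: false, healthy: true },
];
const queue: Job[] = [];

function enqueue(body: unknown): Promise<Response> {
  return new Promise((resolve, reject) => {
    queue.push({ body, resolve, reject });
    drain();
  });
}

async function drain(): Promise<void> {
  const free = upstreams.find((u) => u.healthy && !u.busy);
  if (!free || queue.length === 0) return; // wait until an upstream frees up
  const job = queue.shift()!;
  free.busy = true;
  try {
    const res = await fetch(`${free.url}/v1/chat/completions`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(job.body),
    });
    job.resolve(res);
  } catch (err) {
    job.reject(err);
  } finally {
    free.busy = false;
    drain(); // consume the next queued request, if any
  }
}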

Development

The repo is a monorepo managed by Turborepo. Applications, such as the Node.js version of text-inference-batcher, live in ./apps/*; shared packages live in ./packages/*.

To install the dependencies

npm install

Start all applications in development mode

npm run dev

Container Image

docker build --file ./apps/text-inference-batcher-nodejs/Dockerfile -t tib:latest .
docker run --rm -p 8000:8000 tib:latest

Alternatively, build and run in a single command, removing the container after it exits.

docker run --rm -it -p 8000:8000 $(docker build --file ./apps/text-inference-batcher-nodejs/Dockerfile -q .)