optillm

optillm is an OpenAI API-compatible optimizing inference proxy that implements several state-of-the-art techniques for improving the accuracy and performance of LLMs. The current focus is on techniques that improve reasoning over coding, logical and mathematical queries. By spending additional compute at inference time, these techniques can beat frontier models across diverse tasks.

Installation

Using pip

pip install optillm
optillm             
2024-10-22 07:45:05,612 - INFO - Loaded plugin: privacy
2024-10-22 07:45:06,293 - INFO - Loaded plugin: memory
2024-10-22 07:45:06,293 - INFO - Starting server with approach: auto

Install from source

Clone the repository with git and use pip install to set up the dependencies.

git clone https://github.com/codelion/optillm.git
cd optillm
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Set up the environment variables for your provider: OPENAI_API_KEY for OpenAI; AZURE_OPENAI_API_KEY, AZURE_API_VERSION and AZURE_API_BASE for Azure OpenAI; or AZURE_API_VERSION and AZURE_API_BASE, together with a login via az login, for Azure OpenAI with managed identity (see here).

You can then run the optillm proxy as follows.

python optillm.py
2024-09-06 07:57:14,191 - INFO - Starting server with approach: auto
2024-09-06 07:57:14,191 - INFO - Server configuration: {'approach': 'auto', 'mcts_simulations': 2, 'mcts_exploration': 0.2, 'mcts_depth': 1, 'best_of_n': 3, 'model': 'gpt-4o-mini', 'rstar_max_depth': 3, 'rstar_num_rollouts': 5, 'rstar_c': 1.4, 'base_url': ''}
 * Serving Flask app 'optillm'
 * Debug mode: off
2024-09-06 07:57:14,212 - INFO - WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8000
 * Running on http://192.168.10.48:8000
2024-09-06 07:57:14,212 - INFO - Press CTRL+C to quit

Usage

Once the proxy is running, you can use it as a drop-in replacement for an OpenAI client by setting the base_url to http://localhost:8000/v1.

import os
from openai import OpenAI

OPENAI_KEY = os.environ.get("OPENAI_API_KEY")
OPENAI_BASE_URL = "http://localhost:8000/v1"
client = OpenAI(api_key=OPENAI_KEY, base_url=OPENAI_BASE_URL)

response = client.chat.completions.create(
  model="moa-gpt-4o",
  messages=[
    {
      "role": "user",
      "content": "Write a Python program to build an RL model to recite text from any position that the user provides, using only numpy."
    }
  ],
  temperature=0.2
)

print(response)

The code above applies to both OpenAI and Azure OpenAI; just remember to populate the OPENAI_API_KEY environment variable with the proper key. There are multiple ways to control the optimization techniques. They are applied in the following order of preference:

  1. Prepend the approach slug to the model name, as in the code above, where the moa- prefix selects the moa approach. The proxy logs then show the approach in use:

    2024-09-06 08:35:32,597 - INFO - Using approach moa, with gpt-4o-mini
    2024-09-06 08:35:35,358 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    2024-09-06 08:35:39,553 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    2024-09-06 08:35:44,795 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    2024-09-06 08:35:44,797 - INFO - 127.0.0.1 - - [06/Sep/2024 08:35:44] "POST /v1/chat/completions HTTP/1.1" 200 -

  2. Pass the slug in the optillm_approach field of extra_body:

    response = client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[{ "role": "user","content": "" }],
      temperature=0.2,
      extra_body={"optillm_approach": "bon|moa|mcts"}
    )

  3. Mention the approach in the prompt, within <optillm_approach> </optillm_approach> tags:

    response = client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[{ "role": "user","content": "<optillm_approach>re2</optillm_approach> How many r's are there in strawberry?" }],
      temperature=0.2
    )

[!TIP] You can also combine different techniques by using the symbols & and |. With &, the techniques are processed from left to right in a pipeline, with the response from each stage used as the request to the next. With |, all the requests run in parallel and the multiple responses are returned as a list.

Please note that the convention described above works only when the optillm server has been started with the inference approach set to auto. Otherwise, the model attribute in the client request must be set to the model name only.
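
As a minimal sketch of the conventions above, the hypothetical helpers below (they are not part of optillm) compose an approach string with & or | and prepend it to a model name. The prefixed form mirrors names that appear elsewhere in this README, such as moa-gpt-4o and readurls&memory-gpt-4o-mini. The helpers only build strings and do not contact the proxy.

```python
# Hypothetical helpers (not part of optillm) that only build strings.

def combine_approaches(slugs, parallel=False):
    """Join slugs with '&' (pipeline, left to right) or '|' (parallel)."""
    if not slugs:
        raise ValueError("at least one approach slug is required")
    separator = "|" if parallel else "&"
    return separator.join(slugs)


def model_with_approach(approach, model):
    """Prefix a model name with an approach string: '{approach}-{model}'."""
    return f"{approach}-{model}"


pipeline = combine_approaches(["readurls", "memory"])
parallel = combine_approaches(["bon", "moa", "mcts"], parallel=True)

# Pipeline form, passed as the model name.
print(model_with_approach(pipeline, "gpt-4o-mini"))
# Parallel form, passed through extra_body.
print({"optillm_approach": parallel})
```

Either string can then be used in the model field or in extra_body, as in the requests shown earlier.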

We now support all LLM providers (by wrapping around the LiteLLM SDK). For example, you can use the Gemini Flash model with moa by setting the API key in the GEMINI_API_KEY environment variable (os.environ['GEMINI_API_KEY']) and then calling the model moa-gemini/gemini-1.5-flash-002. In the output you will then see that LiteLLM is being used to call the base model.

9:43:21 - LiteLLM:INFO: utils.py:2952 - 
LiteLLM completion() model= gemini-1.5-flash-002; provider = gemini
2024-09-29 19:43:21,011 - INFO - 
LiteLLM completion() model= gemini-1.5-flash-002; provider = gemini
2024-09-29 19:43:21,481 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-002:generateContent?key=[redacted] "HTTP/1.1 200 OK"
19:43:21 - LiteLLM:INFO: utils.py:988 - Wrapper: Completed Call, calling success_handler
2024-09-29 19:43:21,483 - INFO - Wrapper: Completed Call, calling success_handler
19:43:21 - LiteLLM:INFO: utils.py:2952 - 
LiteLLM completion() model= gemini-1.5-flash-002; provider = gemini

[!TIP] optillm is a transparent proxy and will work with any LLM API or provider that has an OpenAI API-compatible chat completions endpoint. In turn, optillm exposes the same OpenAI API-compatible chat completions endpoint, which should allow you to integrate it into existing tools or frameworks easily. If the LLM you want to use doesn't have an OpenAI API-compatible endpoint (like Google or Anthropic), you can use the LiteLLM proxy server, which supports most LLMs.

The following sequence diagram illustrates how the request and responses go through optillm.

Sequence diagram showing optillm in use

In the diagram:

Local inference server

We support loading any HuggingFace model or LoRA directly in optillm. To use the built-in inference server, set OPTILLM_API_KEY to any value (e.g. export OPTILLM_API_KEY="optillm") and then use the same value as the API key in your OpenAI client. You can pass any HuggingFace model in the model field. If it is a private model, make sure you set the HF_TOKEN environment variable to your HuggingFace key. We also support adding any number of LoRAs on top of the model by using the + separator.

For example, the following code loads the base model meta-llama/Llama-3.2-1B-Instruct and then adds two LoRAs on top: patched-codes/Llama-3.2-1B-FixVulns and patched-codes/Llama-3.2-1B-FastApply. You can specify which LoRA to use with the active_adapter parameter in the extra_body field of the OpenAI SDK client. By default, the last specified adapter is loaded.

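from openai import OpenAI

OPENAI_BASE_URL = "http://localhost:8000/v1"
OPENAI_KEY = "optillm"  # same value as OPTILLM_API_KEY on the server
client = OpenAI(api_key=OPENAI_KEY, base_url=OPENAI_BASE_URL)

# Example prompt; replace with your own messages.
messages = [{"role": "user", "content": "How many r's are there in strawberry?"}]

response = client.chat.completions.create(
  # Base model followed by the LoRA adapters, joined with "+".
  model="meta-llama/Llama-3.2-1B-Instruct+patched-codes/Llama-3.2-1B-FastApply+patched-codes/Llama-3.2-1B-FixVulns",
  messages=messages,
  temperature=0.2,
  logprobs = True,
  top_logprobs = 3,
  # Select the adapter to use for this request.
  extra_body={"active_adapter": "patched-codes/Llama-3.2-1B-FastApply"},
)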
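
The + convention can be sketched with a small hypothetical helper (not part of optillm) that joins a base model with its LoRA adapters. It assumes only what the text above states: adapters follow the base model, separated by +, and the last one listed is the default.

```python
# Hypothetical helper (not part of optillm): builds the model string only.

def model_with_loras(base_model, adapters):
    """Join a base model and LoRA adapters with the '+' separator."""
    return "+".join([base_model, *adapters])


base = "meta-llama/Llama-3.2-1B-Instruct"
adapters = [
    "patched-codes/Llama-3.2-1B-FastApply",
    "patched-codes/Llama-3.2-1B-FixVulns",
]

model = model_with_loras(base, adapters)
# Per the text above, the last adapter listed is the one loaded by default.
default_adapter = adapters[-1]

print(model)
print(default_adapter)
```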

You can also use alternative decoding techniques such as cot_decoding and entropy_decoding directly with the local inference server.

response = client.chat.completions.create(
  model="meta-llama/Llama-3.2-1B-Instruct",
  messages=messages,
  temperature=0.2,
  extra_body={
        "decoding": "cot_decoding",  # or "entropy_decoding"
        # CoT specific params
        "k": 10,
        "aggregate_paths": True,
        # OR Entropy specific params
        "top_k": 27,
        "min_p": 0.03,
    }
)
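
Since the example above lists the parameters for both techniques in one dictionary, a hypothetical helper like the one below (not part of optillm) can make the "or" explicit by building an extra_body that holds only the parameters for the chosen technique. The parameter names are taken from the example; how the server treats extra keys is not assumed here.

```python
# Hypothetical helper (not part of optillm). Parameter names come from the
# example above; server-side validation behaviour is not assumed.

COT_PARAMS = {"k", "aggregate_paths"}
ENTROPY_PARAMS = {"top_k", "min_p"}


def decoding_extra_body(decoding, **params):
    """Return an extra_body dict holding only the params for one technique."""
    allowed = {"cot_decoding": COT_PARAMS, "entropy_decoding": ENTROPY_PARAMS}
    if decoding not in allowed:
        raise ValueError(f"unknown decoding technique: {decoding}")
    unexpected = set(params) - allowed[decoding]
    if unexpected:
        raise ValueError(f"{sorted(unexpected)} do not apply to {decoding}")
    return {"decoding": decoding, **params}


print(decoding_extra_body("cot_decoding", k=10, aggregate_paths=True))
print(decoding_extra_body("entropy_decoding", top_k=27, min_p=0.03))
```

The returned dictionary can be passed as extra_body in the request shown above.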

Starting the optillm proxy with an external server (e.g. llama.cpp or ollama)

[!WARNING] Note that the Anthropic API, llama-server and ollama currently do not support sampling multiple responses from a model, which limits the available approaches to the following: cot_reflection, leap, plansearch, rstar, rto, self_consistency, re2, and z3. For models on HuggingFace, you can use the built-in local inference server, as it supports multiple responses.
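
A request can be checked against this restriction before it is sent. The sketch below is a hypothetical helper, not part of optillm. It assumes that the technique slugs absent from the list above (bon, moa, mcts and pvg in the techniques table) are the ones that need multiple responses, and it leaves plugin slugs alone.

```python
import re

# Technique slugs from the "Implemented techniques" table.
ALL_TECHNIQUES = {
    "cot_reflection", "plansearch", "re2", "self_consistency", "z3", "rstar",
    "leap", "rto", "bon", "moa", "mcts", "pvg",
}
# Slugs the warning lists as usable without multi-response sampling.
SINGLE_SAMPLE = {
    "cot_reflection", "leap", "plansearch", "rstar",
    "rto", "self_consistency", "re2", "z3",
}
# Assumption: the remaining techniques need multiple responses.
NEEDS_MULTIPLE = ALL_TECHNIQUES - SINGLE_SAMPLE


def unsupported_slugs(approach):
    """Return slugs in an '&' or '|' approach string that need multiple samples."""
    slugs = [s for s in re.split(r"[&|]", approach) if s]
    return [s for s in slugs if s in NEEDS_MULTIPLE]


print(unsupported_slugs("re2&cot_reflection"))  # []
print(unsupported_slugs("bon|moa|mcts"))        # ['bon', 'moa', 'mcts']
```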

Implemented techniques

| Approach | Slug | Description |
| --- | --- | --- |
| CoT with Reflection | cot_reflection | Implements chain-of-thought reasoning with <thinking>, <reflection> and <output> sections |
| PlanSearch | plansearch | Implements a search algorithm over candidate plans for solving a problem in natural language |
| ReRead | re2 | Implements rereading to improve reasoning by processing queries twice |
| Self-Consistency | self_consistency | Implements an advanced self-consistency method |
| Z3 Solver | z3 | Utilizes the Z3 theorem prover for logical reasoning |
| R* Algorithm | rstar | Implements the R* algorithm for problem-solving |
| LEAP | leap | Learns task-specific principles from few-shot examples |
| Round Trip Optimization | rto | Optimizes responses through a round-trip process |
| Best of N Sampling | bon | Generates multiple responses and selects the best one |
| Mixture of Agents | moa | Combines responses from multiple critiques |
| Monte Carlo Tree Search | mcts | Uses MCTS for decision-making in chat responses |
| PV Game | pvg | Applies a prover-verifier game approach at inference time |
| CoT Decoding | N/A for proxy | Implements chain-of-thought decoding to elicit reasoning without explicit prompting |
| Entropy Decoding | N/A for proxy | Implements adaptive sampling based on the uncertainty of tokens during generation |
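
To give a feel for what "generates multiple responses and selects the best one" means, here is a minimal sketch of best-of-n selection. It is illustrative only and is not optillm's implementation: generate and score are hypothetical callables supplied by the caller, and the toy stand-ins exist only so the sketch runs without a model. The default n=3 matches the default of the --best-of-n parameter.

```python
# Illustrative only: not optillm's implementation. `generate` and `score`
# are hypothetical callables supplied by the caller.

def best_of_n(prompt, generate, score, n=3):
    """Generate n candidates and return the highest-scoring one."""
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins so the sketch runs without a model.
def toy_generate(prompt, i):
    return ["3", "There are 3 r's in strawberry.", "2"][i % 3]


def toy_score(prompt, candidate):
    # Toy heuristic: prefer answers that mention 3, then longer answers.
    return ("3" in candidate, len(candidate))


print(best_of_n("How many r's are there in strawberry?", toy_generate, toy_score))
```

In a real setup, the scoring step would itself be a model call or a task-specific check rather than a string heuristic.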

Implemented plugins

| Plugin | Slug | Description |
| --- | --- | --- |
| Router | router | Uses the optillm-bert-uncased model to route requests to different approaches based on the user prompt |
| Memory | memory | Implements a short term memory layer, enables you to use unbounded context length with any LLM |
| Privacy | privacy | Anonymize PII data in request and deanonymize it back to original value in response |
| Read URLs | readurls | Reads all URLs found in the request, fetches the content at the URL and adds it to the context |
| Execute Code | executecode | Enables use of code interpreter to execute python code in requests and LLM generated responses |

Available parameters

optillm supports various command-line arguments and environment variables for configuration.

| Parameter | Description | Default Value |
| --- | --- | --- |
| --approach | Inference approach to use | "auto" |
| --simulations | Number of MCTS simulations | 2 |
| --exploration | Exploration weight for MCTS | 0.2 |
| --depth | Simulation depth for MCTS | 1 |
| --best-of-n | Number of samples for best_of_n approach | 3 |
| --model | OpenAI model to use | "gpt-4o-mini" |
| --base-url | Base URL for OpenAI compatible endpoint | "" |
| --rstar-max-depth | Maximum depth for rStar algorithm | 3 |
| --rstar-num-rollouts | Number of rollouts for rStar algorithm | 5 |
| --rstar-c | Exploration constant for rStar algorithm | 1.4 |
| --n | Number of final responses to be returned | 1 |
| --return-full-response | Return the full response including the CoT with <thinking> tags | False |
| --port | Specify the port to run the proxy | 8000 |
| --optillm-api-key | Optional API key for client authentication to optillm | "" |

When using Docker, these can be set as environment variables prefixed with OPTILLM_.

Running with Docker

optillm can optionally be built and run using Docker and the provided Dockerfile.

Using Docker Compose

  1. Make sure you have Docker and Docker Compose installed on your system.

  2. Either update the environment variables in the docker-compose.yaml file or create a .env file in the project root directory and add any environment variables you want to set. For example, to set the OpenAI API key, add the following line to the .env file:

    OPENAI_API_KEY=your_openai_api_key_here
    
  3. Run the following command to start optillm:

    docker compose up -d
    

    This will build the Docker image if it doesn't exist and start the optillm service.

  4. optillm will be available at http://localhost:8000.

When using Docker, you can set these parameters as environment variables. For example, to set the approach and model, you would use:

OPTILLM_APPROACH=mcts
OPTILLM_MODEL=gpt-4
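
The naming pattern can be sketched as a small mapping from flag to variable name. This is a hypothetical helper, not optillm code. Only OPTILLM_APPROACH, OPTILLM_MODEL and OPTILLM_API_KEY appear in this README; the treatment of hyphenated flags (hyphens become underscores, upper-cased) is an assumption, so check the project's docker-compose.yaml before relying on any other name.

```python
# Assumption: hyphens become underscores and the name is upper-cased after
# the OPTILLM_ prefix. Only OPTILLM_APPROACH, OPTILLM_MODEL and
# OPTILLM_API_KEY are confirmed by this README.

def flag_to_env_var(flag):
    """Map a flag such as '--approach' to 'OPTILLM_APPROACH'."""
    name = flag.lstrip("-").replace("-", "_").upper()
    # --optillm-api-key already carries the prefix (OPTILLM_API_KEY).
    return name if name.startswith("OPTILLM_") else f"OPTILLM_{name}"


for flag in ("--approach", "--model", "--best-of-n", "--optillm-api-key"):
    print(f"{flag} -> {flag_to_env_var(flag)}")
```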

To secure the optillm proxy with an API key, set the OPTILLM_API_KEY environment variable:

OPTILLM_API_KEY=your_secret_api_key

When the API key is set, clients must include it in their requests using the Authorization header:

Authorization: Bearer your_secret_api_key
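
As a minimal sketch, the request below is built with Python's standard library so the header is visible explicitly. The key is a placeholder, and the URL assumes the default port together with the /v1/chat/completions path seen in the proxy logs. The request is constructed but not sent.

```python
import json
import urllib.request

# Placeholder values for illustration; the key must match OPTILLM_API_KEY
# on the server. The request is built here but not sent.
API_KEY = "your_secret_api_key"
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "user", "content": "How many r's are there in strawberry?"}
    ],
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

print(request.get_header("Authorization"))
# To send it once the proxy is running: urllib.request.urlopen(request)
```

With the OpenAI client, passing the same key as api_key produces this header for you.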

SOTA results on benchmarks with optillm

readurls&memory-gpt-4o-mini on Google FRAMES Benchmark (Oct 2024)

| Model | Accuracy |
| --- | --- |
| readurls&memory-gpt-4o-mini | 65.66 |
| gpt-4o-mini | 50.0 |
| readurls&memory-Gemma2-9b | 30.1 |
| Gemma2-9b | 5.1 |
| Gemma2-27b | 30.8 |
| Gemini Flash 1.5 | 66.5 |
| Gemini Pro 1.5 | 72.9 |

plansearch-gpt-4o-mini on LiveCodeBench (Sep 2024)

| Model | pass@1 | pass@5 | pass@10 |
| --- | --- | --- | --- |
| plansearch-gpt-4o-mini | 44.03 | 59.31 | 63.5 |
| gpt-4o-mini | 43.9 | 50.61 | 53.25 |
| claude-3.5-sonnet | 51.3 | | |
| gpt-4o-2024-05-13 | 45.2 | | |
| gpt-4-turbo-2024-04-09 | 44.2 | | |

moa-gpt-4o-mini on Arena-Hard-Auto (Aug 2024)

Results showing Mixture of Agents approach using gpt-4o-mini on Arena Hard Auto Benchmark

optillm with Patchwork (July 2024)

Since optillm is a drop-in replacement for the OpenAI API, you can easily integrate it with existing tools and frameworks using the OpenAI client. We used optillm with patchwork, an open-source framework that automates development gruntwork such as PR reviews, bug fixing and security patching using workflows called patchflows. With the mixture of agents approach (moa), we saw large performance gains across all the supported patchflows, as shown below.

Results showing optillm mixture of agents approach used with patchflows

References