reliableGPT: Stop failing customer requests for your LLM App
DEPRECATION WARNING: LiteLLM is our new home. Thank you for checking us out!
Use LiteLLM to 20x your throughput - load balance between Azure, OpenAI (litellm router docs)
import os

from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
# (synchronous call; inside async code use `await router.acompletion(...)` instead)
response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
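To illustrate the idea of balancing load across deployments that share one alias, here is a minimal round-robin sketch. It is not LiteLLM's routing logic; `make_picker`, `pick_deployment`, and the simplified `deployments` list are hypothetical names used only for illustration.

```python
from itertools import cycle

# Simplified deployment list: three deployments share the alias "gpt-3.5-turbo".
deployments = [
    {"model_name": "gpt-3.5-turbo", "model": "azure/chatgpt-v-2"},
    {"model_name": "gpt-3.5-turbo", "model": "azure/chatgpt-functioncalling"},
    {"model_name": "gpt-3.5-turbo", "model": "gpt-3.5-turbo"},
]


def make_picker(deployments, alias):
    """Return a function that cycles through the deployments registered under alias."""
    matching = [d for d in deployments if d["model_name"] == alias]
    if not matching:
        raise ValueError(f"no deployment registered for alias {alias!r}")
    rotation = cycle(matching)
    return lambda: next(rotation)


# Hypothetical picker: each call returns the next deployment in the rotation.
pick_deployment = make_picker(deployments, "gpt-3.5-turbo")
picked = [pick_deployment()["model"] for _ in range(4)]
print(picked)
```

After the third request the rotation wraps around to the first deployment, so no single endpoint absorbs all the traffic.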
Get 0 dropped requests for your LLM app in production
When a request to your LLM app fails, reliableGPT handles it by:
- Retrying with an alternate model - GPT-4, GPT-3.5, GPT-3.5 16k, text-davinci-003
- Retrying with a larger context window model for Context Window Errors
- Sending a Cached Response (using semantic similarity)
- Retrying with a fallback API key for Invalid API Key errors
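The context-window case in the list above can be sketched as picking the next model with a larger window. The token limits in `CONTEXT_WINDOWS` are included as assumptions, and `next_larger_model` is a hypothetical helper, not part of reliableGPT's API.

```python
# Assumed context window sizes (in tokens) for illustration only.
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
    "gpt-3.5-turbo-16k": 16384,
    "gpt-4-32k": 32768,
}


def next_larger_model(current_model, windows=CONTEXT_WINDOWS):
    """Return the model with the smallest window larger than the current one, or None."""
    current = windows[current_model]
    larger = [(size, name) for name, size in windows.items() if size > current]
    return min(larger)[1] if larger else None


print(next_larger_model("gpt-3.5-turbo"))
```

Choosing the smallest sufficient upgrade, instead of jumping straight to the largest model, keeps the retry as cheap as possible.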
Community
- Join us on Discord or Email us at ishaan@berri.ai & krrish@berri.ai
- Talk to Founders: Learn more / get help onboarding: Meeting Scheduling Link
Getting Started
Step 1. pip install package
pip install reliableGPT
Step 2. The core package is 1 line of code
Integrating with OpenAI, Azure OpenAI, Langchain, LlamaIndex
import openai
from reliablegpt import reliableGPT

# Wrap the OpenAI call so failed requests are handled by reliableGPT
openai.ChatCompletion.create = reliableGPT(openai.ChatCompletion.create, user_email='ishaan@berri.ai')
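The one-line integration works by rebinding a function name to a callable that wraps the original. A minimal sketch of that pattern follows, using a hypothetical `with_retries` wrapper and a stand-in `flaky_create` function in place of the real OpenAI client; it does not reproduce reliableGPT's internals.

```python
import functools


def with_retries(fn, max_attempts=3):
    """Wrap fn so that a raised exception triggers another attempt, up to max_attempts."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        last_error = None
        for _ in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception as error:  # broad on purpose: this is only a sketch
                last_error = error
        raise last_error
    return wrapper


call_log = {"count": 0}


def flaky_create(**kwargs):
    """Stand-in for an API call: fails on the first call, succeeds afterwards."""
    call_log["count"] += 1
    if call_log["count"] == 1:
        raise RuntimeError("simulated transient failure")
    return {"model": kwargs["model"], "content": "ok"}


# Same shape as the one-line integration: rebind the name to the wrapped callable.
flaky_create = with_retries(flaky_create)
result = flaky_create(model="gpt-3.5-turbo")
print(result, call_log["count"])
```

Because the wrapper keeps the original call signature, the rest of the application code does not need to change.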
Troubleshooting
If you run into failures, try installing this version:
pip install reliableGPT==0.2.976
Code Examples
How does it handle failures?
- Specify a fallback strategy for handling failed requests: For instance, you can define fallback_strategy=['gpt-3.5-turbo', 'gpt-4', 'gpt-3.5-turbo-16k', 'text-davinci-003'], and if you hit an error, reliableGPT will retry with the specified models in the given order until it receives a valid response. This is optional; reliableGPT also has a default strategy it uses.
- Specify backup tokens: Using your OpenAI keys across multiple servers and just had one rotated? You can pass backup keys using add_keys(). We will store these and go through them in case any keys get rotated by OpenAI. For security, we use special tokens, and you can delete all your keys (using delete_keys()) as well.
- Context Window Errors: For context window errors, it automatically retries your request with models that have larger context windows.
- Caching: If model fallback and retries fail, reliableGPT also provides caching (hosted, not in-memory). You can turn this on with caching=True. This also works for request timeout / task queue depth issues. This is optional; scroll down to learn more.
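The fallback strategy described above amounts to trying each model in order until one returns a response. A minimal sketch, assuming a hypothetical `call_model` callable in place of a real API client (`complete_with_fallback` and `fake_call_model` are illustrative names, not reliableGPT functions):

```python
def complete_with_fallback(call_model, messages, fallback_strategy):
    """Try each model in fallback_strategy in order; return the first successful response."""
    errors = []
    for model in fallback_strategy:
        try:
            return call_model(model=model, messages=messages)
        except Exception as error:  # a real handler would inspect the error type
            errors.append((model, str(error)))
    raise RuntimeError(f"all fallback models failed: {errors}")


def fake_call_model(model, messages):
    """Stand-in client: pretends every model except gpt-3.5-turbo-16k is overloaded."""
    if model != "gpt-3.5-turbo-16k":
        raise RuntimeError("simulated overload")
    return {"model": model, "content": "hello"}


strategy = ["gpt-3.5-turbo", "gpt-4", "gpt-3.5-turbo-16k", "text-davinci-003"]
response = complete_with_fallback(
    fake_call_model, [{"role": "user", "content": "Hi"}], strategy
)
print(response["model"])
```

Repeating a model name in the list simply makes the loop try that model more than once before moving on.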
Advanced Usage
Breakdown of params
Here's everything you can pass to reliableGPT
Parameter | Type | Required/Optional | Description |
---|---|---|---|
openai.ChatCompletion.create | OpenAI method | Required | This is a method from OpenAI, used for calling the OpenAI chat endpoints |
user_email | string/list | Required | Used to update you on spikes in errors. You can set user_email to one email (user_email = "ishaan@berri.ai") or to several (user_email = ["ishaan@berri.ai", "krrish@berri.ai"]) if you want alerts sent to multiple emails |
fallback_strategy | list | Optional | You can define a custom fallback strategy of OpenAI models you want to try using. If you want to try one model several times, repeat it, e.g. ['gpt-4', 'gpt-4', 'gpt-3.5-turbo'] will try gpt-4 twice before trying gpt-3.5-turbo |
model_limits_dir | dict | Optional | Note: Required if using queue_requests = True. For models you want to handle rate limits for, set model_limits_dir = {"gpt-3.5-turbo": {"max_token_capacity": 1000000, "max_request_capacity": 10000}}. You can find your account rate limits here: https://platform.openai.com/account/rate-limits |
user_token | string | Optional | Pass your user token if you want us to handle OpenAI Invalid Key Errors - we'll rotate through your stored keys (more on this below) until we get one that works |
azure_fallback_strategy | List[string] | Optional | Pass your backup Azure deployment/engine IDs. In case your requests start failing, we'll switch to one of these (if you also pass in a backup OpenAI key, we'll try the Azure endpoints before the raw OpenAI ones) |
backup_openai_key | string | Optional | Pass your OpenAI API key if you're using Azure and want to switch to OpenAI in case your requests start failing |
caching | bool | Optional | Cache your OpenAI responses. Used as a backup in case model fallback fails or the queue is overloaded (if your servers are being overwhelmed with requests, it'll alert you and return cached responses, so that customer requests don't get dropped) |
max_threads | int | Optional | Pass this in alongside caching=True so it can handle the overloaded queue scenario |
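As an illustration of how the `model_limits_dir` values could be used, the sketch below checks whether one more request fits within a per-model budget. The `within_limits` helper and the usage counters are hypothetical; reliableGPT's actual queueing logic is not described in this document.

```python
model_limits_dir = {
    "gpt-3.5-turbo": {"max_token_capacity": 1000000, "max_request_capacity": 10000}
}


def within_limits(model, used_tokens, used_requests, request_tokens,
                  limits=model_limits_dir):
    """Return True if one more request of request_tokens fits the model's limits."""
    model_limits = limits.get(model)
    if model_limits is None:
        return True  # no limits configured for this model
    fits_tokens = used_tokens + request_tokens <= model_limits["max_token_capacity"]
    fits_requests = used_requests + 1 <= model_limits["max_request_capacity"]
    return fits_tokens and fits_requests


# 999,000 tokens already used: a 500-token request still fits under the cap.
print(within_limits("gpt-3.5-turbo", 999000, 10, 500))
```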
Use Cases
Use Caching around your Query Endpoint
If you're seeing high traffic and want to make sure all your users get a response, wrap your query endpoint with reliableCache. It monitors for high thread utilization and responds with cached responses.
Step 1. Import reliableCache
from reliablegpt import reliableCache
Step 2. Initialize reliableCache
# max_threads: the maximum number of threads you've allocated for flask to run (by default this is 1).
# query_arg: the variable name you're using to pass the user query to your endpoint (Assuming this is in the params/args)
# customer_instance_arg: unique identifier for that customer's instance (we'll put all cached responses for that customer within this bucket)
# user_email: [REQUIRED] your user email - we will alert you when you're seeing high utilization
cache = reliableCache(max_threads=20, query_arg="query", customer_instance_arg="instance_id", user_email="krrish@berri.ai")
e.g. The number of threads for this flask app is 50
if __name__ == "__main__":
    from waitress import serve
    serve(app, host="0.0.0.0", port=4000, threads=50)
Step 3. Decorate your endpoint
## Decorate your endpoint with cache.cache_wrapper, this monitors for ..
## .. high thread utilization and sends cached responses when that happens
@app.route("/test_func")
@cache.cache_wrapper
def test_fn():
    # your endpoint logic goes here
    return "response"  # placeholder return value
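To make the idea concrete, here is a simplified sketch of a decorator that serves a cached response when too many calls are in flight. It is not reliableCache's implementation: `ThreadAwareCache`, its exact-match in-memory dictionary, and the 0.8 utilization threshold are assumptions for illustration, and the endpoint takes the query as a plain argument instead of reading Flask request args.

```python
import functools
import threading


class ThreadAwareCache:
    """Hypothetical cache that answers from memory when utilization is high."""

    def __init__(self, max_threads, threshold=0.8):
        self.max_threads = max_threads
        self.threshold = threshold  # assumed utilization cutoff
        self.active = 0             # number of calls currently in flight
        self.lock = threading.Lock()
        self.responses = {}         # query -> last computed response

    def cache_wrapper(self, fn):
        @functools.wraps(fn)
        def wrapper(query):
            with self.lock:
                self.active += 1
                overloaded = self.active / self.max_threads > self.threshold
            try:
                if overloaded and query in self.responses:
                    return self.responses[query]  # skip the expensive call
                result = fn(query)
                with self.lock:
                    self.responses[query] = result
                return result
            finally:
                with self.lock:
                    self.active -= 1
        return wrapper


demo_cache = ThreadAwareCache(max_threads=20)
handled = []


@demo_cache.cache_wrapper
def answer(query):
    handled.append(query)  # record that the real endpoint logic ran
    return f"answer to {query}"


first = answer("hi")       # low utilization: runs the endpoint and caches the result
demo_cache.active = 19     # simulate 19 other in-flight requests
second = answer("hi")      # high utilization: served from the cache
demo_cache.active = 0
print(first, second, handled)
```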
Switch between Azure OpenAI and raw OpenAI
If you're using Azure OpenAI and facing issues like read/request timeouts or rate limits, you can use reliableGPT to fall back to the raw OpenAI endpoints if your Azure OpenAI endpoint fails.
Step 1. Import reliableGPT
from reliablegpt import reliableGPT
Step 2. Set your backup openai token + [Optional] Set fallback strategy
Note: This is stored locally.
import os
import openai

# Set the backup OpenAI key
openai.ChatCompletion.create = reliableGPT(
    openai.ChatCompletion.create,
    user_email="krrish@berri.ai",
    backup_openai_key=os.getenv("OPENAI_API_KEY"),
    fallback_strategy=["gpt-4", "gpt-4-32k"],
    verbose=True)
Step 3. Test with a bad Azure Key!
# bad key
openai.api_key = "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"
list_questions = ["Hey! how's it going?"]  # sample questions to send
for question in list_questions:
    response = openai.ChatCompletion.create(model="gpt-4", engine="chatgpt-test", messages=[{"role": "user", "content": question}])
    print(response)
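The order in which backups are tried, Azure deployments first and then raw OpenAI models, can be sketched as below. `build_attempt_order` is a hypothetical helper and the tuple shape is an assumption; only the ordering follows the parameter descriptions in the table above.

```python
def build_attempt_order(primary_engine, azure_fallback_strategy=None,
                        backup_openai_key=None, fallback_strategy=None):
    """List the (provider, target) pairs to try, in order."""
    attempts = [("azure", primary_engine)]
    attempts += [("azure", engine) for engine in (azure_fallback_strategy or [])]
    if backup_openai_key:  # raw OpenAI is only reachable with a backup key
        attempts += [("openai", model) for model in (fallback_strategy or [])]
    return attempts


order = build_attempt_order(
    "chatgpt-test",
    azure_fallback_strategy=["chatgpt-v-2"],
    backup_openai_key="placeholder-key",  # not a real key
    fallback_strategy=["gpt-4", "gpt-4-32k"],
)
print(order)
```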
Handle overloaded server w/ Caching
If all else fails, reliableGPT will respond with previously cached responses. We store these in a Supabase table and use cosine similarity for similarity-based retrieval. Why not an in-memory cache? Because when we autoscale or push new updates to our server, we don't want the cache to be wiped out.
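Similarity-based retrieval can be sketched with a toy bag-of-words vector in place of real embeddings. Everything here (`embed`, `lookup`, the in-memory `cache_rows` list, and the 0.7 threshold) is an assumption for illustration; the hosted cache described above uses a Supabase table and its own embedding step.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Stand-in for rows in a hosted table.
cache_rows = [
    {"query": "how do I reset my password", "response": "Use the reset link."},
    {"query": "what are your opening hours", "response": "9am to 5pm."},
]


def lookup(query, rows=cache_rows, threshold=0.7):
    """Return the cached response for the most similar stored query, or None."""
    query_vec = embed(query)
    best = max(
        rows,
        key=lambda row: cosine_similarity(query_vec, embed(row["query"])),
        default=None,
    )
    if best and cosine_similarity(query_vec, embed(best["query"])) >= threshold:
        return best["response"]
    return None


print(lookup("how do I reset my password please"))
```

The threshold guards against returning an unrelated cached answer when nothing similar has been seen before.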
Step 1. Import reliableGPT
from reliablegpt import reliableGPT
Step 2. Turn on caching
# Turn on caching
openai.ChatCompletion.create = reliableGPT(
    openai.ChatCompletion.create,
    user_email="krrish@berri.ai",
    caching=True)
Optional: Pass your max threads
Tell reliableGPT the maximum number of threads you have handling your requests.
e.g. The number of threads for this flask app is 50
if __name__ == "__main__":
    from waitress import serve
    serve(app, host="0.0.0.0", port=4000, threads=50)
Tell reliableGPT what the maximum number of threads is - max_threads=50
# Turn on caching and pass the maximum number of threads
openai.ChatCompletion.create = reliableGPT(
    openai.ChatCompletion.create,
    user_email="krrish@berri.ai",
    caching=True,
    max_threads=50)
Step 3. Test it
Check out ./reliablegpt/tests/test_Caching
We spin up a Flask server and then run a test script that sends a set of questions to it.
Handle rotated keys
Step 1. Add your keys
from reliablegpt import add_keys, delete_keys, reliableGPT
# Storing your keys
user_email = "krrish@berri.ai"  # Replace with your email
token = add_keys(user_email, ["openai_key_1", "openai_key_2", "openai_key_3"])
Pass in a list of your OpenAI keys. We will store these and go through them in case any keys get rotated by OpenAI. You will get a special token; give that to reliableGPT.
Step 2. Initialize reliableGPT
import openai
openai.api_key = "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"  # Invalid OpenAI key
print("Initializing reliableGPT")
openai.ChatCompletion.create = reliableGPT(openai.ChatCompletion.create, user_email=user_email, user_token=token)
reliableGPT catches the Invalid API Key error thrown by OpenAI and rotates through the remaining keys to ensure you have zero downtime in production.
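The rotation behavior can be sketched as a loop over stored keys that moves on whenever a key is rejected. `InvalidKeyError`, `call_with_rotation`, and the stand-in `fake_request` are hypothetical names; how the real library looks up stored keys from the token returned by `add_keys` is not shown here.

```python
class InvalidKeyError(Exception):
    """Stand-in for the provider's invalid-API-key error."""


def call_with_rotation(request_fn, keys):
    """Try request_fn with each key in order; skip keys rejected as invalid."""
    for key in keys:
        try:
            return request_fn(api_key=key)
        except InvalidKeyError:
            continue  # this key was rotated or revoked, try the next one
    raise InvalidKeyError("no valid key left")


def fake_request(api_key):
    """Stand-in request: only 'openai_key_3' is still valid."""
    if api_key != "openai_key_3":
        raise InvalidKeyError(api_key)
    return f"response using {api_key}"


rotated = call_with_rotation(
    fake_request, ["openai_key_1", "openai_key_2", "openai_key_3"]
)
print(rotated)
```

Only invalid-key errors trigger rotation in this sketch; any other exception propagates to the caller unchanged.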
Step 3. Delete keys
# Deleting your keys from reliableGPT
delete_keys(user_email=user_email, user_token=token)
You own your keys, and can delete them whenever you want.
Support
Reach out to us on Discord or Email us at ishaan@berri.ai & krrish@berri.ai