# :rocket: Smart Load Balancing for OpenAI Endpoints and Azure API Management

*Figure: Smart APIM load balancing*

Many service providers, including OpenAI, set limits on API calls. Azure OpenAI, for instance, enforces limits on tokens per minute (TPM) and requests per minute (RPM). Exceeding these limits results in an HTTP 429 ('Too Many Requests') status code and a 'Retry-After' header that tells the caller how long to wait before sending the next request.
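
As a point of reference, a client that talks to a single Azure OpenAI backend directly has to handle this itself. The sketch below is a minimal, hypothetical example (the endpoint, deployment name, API version, and key are placeholders) that pauses for the Retry-After interval whenever a 429 is returned:

```python
import time
import requests

# Hypothetical placeholders -- substitute your own resource, deployment, and key.
ENDPOINT = "https://YOUR-RESOURCE.openai.azure.com"
DEPLOYMENT = "gpt-4o"
API_KEY = "<your-api-key>"

def chat_completion(messages, max_retries=5):
    """Call a single Azure OpenAI backend, pausing for Retry-After whenever it returns 429."""
    url = f"{ENDPOINT}/openai/deployments/{DEPLOYMENT}/chat/completions?api-version=2024-02-01"
    for _ in range(max_retries):
        response = requests.post(
            url,
            headers={"api-key": API_KEY},
            json={"messages": messages},
            timeout=60,
        )
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Throttled: wait the time the service asks for, then try again.
        wait_seconds = int(response.headers.get("Retry-After", "1"))
        time.sleep(wait_seconds)
    raise RuntimeError("Still throttled after retries")
```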

This solution incorporates a comprehensive approach, considering UX/workflow design, application resiliency, fault-handling logic, appropriate model selection, API policy configuration, logging, and monitoring. It introduces an Azure API Management policy that exposes a single endpoint to your applications while efficiently spreading consumption across multiple Azure OpenAI (or other API) backends based on their availability and priority.
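
From the application's side, only the base URL changes: requests go to the API Management gateway and the policy chooses a backend. Here is a minimal sketch, assuming a hypothetical gateway URL and that the gateway accepts the same api-key header as Azure OpenAI (your APIM configuration may use a subscription key header instead):

```python
import requests

# Illustrative values -- your APIM gateway URL, API suffix, and auth header may differ.
APIM_ENDPOINT = "https://your-apim.azure-api.net/openai"
DEPLOYMENT = "gpt-4o"

def chat_via_gateway(messages, subscription_key):
    """Send the request to the single APIM endpoint; APIM routes it to a healthy OpenAI backend."""
    url = f"{APIM_ENDPOINT}/deployments/{DEPLOYMENT}/chat/completions?api-version=2024-02-01"
    response = requests.post(
        url,
        headers={"api-key": subscription_key},  # or Ocp-Apim-Subscription-Key, depending on your APIM setup
        json={"messages": messages},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```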

## :sparkles: Smart vs. Round-Robin Load Balancers

Our solution stands out in its intelligent handling of OpenAI throttling. It is responsive to the HTTP status code 429 (Too Many Requests), a common occurrence due to rate limits in Azure OpenAI. Unlike traditional round-robin methods, our solution dynamically directs traffic to non-throttling OpenAI backends, based on a prioritized order. When a high-priority backend starts throttling, traffic is automatically rerouted to lower-priority backends until the former recovers.
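
The actual routing logic lives in an API Management policy (policy XML with C# expressions). As a rough, simplified Python analogue of that idea (all names and values here are invented for illustration), the balancer remembers until when each backend is throttled and fails over to the next backend in priority order when a 429 arrives:

```python
import time
import requests

# Invented structure for illustration: a lower "priority" number means a preferred backend.
BACKENDS = [
    {"url": "https://primary.openai.azure.com",   "priority": 1, "throttled_until": 0.0},
    {"url": "https://secondary.openai.azure.com", "priority": 2, "throttled_until": 0.0},
]

def call_with_failover(path, payload, api_key):
    """Try backends in priority order, skipping any that are still inside their Retry-After window."""
    for backend in sorted(BACKENDS, key=lambda b: b["priority"]):
        if backend["throttled_until"] > time.monotonic():
            continue  # still cooling down from an earlier 429
        response = requests.post(
            backend["url"] + path,
            headers={"api-key": api_key},
            json=payload,
            timeout=60,
        )
        if response.status_code == 429:
            # Mark this backend unavailable for the window the service requested,
            # then fall through to the next (lower-priority) backend.
            retry_after = int(response.headers.get("Retry-After", "10"))
            backend["throttled_until"] = time.monotonic() + retry_after
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("All backends are currently throttled")
```

Once the Retry-After window of the higher-priority backend expires, it is considered available again and traffic flows back to it.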

*Figures: backend routing in active mode and when a backend is throttling*

## Key Features

## Scenarios and Priority Groups

In cases where multiple backends share the same priority and are all operational, the algorithm selects randomly among them, as sketched below.
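
Continuing the hypothetical sketch above (same invented backend structure), the selection step filters out throttled backends, keeps only the best available priority group, and picks one of its members at random:

```python
import random
import time

def pick_backend(backends):
    """Return a random backend from the highest-priority group that is not currently throttled."""
    now = time.monotonic()
    available = [b for b in backends if b["throttled_until"] <= now]
    if not available:
        return None  # every backend is cooling down
    best_priority = min(b["priority"] for b in available)
    candidates = [b for b in available if b["priority"] == best_priority]
    return random.choice(candidates)
```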

## Content Structure

## Conclusion

This smart load balancing solution effectively addresses the challenges posed by API limit constraints in Azure OpenAI. By implementing the strategies outlined in the provided documentation, you can ensure efficient and reliable application performance, leveraging the full potential of your OpenAI and Azure API Management resources.

## Productionizing the Solution

Transitioning to production requires careful consideration of security, performance, and cost. For a detailed guide to productionizing this solution, including security enhancements, performance optimization, and continuous monitoring, refer to the accompanying documentation.

## :link: Related articles