This blog focuses on good practices for monitoring Azure Open AI limits and quotas. With the growing interest and application of Generative AI, Open AI models have emerged as pioneers in this transformative era. To maintain consistent and predictable performance for all users, these models impose certain limits and quotas. For Independent Software Vendors (ISVs) and Digital Natives utilizing these models, understanding these limits and establishing efficient monitoring strategies is paramount to ensures a good customer experience to the end-users of their products and services. This blog seeks to provide a comprehensive understanding of these monitoring strategies, thereby enabling ISVs and Digital Natives to optimally leverage AI technologies for their respective customer bases.
Understanding Limits and Quotas
Azure OpenAI's quota feature enables assignment of rate limits to your deployments, up-to a global limit called your “quota”. Quota is assigned to your subscription on a per-region, per-model basis in units of **Tokens-per-Minute** (TPM). Your subscription is onboarded with a default quota for most models.
Refer to this document for default TPM values. You can allocate TPM among deployments until reaching quota. If you exceed a model's TPM limit in a region, you can reassign quota among deployments or request a quota increase. Alternatively, if viable, consider creating a deployment in a new Azure region in the same geography as the existing one.
For example, with a 240,000 TPM quota for GPT-35-Turbo in East US, you could create one deployment of 240K TPM, two of 120K TPM each, or multiple deployments adding up to less than 240K TPM in that region.
TPM rate limits are based on the maximum tokens **estimated** to be processed when the request is received. It is different than the token count used for billing, which is computed after all processing is completed. Azure OpenAI calculates a max processed-token count per request using:
- Prompt text and count
- The max_tokens setting
- The best_of setting
This estimated count is added to a running token count of all requests, which resets every minute. A 429 response code is returned once the TPM rate limit is reached within the minute.
A **Requests-Per-Minute** (RPM) rate limit is also enforced. It is set proportionally to the TPM assignment at a ratio of 6 RPM per 1000 TPM. If requests aren't evenly distributed over a minute, a 429 response may be received. Azure OpenAI Service evaluates incoming requests' rate over a short period, typically 1 or 10 seconds, and issues a 429 response if requests surpass the RPM limit. For example, if the service monitors with a 1-second interval, a 600-RPM deployment would be throttled if more than 10 requests are received per second.
In addition to the standard quota, there is also a provisioned throughput capability, or PTU. It is useful to think of the standard quota as a serverless mode, where your requests are served from a pool of resources and no capacity is reserved for you, hence the overall latency could vary. In contrast, with a provisioned throughput capability, you specify the amount of throughput you require for your application. The service then provisions the necessary compute and ensures it is ready for you. This gives you a more predictable performance and stable max latency. For high throughput workloads this may provide cost savings versus the token-based consumption. At the time of the writing, provisioned throughput units are not available by default. For more details about it, contact your Microsoft Account team.
There is also a limit of 30 Azure OpenAI resource instances per region. For an exhaustive and up-to-date list of quotas and limits please check this document. It is important to plan ahead on how you will manage and segregate tenant data and traffic in order to ensure reliable performance and optimal costs. Please check the Azure Open AI service specific guidance for considerations and strategies pertinent to multitenant solutions.
Choosing between tokens-per-minute and provisioned throughput models
To choose effectively between TPM and PTU you need to understand that there are minimum PTUs per deployment required. If your current usage is above the requirement and expected to grow, it might be more economically feasible to purchase provisioned capacity. In high token usage scenarios, this provides a lower per token price and stable max latency. It is important to understand that with PTUs, you are isolated and protected from the noisy neighbor problem of a SaaS application with shared resources. However, you can still experience higher than average latency caused by other factors, such as the total load you send to the service, length of the prompt and response, etc.
Use this Capacity Calculator to estimate your PTU requirements and reach out to your account team for the latest limits.
Effective Monitoring Techniques
Now that we understand better the limits and quotas of the service, let's discuss how to effectively monitor usage and set up alerts to be notified and take action when you reach the limits and quotas assigned.
Azure OpenAI service has metrics and logs available as part of the Azure Monitor capabilities. Metrics are available out of the box, at no additional cost. By default, a history of 30 days is kept. If you need to keep these metrics for longer, or route to a different destination, you can do so by enabling it in the Diagnostic settings.
Metrics are grouped into four categories:
- HTTP Requests dimensions: Model Name, Model Version, Deployment, Status Code, Stream Type, and Operation.
- Tokens-Based Usage: Active tokens, Generated Completions Tokens, Processed Inference and Prompt Tokens.
- PTU Utilization dimensions: Model Name, Model Version, Deployment, and Stream Type.
- Fine-tuning: Training Hours by Deployment and Training Hours by Model Name.
Additionally, each API response header contains the RateLimit-Global-Remaining and RateLimit-Global-Reset. And the response body contains a usage section with the prompt tokens, completion tokens, and total tokens values that shows the billing tokens per request.
The available logs in Azure OpenAI are Audit logs, Request and Response logs, and Trace Logs. Once you enable these through the Diagnostic settings, you can send these to a Log Analytics workspace, Storage account, Event Hub, or a partner solution. Keep in mind that using diagnostic settings and sending data to Azure Monitor Logs has other costs associated with it. For more information, see Azure Monitor Logs cost calculations and options.
My colleagues created an Azure Monitor Workbook that serves as a great baseline to start monitoring your Azure Open AI service logs and metrics.
Optimization Recommendations
Use LLMs for what they are best at - natural language understanding and fluent language generation. This means understanding that LLMs are boxes trying to predict the most likely next token and just because you could use an LLM for a task, it doesn't necessarily make it the most optimal tool for it.
1. Always start with can this be done in code? Are there existing libraries, tools, patterns that can perform the task? If yes, use those. These will probably be more performant and cost less.
Examples: use Azure AI Language service for key phrase extraction instead of the LLM; use standard libraries to do math operations, data aggregation, etc.
2. Control the size of the input prompt (e.g. set a limit on the user input field; in RAG, depending on scenario, restrict the number of relevant chunks sent to the LLM) and completion (with max_tokens and best_of).
3. Call the GPT models as few times as possible. Ensure you gather all the data you need to generate an optimal response, and only then call the model.
4. Use the cheapest model that gets the task done. This could mean using GPT 3.5 instead of GPT 4 for tasks where the cheapest model performs at an acceptable level.
Prevention and Response Strategies for Limit Exceeding
Here are some best practices and strategies to avoid rate limiting errors in a tokens per minute, i.e. Pay-As-You-Go model:
- Use minimum feasible values for max_tokens and best_of in your scenario. For instance, don't set a high max-tokens if expecting small responses.
- Manage your quota to allocate more TPM to high-traffic deployments and less to those with limited needs.
- Avoid sharp changes in the workload. Increase the workload gradually.
- Test different load increase patterns.
- Check the size of prompts against the model limits before sending the request to the Azure Open AI service. For example, for GPT-4 (8k), a max request token limit of 8,192 is supported. If your prompt is 10K in size, then this will fail, and also any subsequent retries would fail as well, consuming your quota.
- retrying with exponential backoff - in practice it means performing a short sleep when a rate limit error is hit, then retrying the unsuccessful request. If the request is still unsuccessful, the sleep length is increased and the process is repeated. Note that unsuccessful requests contribute to your per-minute limit, so continuously resending a request won’t work. This strategy is useful for real-time requests from users.
- batching requests - If you're hitting the limit on requests per minute, but have headroom on tokens per minute, you can increase your throughput by batching multiple tasks into each request. This will allow you to process more tokens per minute, especially with the smaller models.
- when handling batch processing, maximizing throughput matters more than latency. Proactively adding delay between batch requests can help. E.g., if your rate limit 20 requests per minute, add a delay of 3–6 seconds to each request. This can help you operate near the rate limit ceiling without hitting it and incurring wasted requests.
For more details on these strategies and an example of a parallel processing script, please see this notebook and documentation from Azure Open AI.
If your workload is particularly sensitive to latency and cannot tolerate latency spikes, you can consider implementing a mechanism that checks the latency of Azure Open AI in different Azure regions and send requests to the region with the smallest latency. You can group regions into geographies, like Americas, EMEA and Asia, and perform these checks on a per geography basis. This should also account for any compliance regulation and data residency requirements. For a more detailed walkthrough of this strategy, please check this blog.
In Azure, API Management (APIM) service can help you implement some of these best practices and strategies. APIM supports queueing, rate throttling, error handling, managing user quotas, as well as distributing requests to different Azure Open AI instances, potentially located in different regions to implement the pattern described above.
Conclusion
In conclusion, understanding the limits, quotas, and optimization techniques for Azure Open AI is crucial for effectively utilizing the service and achieving optimal performance and cost efficiency. By carefully monitoring usage, setting up alerts, and implementing prevention and response strategies for limit exceeding, you can ensure reliable performance and avoid unnecessary disruptions.
The insights and recommendations provided in this document serve as a valuable guide to help you make informed decisions and optimize your Azure Open AI use-cases. By following these best practices, such as leveraging existing libraries and tools, controlling input prompt size, minimizing API calls, and using the most cost-effective models, you can maximize the value and efficiency of your AI applications.
Remember to plan ahead, allocate resources wisely, and continuously monitor and adjust your usage based on the metrics and logs available through Azure Monitor. By doing so, you can proactively address any potential issues, avoid rate limiting errors, and deliver a seamless and responsive experience to your users.