By The Tech Platform

Quotas in Azure OpenAI Service: Best Practices and Limit Increase Requests

Quotas are the rules that help Azure OpenAI Service run smoothly: they ensure that the AI models the service offers are shared fairly and used safely. In this article, we'll explain what quotas are, how to use them well, and what to do if you need more. Whether you're an experienced AI developer or just starting out, knowing how quotas work and following a few good practices to stay within them is essential to making the most of this powerful service.


Table of Contents:

What is Quota in Azure OpenAI Service?

What is TPM?

The Default Quota Limits

Updated Quota Limits

How to Use Quota?

How to Assign Quota to Deployment?

How to Send a Request to Increase the Quota Limits

Best Practices to Stay Within Limits

Conclusion


What is Quota in Azure OpenAI Service?

Quota in Azure OpenAI Service is a feature that enables you to manage the rate at which your deployments can consume Azure OpenAI resources. Quota is assigned to your subscription on a per-region, per-model basis in units of Tokens-per-Minute (TPM). When you create a deployment, you assign it a TPM value. The total TPM assigned to all of your deployments in a region cannot exceed your quota for that region.
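This rule amounts to a simple sum check. Here is a minimal sketch in plain Python (the function name and numbers are illustrative, not part of any Azure SDK):

```python
def can_create_deployment(existing_tpm, new_tpm, regional_quota):
    """A new deployment only fits if the total TPM assigned to all
    deployments in the region stays within that region's quota."""
    return sum(existing_tpm) + new_tpm <= regional_quota

# Regional quota of 240K TPM, with deployments already assigned 100K and 80K:
print(can_create_deployment([100_000, 80_000], 60_000, 240_000))  # True
print(can_create_deployment([100_000, 80_000], 70_000, 240_000))  # False
```

The Azure portal performs the equivalent check for you when you try to create or edit a deployment.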


What is TPM?

TPM stands for Tokens-per-Minute, the unit of measurement used to define the rate at which Azure OpenAI resources can be consumed. As described above, each deployment is assigned a TPM value when it is created, and the total TPM across all deployments in a region cannot exceed that region's quota.


TPM is used to control the cost and performance of Azure OpenAI deployments. By limiting the TPM assigned to a deployment, you can prevent it from consuming too many resources, which could lead to unexpected costs. You can also use TPM to ensure that all of your deployments have access to the resources they need, which can help to improve performance.
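The service enforces TPM limits on its side, but it can also help to track token spend on the client. Below is a rough sketch (plain Python; the `TpmBudget` class and its numbers are illustrative, not part of any Azure SDK) of tracking tokens consumed in a sliding 60-second window against a deployment's TPM value:

```python
import time
from collections import deque
from typing import Optional


class TpmBudget:
    """Tracks tokens consumed in a sliding 60-second window
    against a Tokens-per-Minute (TPM) budget."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.events = deque()  # (timestamp, tokens) pairs

    def _prune(self, now: float) -> None:
        # Drop spend records older than 60 seconds.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def used(self, now: Optional[float] = None) -> int:
        now = time.monotonic() if now is None else now
        self._prune(now)
        return sum(tokens for _, tokens in self.events)

    def try_spend(self, tokens: int, now: Optional[float] = None) -> bool:
        """Record the spend if it fits in the budget; otherwise refuse."""
        now = time.monotonic() if now is None else now
        if self.used(now) + tokens > self.tpm_limit:
            return False
        self.events.append((now, tokens))
        return True


# Example: a deployment assigned 240K TPM.
budget = TpmBudget(tpm_limit=240_000)
print(budget.try_spend(200_000, now=0.0))   # True: fits in the budget
print(budget.try_spend(50_000, now=1.0))    # False: would exceed 240K in the window
print(budget.try_spend(50_000, now=61.0))   # True: the old spend has aged out
```

A client-side check like this lets an application queue or shed work before the service starts returning throttling errors.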


Here are some examples of how you can use quota to manage your Azure OpenAI resources:

  • You can use quota to limit the rate at which your deployments can generate text, translate languages, or answer questions.

  • You can use quota to prevent your deployments from consuming too many resources, which could lead to unexpected costs.

  • You can use quota to prioritize the use of Azure OpenAI resources for specific deployments or applications.

Benefits of using quota in Azure OpenAI Service:

  • Cost control: Quota can help you to control your costs by preventing your deployments from consuming too many resources.

  • Performance optimization: Quota can help you to optimize the performance of your applications by ensuring that they have access to the resources they need.

  • Fairness and equity: Quota can help to ensure that all users of Azure OpenAI Service have fair and equitable access to resources.



The Default Quota Limits

The default quota limits for the Azure OpenAI Service are as follows:

  • OpenAI resources per region per Azure subscription: 30

  • Default DALL-E quota limits: 2 concurrent requests

  • Maximum prompt tokens per request: Varies per model

  • Max fine-tuned model deployments: 2

  • Total number of training jobs per resource: 100

  • Max simultaneous running training jobs per resource: 1

  • Max training jobs queued: 20

  • Max Files per resource: 30

  • Total size of all files per resource: 1 GB

  • Max training job time (job will fail if exceeded): 720 hours

  • Max training job size (tokens in training file) x (# of epochs): 2 Billion

  • Max size of all files per upload (Azure OpenAI on your data): 16 MB
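One of these limits lends itself to a quick arithmetic check before submitting a fine-tuning job. A small sketch (the function name is illustrative, not an SDK call):

```python
MAX_JOB_SIZE = 2_000_000_000  # tokens-in-file x epochs cap from the list above

def fine_tune_job_ok(training_file_tokens: int, epochs: int) -> bool:
    """Check the 'max training job size' rule:
    (tokens in training file) x (number of epochs) must not exceed 2 billion."""
    return training_file_tokens * epochs <= MAX_JOB_SIZE

# A 500M-token file for 4 epochs is exactly at the 2B cap, so it is allowed:
print(fine_tune_job_ok(500_000_000, 4))  # True
# The same file for 5 epochs would exceed the cap:
print(fine_tune_job_ok(500_000_000, 5))  # False
```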


Updated Quota Limits

Below is the updated quota limits table, giving an overview of the latest per-model, per-region TPM allocations in Azure OpenAI Service.

| Model Name | Regions | TPM Limit |
| --- | --- | --- |
| gpt-35-turbo | East US, South Central US, West Europe, France Central, UK South | 240K |
| gpt-35-turbo | North Central US, Australia East, East US 2, Canada East, Japan East, Sweden Central, Switzerland North | 300K |
| gpt-35-turbo-16k | East US, South Central US, West Europe, France Central, UK South | 240K |
| gpt-35-turbo-16k | North Central US, Australia East, East US 2, Canada East, Japan East, Sweden Central, Switzerland North | 300K |
| gpt-4 | East US, South Central US, West Europe, France Central | 20K |
| gpt-4 | North Central US, Australia East, East US 2, Canada East, Japan East, UK South, Sweden Central, Switzerland North | 40K |
| gpt-4-32k | East US, South Central US, West Europe, France Central | 60K |
| gpt-4-32k | North Central US, Australia East, East US 2, Canada East, Japan East, UK South, Sweden Central, Switzerland North | 80K |
| text-embedding-ada-002 | East US, South Central US, West Europe, France Central | 240K |
| text-embedding-ada-002 | North Central US, Australia East, East US 2, Canada East, Japan East, UK South, Switzerland North | 350K |


How to Use Quota?

To use Quotas in Azure OpenAI Service, you need to create a resource. Once you have created an Azure OpenAI resource, you can view your current quota and usage for each region and model in the Azure portal.



How to Assign Quota to Deployment?

After the resource is created, it appears on the Azure portal home page; open it from there.


Follow the below steps to assign Quota to Deployment in Azure OpenAI:


STEP 1: Click on the resource you created.


STEP 2: Now, click on the "Deploy".

Azure OpenAI service - Click deploy

STEP 3: To assign a quota to a new deployment, click "+ Create new deployment".


Azure OpenAI service - Set the TPM limit


STEP 4: If you want to assign quota to the existing deployment, click "Edit deployment".


Azure OpenAI service - Click Edit deployment

A dialog box will appear. Set the TPM Limit.

Azure OpenAI service - set the limit

Click "Save and close".


How to Send a Request to Increase the Quota Limits

There are several reasons why you might need to increase your quota limit in Azure OpenAI Service:

  • Increased demand: If you are seeing increased demand for your Azure OpenAI deployments, you may need to increase your quota limit to ensure that your deployments can meet the demand.

  • New applications: If you are developing new applications that use Azure OpenAI, you may need to increase your quota limit to support the development and testing of these applications.

  • Research and development: If you are using Azure OpenAI for research and development, you may need to increase your quota limit to support your experiments.

  • Unexpected usage: If you are experiencing unexpected usage of your Azure OpenAI resources, you may need to increase your quota limit to prevent your deployments from being throttled.

Send a request to increase the quota limit in Azure OpenAI Service

Below are the simple steps to send a request to increase the quota limit in Azure OpenAI Service:


STEP 1: Navigate to Azure OpenAI Studio and click on "Quotas".


STEP 2: Now, select the model for which you want to increase the Quota limit in Azure OpenAI Service. Click on the "Request quota" icon.

Azure OpenAI service - Send request

STEP 3: Now, enter the following details:

  1. Have you already onboarded to Azure OpenAI Service?

  2. Application ID from the original OpenAI Service application (optional)

  3. Your first name

  4. Your last name

  5. Your company email address

  6. Your company name

  7. The subscription ID for the quota being requested

  8. The region where you require increased quota

  9. The model for which you require increased quota

  10. The required TPM quota

  11. A description of your scenario to justify the quota increase for the selected model

Azure OpenAI service - fill form 1

Azure OpenAI service - fill form 2

Azure OpenAI service - fill form 3

Azure OpenAI service - fill form 4

Azure OpenAI service - fill form 5

Azure OpenAI service - fill form 6

STEP 4: Click the "Submit" button.


Best Practices to Stay Within Limits

Stick to these recommended guidelines to stay within your quota limits in Azure OpenAI Service:


Best Practice 1: Implement retry logic in your application

If a request fails due to hitting a rate limit, your application should be programmed to automatically retry the request after a certain amount of time.


Let’s say you have a weather app that makes API calls to get the current weather information. If the app hits a rate limit and receives a 429 Too Many Requests response, it could be programmed to wait for a few seconds and then try the request again. This ensures that the app can still get the weather information even if it temporarily hits a rate limit.
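That pattern can be sketched in a few lines. This is a minimal, self-contained example (plain Python; `RateLimitedError` and `flaky_weather_call` are hypothetical stand-ins for a real HTTP client and a real 429 response), using exponential backoff with a little jitter:

```python
import random
import time


class RateLimitedError(Exception):
    """Stand-in for an HTTP 429 (Too Many Requests) response."""


def with_retries(call, max_attempts=5, base_delay=1.0):
    """Call `call`, retrying with exponential backoff plus jitter
    whenever it raises RateLimitedError (i.e. hits a rate limit)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Wait roughly 1s, 2s, 4s, ... before trying again.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))


# A fake weather call that is rate-limited twice before succeeding:
attempts = {"count": 0}

def flaky_weather_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitedError("429 Too Many Requests")
    return "sunny, 22 C"

print(with_retries(flaky_weather_call, base_delay=0.01))  # succeeds on the third try
```

Backoff with jitter spreads out retries from many clients, which avoids a thundering-herd of simultaneous retry attempts against the same endpoint.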


Best Practice 2: Avoid sharp changes in the workload

You should avoid suddenly increasing the number of requests your application is making. Suppose you have an e-commerce website that typically gets 100 orders per minute. During a big sale, you might expect this to increase. Instead of allowing it to suddenly jump to 1000 orders per minute, you could use techniques like gradually increasing server capacity or using a queue system to handle the increased load over time.


Best Practice 3: Increase the workload gradually

This is related to the previous point. If you need to increase the number of requests your application is making, do it gradually over a period of time.


Following on from the previous example, instead of allowing the number of orders per minute to jump from 100 to 1000 during the big sale, you could increase it gradually. For example, you might increase it to 200 orders per minute in the first hour of the sale, then 300 orders per minute in the second hour, and so on.
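The ramp described above is easy to plan programmatically. A short sketch (the function and the orders-per-minute figures are illustrative, continuing the e-commerce example):

```python
def ramp_schedule(start, target, steps):
    """Plan a linear ramp of a per-minute budget from `start` to
    `target` over `steps` equal increments (e.g. one per hour)."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    increment = (target - start) / steps
    return [round(start + increment * (i + 1)) for i in range(steps)]

# Ramping from 100 to 1000 orders per minute over 9 hourly steps:
print(ramp_schedule(100, 1000, 9))
# [200, 300, 400, 500, 600, 700, 800, 900, 1000]
```

Each value in the returned list would become the budget enforced during the corresponding hour of the sale.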


Best Practice 4: Test different load increase patterns

This means that you should experiment with different ways of increasing the number of requests your application makes to find what works best for your specific situation.


In this case, you could experiment with different ways of handling increased load on your e-commerce website during the big sale. For example, you might find that increasing server capacity slowly over several hours results in fewer issues than increasing it quickly in a short burst.


Best Practice 5: Increase the quota assigned to your deployment

If you find that your application consistently needs more capacity than your current quota allows, you can request an increase. For example, if your current quota is 1,000 tokens per minute but your application consistently needs 1,200 tokens per minute, you could request an increase to 1,500 tokens per minute.


Best Practice 6: Move quota from another deployment, if necessary

If you have multiple deployments and one is not using its full quota, you can reallocate some of its quota to another deployment that needs it.


For example, if Deployment A is assigned 1,000 tokens per minute but only uses 500, and Deployment B is assigned 1,000 tokens per minute but needs 1,200, you could move 200 of Deployment A's quota to Deployment B.


If you have multiple apps using the same API and one app is not using its full quota, you could reallocate some of its quota to another app that needs it. For example, if you have a second app that also provides weather information but has fewer users and is not using its full quota, you could move some of its quota over to your main weather app.


Conclusion

We've learned that quotas in Azure OpenAI Service are like the rules that keep everything in order. They help ensure fair and safe use of AI tools. We've also seen how to use them effectively and how to request more when needed. Remember, whether you're new to AI or a pro, understanding quotas and following best practices is key to getting the most out of this amazing service.
