By The Tech Platform

Quotas in Azure OpenAI Service: Best Practices and Limit Increase Requests

Quotas are the rules that help Azure OpenAI Service run smoothly: they ensure that the AI models the service offers are shared fairly and used safely. In this article, we'll explain what quotas are, how to use them well, and what to do if you need more. Whether you're an experienced AI developer or just starting out, knowing how quotas work and following a few good practices to stay within them is essential to making the most of this powerful service.


Table of Contents:

What is Quota in Azure OpenAI Service?

What is TPM?

The Default Quota Limits

Updated Quota Limits

How to Use Quota?

How to Assign Quota to Deployment?

How to Send a Request to Increase the Quota Limits

Best Practices to Stay Within Limits

Conclusion


What is Quota in Azure OpenAI Service?

Quota in Azure OpenAI Service is a feature that enables you to manage the rate at which your deployments can consume Azure OpenAI resources. Quota is assigned to your subscription on a per-region, per-model basis in units of Tokens-per-Minute (TPM). When you create a deployment, you assign it a TPM value. The total TPM assigned to all of your deployments in a region cannot exceed your quota for that region.
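This rule amounts to a simple sum check. Here is a minimal sketch in plain Python (the function name and numbers are illustrative, not part of any Azure SDK):

```python
def can_create_deployment(existing_tpm, new_tpm, regional_quota):
    """A new deployment only fits if the total TPM assigned to all
    deployments in the region stays within that region's quota."""
    return sum(existing_tpm) + new_tpm <= regional_quota

# Regional quota of 240K TPM, with deployments already assigned 100K and 80K:
print(can_create_deployment([100_000, 80_000], 60_000, 240_000))  # True
print(can_create_deployment([100_000, 80_000], 70_000, 240_000))  # False
```

The Azure portal performs the equivalent check for you when you try to create or edit a deployment.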


What is TPM?

TPM stands for Tokens-per-Minute, the unit of measurement used to define the rate at which Azure OpenAI resources can be consumed. As described above, each deployment is assigned a TPM value when it is created, and the total TPM across all deployments in a region cannot exceed that region's quota.


TPM is used to control the cost and performance of Azure OpenAI deployments. By limiting the TPM assigned to a deployment, you can prevent it from consuming too many resources, which could lead to unexpected costs. You can also use TPM to ensure that all of your deployments have access to the resources they need, which can help to improve performance.
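The service enforces TPM limits on its side, but it can also help to track token spend on the client. Below is a rough sketch (plain Python; the `TpmBudget` class and its numbers are illustrative, not part of any Azure SDK) of tracking tokens consumed in a sliding 60-second window against a deployment's TPM value:

```python
import time
from collections import deque
from typing import Optional


class TpmBudget:
    """Tracks tokens consumed in a sliding 60-second window
    against a Tokens-per-Minute (TPM) budget."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.events = deque()  # (timestamp, tokens) pairs

    def _prune(self, now: float) -> None:
        # Drop spend records older than 60 seconds.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def used(self, now: Optional[float] = None) -> int:
        now = time.monotonic() if now is None else now
        self._prune(now)
        return sum(tokens for _, tokens in self.events)

    def try_spend(self, tokens: int, now: Optional[float] = None) -> bool:
        """Record the spend if it fits in the budget; otherwise refuse."""
        now = time.monotonic() if now is None else now
        if self.used(now) + tokens > self.tpm_limit:
            return False
        self.events.append((now, tokens))
        return True


# Example: a deployment assigned 240K TPM.
budget = TpmBudget(tpm_limit=240_000)
print(budget.try_spend(200_000, now=0.0))   # True: fits in the budget
print(budget.try_spend(50_000, now=1.0))    # False: would exceed 240K in the window
print(budget.try_spend(50_000, now=61.0))   # True: the old spend has aged out
```

A client-side check like this lets an application queue or shed work before the service starts returning throttling errors.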


Here are some examples of how you can use quota to manage your Azure OpenAI resources:

  • You can use quota to limit the rate at which your deployments can generate text, translate languages, or answer questions.

  • You can use quota to prevent your deployments from consuming too many resources, which could lead to unexpected costs.

  • You can use quota to prioritize the use of Azure OpenAI resources for specific deployments or applications.

Benefits of using quota in Azure OpenAI Service:

  • Cost control: Quota can help you to control your costs by preventing your deployments from consuming too many resources.

  • Performance optimization: Quota can help you to optimize the performance of your applications by ensuring that they have access to the resources they need.

  • Fairness and equity: Quota can help to ensure that all users of Azure OpenAI Service have fair and equitable access to resources.



The Default Quota Limits

The default quota limits for the Azure OpenAI Service are as follows:

  • OpenAI resources per region per Azure subscription: 30

  • Default DALL-E quota limits: 2 concurrent requests

  • Maximum prompt tokens per request: Varies per model

  • Max fine-tuned model deployments: 2

  • Total number of training jobs per resource: 100

  • Max simultaneous running training jobs per resource: 1

  • Max training jobs queued: 20

  • Max Files per resource: 30

  • Total size of all files per resource: 1 GB

  • Max training job time (job will fail if exceeded): 720 hours

  • Max training job size (tokens in training file) x (# of epochs): 2 Billion

  • Max size of all files per upload (Azure OpenAI on your data): 16 MB
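One of these limits lends itself to a quick arithmetic check before submitting a fine-tuning job. A small sketch (the function name is illustrative, not an SDK call):

```python
MAX_JOB_SIZE = 2_000_000_000  # tokens-in-file x epochs cap from the list above

def fine_tune_job_ok(training_file_tokens: int, epochs: int) -> bool:
    """Check the 'max training job size' rule:
    (tokens in training file) x (number of epochs) must not exceed 2 billion."""
    return training_file_tokens * epochs <= MAX_JOB_SIZE

# A 500M-token file for 4 epochs is exactly at the 2B cap, so it is allowed:
print(fine_tune_job_ok(500_000_000, 4))  # True
# The same file for 5 epochs would exceed the cap:
print(fine_tune_job_ok(500_000_000, 5))  # False
```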


Updated Quota Limits

Below is the updated quota limits table, giving an overview of the latest per-model, per-region TPM allocations in Azure OpenAI Service.

| Model Name | Regions | TPM Limit |
| --- | --- | --- |
| gpt-35-turbo | East US, South Central US, West Europe, France Central, UK South | 240K |
| gpt-35-turbo | North Central US, Australia East, East US 2, Canada East, Japan East, Sweden Central, Switzerland North | 300K |
| gpt-35-turbo-16k | East US, South Central US, West Europe, France Central, UK South | 240K |
| gpt-35-turbo-16k | North Central US, Australia East, East US 2, Canada East, Japan East, Sweden Central, Switzerland North | 300K |
| gpt-4 | East US, South Central US, West Europe, France Central | 20K |
| gpt-4 | North Central US, Australia East, East US 2, Canada East, Japan East, UK South, Sweden Central, Switzerland North | 40K |
| gpt-4-32k | East US, South Central US, West Europe, France Central | 60K |
| gpt-4-32k | North Central US, Australia East, East US 2, Canada East, Japan East, UK South, Sweden Central, Switzerland North | 80K |
| text-embedding-ada-002 | East US, South Central US, West Europe, France Central | 240K |
| text-embedding-ada-002 | North Central US, Australia East, East US 2, Canada East, Japan East, UK South, Switzerland North | 350K |


How to Use Quota?

To use Quotas in Azure OpenAI Service, you need to create a resource. Once you have created an Azure OpenAI resource, you can view your current quota and usage for each region and model in the Azure portal.



How to Assign Quota to Deployment?

After the resource is created, it appears on the Azure portal home page; open it from there.


Follow the below steps to assign Quota to Deployment in Azure OpenAI:


STEP 1: Click on the resource you created.


STEP 2: Now, click on the "Deploy".

Azure OpenAI service - Click deploy

STEP 3: To assign a quota to a new deployment, click "+ Create new deployment".


Azure OpenAI service - Set the TPM limit


STEP 4: If you want to assign quota to the existing deployment, click "Edit deployment".


Azure OpenAI service - Click Edit deployment

A dialog box will appear. Set the TPM Limit.

Azure OpenAI service - set the limit

Click "Save and close".


How to Send a Request to Increase the Quota Limits

There are several reasons why you might need to increase your quota limit in Azure OpenAI Service:

  • Increased demand: If you are seeing increased demand for your Azure OpenAI deployments, you may need to increase your quota limit to ensure that your deployments can meet the demand.

  • New applications: If you are developing new applications that use Azure OpenAI, you may need to increase your quota limit to support the development and testing of these applications.

  • Research and development: If you are using Azure OpenAI for research and development, you may need to increase your quota limit to support your experiments.

  • Unexpected usage: If you are experiencing unexpected usage of your Azure OpenAI resources, you may need to increase your quota limit to prevent your deployments from being throttled.

Send a request to increase the quota limit in Azure OpenAI Service

Below are the simple steps to send a request to increase the quota limit in Azure OpenAI Service:


STEP 1: Navigate to Azure OpenAI Studio and click on "Quotas".


STEP 2: Now, select the model for which you want to increase the Quota limit in Azure OpenAI Service. Click on the "Request quota" icon.

Azure OpenAI service - Send request

STEP 3: Now, enter the following details:

  1. Have you already onboarded to Azure OpenAI Service?

  2. Application ID from the original OpenAI Service application (optional)

  3. Your first name

  4. Your last name

  5. Your company email address

  6. Your company name

  7. The subscription ID for the quota being requested

  8. The region where you require increased quota

  9. The model for which you require increased quota

  10. The required TPM quota

  11. A description of your scenario to justify the quota increase for the selected model

Azure OpenAI service - fill form 1

Azure OpenAI service - fill form 2

Azure OpenAI service - fill form 3

Azure OpenAI service - fill form 4

Azure OpenAI service - fill form 5

Azure OpenAI service - fill form 6

STEP 4: Click the "Submit" button.


Best Practices to Stay Within Limits

Stick to these recommended guidelines to stay within your quota limits in Azure OpenAI Service:


Best Practice 1: Implement retry logic in your application

If a request fails due to hitting a rate limit, your application should be programmed to automatically retry the request after a certain amount of time.


Let’s say you have a weather app that makes API calls to get the current weather information. If the app hits a rate limit and receives a 429 Too Many Requests response, it could be programmed to wait for a few seconds and then try the request again. This ensures that the app can still get the weather information even if it temporarily hits a rate limit.
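That pattern can be sketched in a few lines. This is a minimal, self-contained example (plain Python; `RateLimitedError` and `flaky_weather_call` are hypothetical stand-ins for a real HTTP client and a real 429 response), using exponential backoff with a little jitter:

```python
import random
import time


class RateLimitedError(Exception):
    """Stand-in for an HTTP 429 (Too Many Requests) response."""


def with_retries(call, max_attempts=5, base_delay=1.0):
    """Call `call`, retrying with exponential backoff plus jitter
    whenever it raises RateLimitedError (i.e. hits a rate limit)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Wait roughly 1s, 2s, 4s, ... before trying again.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))


# A fake weather call that is rate-limited twice before succeeding:
attempts = {"count": 0}

def flaky_weather_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitedError("429 Too Many Requests")
    return "sunny, 22 C"

print(with_retries(flaky_weather_call, base_delay=0.01))  # succeeds on the third try
```

Backoff with jitter spreads out retries from many clients, which avoids a thundering-herd of simultaneous retry attempts against the same endpoint.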


Best Practice 2: Avoid sharp changes in the workload

You should avoid suddenly increasing the number of requests your application is making. Suppose you have an e-commerce website that typically gets 100 orders per minute. During a big sale, you might expect this to increase. Instead of allowing it to suddenly jump to 1000 orders per minute, you could use techniques like gradually increasing server capacity or using a queue system to handle the increased load over time.


Best Practice 3: Increase the workload gradually

This is related to the previous point. If you need to increase the number of requests your application is making, do it gradually over a period of time.


Following on from the previous example, instead of allowing the number of orders per minute to jump from 100 to 1000 during the big sale, you could increase it gradually. For example, you might increase it to 200 orders per minute in the first hour of the sale, then 300 orders per minute in the second hour, and so on.
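The ramp described above is easy to plan programmatically. A short sketch (the function and the orders-per-minute figures are illustrative, continuing the e-commerce example):

```python
def ramp_schedule(start, target, steps):
    """Plan a linear ramp of a per-minute budget from `start` to
    `target` over `steps` equal increments (e.g. one per hour)."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    increment = (target - start) / steps
    return [round(start + increment * (i + 1)) for i in range(steps)]

# Ramping from 100 to 1000 orders per minute over 9 hourly steps:
print(ramp_schedule(100, 1000, 9))
# [200, 300, 400, 500, 600, 700, 800, 900, 1000]
```

Each value in the returned list would become the budget enforced during the corresponding hour of the sale.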


Best Practice 4: Test different load increase patterns

This means that you should experiment with different ways of increasing the number of requests your application makes to find what works best for your specific situation.


In this case, you could experiment with different ways of handling increased load on your e-commerce website during the big sale. For example, you might find that increasing server capacity slowly over several hours results in fewer issues than increasing it quickly in a short burst.


Best Practice 5: Increase the quota assigned to your deployment

If you find that your application consistently needs more capacity than your current quota allows, you can request an increase. For example, if your current quota is 1,000 tokens per minute but your application consistently needs 1,200 tokens per minute, you could request an increase to 1,500 tokens per minute.


Best Practice 6: Move quota from another deployment, if necessary

If you have multiple deployments and one is not using its full quota, you can reallocate some of its quota to another deployment that needs it.


For example, if Deployment A is assigned 1,000 tokens per minute but only uses 500, and Deployment B is assigned 1,000 tokens per minute but needs 1,200, you could move 200 of Deployment A's quota to Deployment B.


If you have multiple apps using the same API and one app is not using its full quota, you could reallocate some of its quota to another app that needs it. For example, if you have a second app that also provides weather information but has fewer users and is not using its full quota, you could move some of its quota over to your main weather app.


Conclusion

We've learned that quotas in Azure OpenAI Service are like the rules that keep everything in order. They help ensure fair and safe use of AI tools. We've also seen how to use them effectively and how to request more when needed. Remember, whether you're new to AI or a pro, understanding quotas and following best practices is key to getting the most out of this amazing service.
