OpenAI Service – Consider TPM Limits

Ensure that the OpenAI Tokens-Per-Minute (TPM) limits meet your organization’s specific requirements based on the model and SKU being used. For example, ‘gpt-4:Standard:10’ means that GPT-4 models using the Standard SKU should be limited to at most 10K TPM.

Managing TPM limits appropriately helps control costs, prevent unexpected billing spikes, and ensure consistent availability of AI resources for your applications.

Tokens-Per-Minute (TPM) limits determine how many tokens your application can send to OpenAI’s API within a minute. Think of tokens as pieces of words – most words in English are 1-2 tokens, with roughly 4 characters per token on average. These limits serve multiple purposes:

  • Cost control – By setting appropriate TPM limits, you prevent unintended overuse that could lead to excessive charges

  • Resource allocation – Ensures fair distribution of AI computing resources across your organization

  • Performance planning – Helps predict and manage application performance based on AI response needs

Without proper TPM limits, applications can potentially consume more tokens than anticipated, leading to significant unexpected costs, especially during traffic spikes or in case of application issues like infinite loops.
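As a rough sanity check, the TPM a workload needs can be estimated from its request rate and typical prompt/response sizes. The sketch below uses the ~4 characters per token heuristic mentioned above; the request rate and sizes are illustrative, and an exact count would require a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def required_tpm(requests_per_minute: int,
                 avg_prompt_chars: int,
                 avg_response_tokens: int) -> int:
    """Estimate tokens-per-minute needed, counting both prompt and completion tokens."""
    prompt_tokens = avg_prompt_chars // 4
    return requests_per_minute * (prompt_tokens + avg_response_tokens)

# Example: 30 requests/min, 2,000-character prompts, ~500-token responses
print(required_tpm(30, 2000, 500))  # -> 30000, i.e. roughly a 30K TPM requirement
```

Comparing an estimate like this against your configured limit is a quick way to spot large gaps in either direction.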

Cost Impact

TPM limits directly affect your OpenAI API costs in several ways:

  • Overprovisioning – Setting TPM limits much higher than needed means paying for unused capacity

  • Underprovisioning – Setting limits too low may impact application performance and user experience

  • Cost predictability – Appropriate limits make budgeting and forecasting more accurate

Potential Savings

Consider a scenario where you’re using GPT-4 with Standard SKU pricing:

| TPM Setting | Monthly Token Usage | Approximate Monthly Cost |
|---|---|---|
| 100K TPM | 4.32B tokens | $86,400 |
| 10K TPM | 432M tokens | $8,640 |
| 1K TPM | 43.2M tokens | $864 |

Reducing an unnecessarily high TPM limit from 100K to 10K could save approximately $77,760 per month if your actual usage requirements are closer to the lower limit.

Even a more modest reduction from 50K to 40K TPM could save $17,280 monthly – significant savings that directly impact your bottom line.
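The figures above follow from simple arithmetic: tokens per minute, times minutes in a month, times the per-token rate. This sketch reproduces them, assuming the blended rate of $0.02 per 1K tokens that the table implies (actual GPT-4 pricing differs between prompt and completion tokens and by region):

```python
MINUTES_PER_MONTH = 60 * 24 * 30   # 43,200 minutes in a 30-day month
BLENDED_RATE_PER_1K = 0.02         # assumed blended $/1K tokens, for illustration

def max_monthly_cost(tpm: int) -> float:
    """Worst-case monthly cost if the deployment runs at its TPM limit continuously."""
    monthly_tokens = tpm * MINUTES_PER_MONTH
    return monthly_tokens / 1000 * BLENDED_RATE_PER_1K

for tpm in (100_000, 10_000, 1_000):
    print(f"{tpm:>7} TPM -> ${max_monthly_cost(tpm):,.0f}/month")
# 100,000 TPM -> $86,400/month; 10,000 -> $8,640; 1,000 -> $864
```

Note these are ceilings, not forecasts: they assume the deployment saturates its limit every minute, which is exactly why an oversized limit represents risk rather than guaranteed spend.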

Implementation Guide

Infrastructure-as-Code Example (Terraform)

Problematic configuration:

```hcl
resource "azurerm_cognitive_account" "openai" {
  name                = "my-openai-service"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  kind                = "OpenAI"
  sku_name            = "S0"
}

resource "azurerm_cognitive_deployment" "gpt4" {
  name                 = "gpt-4-deployment"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "gpt-4"
    version = "0613"
  }

  scale {
    type     = "Standard"
    capacity = 120 # 120K TPM - potentially excessive
  }
}
```

Improved configuration:

```hcl
resource "azurerm_cognitive_account" "openai" {
  name                = "my-openai-service"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  kind                = "OpenAI"
  sku_name            = "S0"
}

resource "azurerm_cognitive_deployment" "gpt4" {
  name                 = "gpt-4-deployment"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "gpt-4"
    version = "0613"
  }

  scale {
    type     = "Standard"
    capacity = 10 # 10K TPM - a more reasonable limit
  }
}
```
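Configurations like the first one can be flagged with a lightweight scan of your Terraform source, assuming deployments follow the `capacity = N` pattern shown above (Infracost does this more robustly as part of policy scanning; the threshold here is an assumption to tune):

```python
import re
from pathlib import Path

CAPACITY_RE = re.compile(r"capacity\s*=\s*(\d+)")
THRESHOLD = 20  # flag anything above 20K TPM; adjust to your organization's needs

def find_high_tpm(tf_source: str, threshold: int = THRESHOLD) -> list[int]:
    """Return capacity values in Terraform source that exceed the threshold."""
    return [int(m) for m in CAPACITY_RE.findall(tf_source) if int(m) > threshold]

def scan_directory(root: str) -> dict[str, list[int]]:
    """Scan all .tf files under root and report deployments with excessive TPM."""
    findings = {}
    for tf_file in Path(root).rglob("*.tf"):
        hits = find_high_tpm(tf_file.read_text())
        if hits:
            findings[str(tf_file)] = hits
    return findings
```

A regex scan is deliberately crude – it cannot tell which resource a `capacity` belongs to – but it is often enough to surface candidates for review.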

Step-by-Step Instructions

For Terraform Configurations

1. Identify current TPM settings: Use Infracost to scan your infrastructure code for OpenAI deployments with potentially excessive TPM limits.

2. Evaluate actual usage requirements: Review application logs and OpenAI usage metrics to determine actual token consumption patterns.

3. Update configuration files: Modify the capacity value in your Terraform configuration to reflect appropriate limits.

4. Validate changes: Run `terraform plan` to verify the changes will be applied correctly.

5. Apply the changes: Use `terraform apply` to update your infrastructure.

6. Monitor impact: After implementation, track both performance and cost metrics to confirm the new limits are appropriate.
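Step 2 above, translated into code: given observed per-minute token counts, pick a limit that covers the peak with headroom. This is a sketch; the 1.5x headroom factor and the rounding granularity are assumptions to adjust for your workload:

```python
import math

def recommend_tpm(observed_tokens_per_minute: list[int],
                  headroom: float = 1.5,
                  granularity_k: int = 1) -> int:
    """Recommend a TPM limit, in thousands, covering peak usage plus headroom."""
    peak = max(observed_tokens_per_minute)
    needed = peak * headroom
    # Azure expresses capacity in units of 1K TPM; round up to the granularity
    return math.ceil(needed / (granularity_k * 1000)) * granularity_k

# Example: observed peak of ~2K tokens/min -> recommend 3 (i.e. a 3K TPM limit)
print(recommend_tpm([800, 1200, 2000, 1500]))
```

The recommended value then becomes the `capacity` setting in your Terraform configuration.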

For Azure Portal (Manual Configuration)

1. Navigate to your Azure OpenAI resource in the Azure portal

2. Select “Model deployments” from the left menu

3. Click on the specific model deployment you want to adjust

4. Under “Capacity (TPM)”, select the appropriate value

5. Save your changes

Best Practices

  • Start conservatively: Begin with lower TPM limits and increase only as needed

  • Monitor usage patterns: Regularly review token usage to identify trends and adjust limits accordingly

  • Implement tiered limits: Consider different TPM limits for development, testing, and production environments

  • Create alerts: Set up monitoring alerts for when your usage approaches configured limits

  • Document decisions: Keep records of TPM limit decisions and their rationale for future reference

  • Regular reviews: Schedule quarterly reviews of TPM limits as part of your FinOps practice
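The “create alerts” practice can be prototyped as a simple threshold check over usage samples before wiring it into Azure Monitor. The 80% warning ratio and message wording here are illustrative:

```python
def usage_alerts(samples_tpm: list[int], limit_tpm: int,
                 warn_ratio: float = 0.8) -> list[str]:
    """Return warning messages for samples that approach or exceed the TPM limit."""
    alerts = []
    for minute, used in enumerate(samples_tpm):
        ratio = used / limit_tpm
        if ratio >= 1.0:
            alerts.append(f"minute {minute}: {used} TPM exceeded limit {limit_tpm}")
        elif ratio >= warn_ratio:
            alerts.append(f"minute {minute}: {used} TPM at {ratio:.0%} of limit")
    return alerts

# Example: a 10K TPM limit with one warning-level and one over-limit sample
print(usage_alerts([5000, 8500, 10200], limit_tpm=10_000))
```

Sustained warnings suggest the limit is sized correctly and nearly consumed; silence at a fraction of the limit suggests room to reduce it.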

Tools and Scripts

  • Infracost: Scan your infrastructure-as-code to identify potential cost optimization opportunities, including OpenAI TPM limits.

  • Azure Monitor: Create custom dashboards to track OpenAI usage against configured limits

  • Terraform state analysis: Use terraform state show commands to audit current configuration

  • PowerShell scripts: Automate the collection of usage statistics across multiple OpenAI deployments

Examples

Scenario 1: Development Environment Optimization

A development team initially set up a GPT-4 deployment with 50K TPM for their test environment. By analyzing actual usage patterns, they discovered peak usage never exceeded 2K TPM. By reducing the limit to 5K TPM (still providing headroom), they reduced their potential maximum monthly costs from $43,200 to $4,320.

Scenario 2: Production Scaling Strategy

An e-commerce company uses GPT-4 for customer service automation. Initially, they set a 100K TPM limit to handle potential traffic spikes. After implementing a more sophisticated scaling strategy with automated TPM adjustments based on time-of-day patterns, they reduced average TPM to 30K, saving approximately $50,400 monthly while maintaining service levels.
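The time-of-day strategy from Scenario 2 can be sketched as a schedule lookup; applying the returned value would mean updating the deployment’s capacity through your IaC pipeline or the Azure API. The hours and tiers below are assumptions, not the company’s actual schedule:

```python
def scheduled_tpm(hour: int) -> int:
    """Return the TPM limit, in thousands, for a given hour of day (0-23)."""
    if 9 <= hour < 18:                    # business hours: full capacity
        return 60
    if 6 <= hour < 9 or 18 <= hour < 22:  # shoulder hours: reduced capacity
        return 30
    return 10                             # overnight: minimal capacity

# Average capacity across a full day under this schedule
average = sum(scheduled_tpm(h) for h in range(24)) / 24
print(f"average scheduled capacity: {average:.1f}K TPM")
```

Because Standard-SKU billing is usage-based, the saving comes from capping worst-case consumption during quiet hours rather than from the schedule itself.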

Scenario 3: Multi-Environment Management

A financial services organization maintained identical 80K TPM limits across development, staging, and production environments. By implementing environment-appropriate limits (5K for development, 20K for staging, 60K for production), they reduced overall OpenAI costs by 42% without affecting application performance.

Considerations and Caveats

When Higher TPM Limits May Be Justified

  • Mission-critical applications: Systems where AI response time is directly tied to user experience or business outcomes

  • High-traffic consumer applications: Services with unpredictable usage spikes that must maintain responsiveness

  • Batch processing workloads: Applications that need to process large volumes of data in short time windows

Implementation Challenges

  • Accurate forecasting: Predicting appropriate TPM limits requires good historical data

  • Application design: Some applications may need refactoring to handle TPM limits gracefully

  • Regional variations: Different regions may have different TPM limit availability

  • Model changes: Upgrading to newer model versions may require TPM limit reconsideration

Risk Mitigation Strategies

  • Implement circuit breakers: Design applications to degrade gracefully when approaching TPM limits

  • Queue-based architecture: Buffer requests during usage spikes for processing when capacity is available

  • Hybrid approaches: Consider using different models with different cost structures for various use cases

  • Auto-scaling policies: Some environments support dynamic TPM scaling based on usage patterns
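The queue-plus-rate-limit pattern from the strategies above can be sketched as a token bucket that refills at the TPM rate. This is a minimal in-process example; a production system would typically keep the bucket state in a shared store so all instances respect the same limit:

```python
import time

class TpmLimiter:
    """Token bucket refilled at the TPM rate; callers queue and retry on refusal."""

    def __init__(self, tpm_limit: int):
        self.rate_per_sec = tpm_limit / 60.0
        self.max_budget = float(tpm_limit)
        self.budget = float(tpm_limit)  # start with one minute of budget
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.budget = min(self.max_budget,
                          self.budget + (now - self.last) * self.rate_per_sec)
        self.last = now

    def try_acquire(self, tokens: int) -> bool:
        """Spend tokens if the budget allows; otherwise the caller should back off."""
        self._refill()
        if self.budget >= tokens:
            self.budget -= tokens
            return True
        return False
```

A request that is refused can be placed on a queue and retried once the bucket refills, which smooths spikes instead of surfacing rate-limit errors to users.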

Create Free Account

This policy is supported in Infracost and available in the free trial. Sign up today and scan your code using our entire library of FinOps policies.

© 2026 Infracost Inc