
Rate Limiting & Quota Management

Managing LLM API rate limits across a fleet of agents: request queuing, token bucket algorithms, graceful degradation, and model fallback chains.

Quick Reference

  • LLM providers enforce rate limits on requests per minute (RPM) and tokens per minute (TPM) — you need to manage both
  • Use a centralized rate limiter (Redis-based token bucket) shared across all worker instances to avoid exceeding provider limits
  • Implement request queuing: when the rate limit is reached, queue requests with priority ordering instead of rejecting them
  • Model fallback chains: if the primary model is rate-limited, fall back to a secondary model (e.g., Claude Sonnet 4.6 → Haiku 4.5); a sketch follows this list
  • Set per-user quotas to prevent a single power user from consuming the entire rate limit budget
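
To make the fallback bullet concrete, here is a minimal sketch using the Anthropic Python SDK. The `call_with_fallback` helper and the model ID strings are illustrative placeholders, not an official API:

```python
import anthropic

# Ordered fallback chain: try the primary model first, then a cheaper
# model with separate limits. Model IDs are illustrative placeholders.
FALLBACK_CHAIN = ["claude-sonnet-4-6", "claude-haiku-4-5"]

client = anthropic.Anthropic()

def call_with_fallback(messages, max_tokens=1024):
    """Try each model in the chain; move on when one is rate-limited."""
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return client.messages.create(
                model=model,
                max_tokens=max_tokens,
                messages=messages,
            )
        except anthropic.RateLimitError as exc:
            # 429 from the provider: record it and try the next model.
            last_error = exc
    # Every model in the chain was throttled; surface the last 429 so
    # the caller can queue the request or retry with backoff.
    raise last_error
```

Note that falling back to a different model trades answer quality for availability, so chains are usually ordered from most to least capable.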

Rate Limit Landscape

Two dimensions

LLM providers enforce limits on both requests per minute (RPM) and tokens per minute (TPM). Hitting either one triggers throttling — you must track both simultaneously.
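A minimal in-process sketch of dual-dimension tracking follows. The `DualLimiter` class and its 60-second sliding window are illustrative assumptions; token usage has to be estimated before the call, since actual usage is only known from the response:

```python
import time

class DualLimiter:
    """Track request and token budgets over a rolling 60 s window.
    A request may proceed only if *both* budgets have room."""

    def __init__(self, rpm: int, tpm: int):
        self.rpm, self.tpm = rpm, tpm
        self.events: list[tuple[float, int]] = []  # (timestamp, tokens)

    def try_acquire(self, estimated_tokens: int) -> bool:
        now = time.monotonic()
        # Drop events that have aged out of the 60-second window.
        self.events = [(t, tok) for t, tok in self.events if now - t < 60]
        requests_used = len(self.events)
        tokens_used = sum(tok for _, tok in self.events)
        # Exhausting either dimension blocks the request.
        if requests_used >= self.rpm or tokens_used + estimated_tokens > self.tpm:
            return False
        self.events.append((now, estimated_tokens))
        return True

# Tier 1-style limits from the table below: 50 RPM, 40,000 TPM.
limiter = DualLimiter(rpm=50, tpm=40_000)
if limiter.try_acquire(estimated_tokens=2_000):
    ...  # safe to send the request
```

An in-process limiter like this only protects a single worker; the shared variant needed for a fleet is sketched under "Shared keys" below.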

Rate limits vary dramatically by provider, tier, and model. A fleet of 20 agent workers sharing a single API key will collectively exhaust even generous limits within seconds during traffic spikes.

Provider  | Model Tier             | RPM    | TPM        | Daily Limit | Notes
Anthropic | Claude Sonnet (Tier 1) | 50     | 40,000     | None        | Increases with spend history
Anthropic | Claude Sonnet (Tier 4) | 4,000  | 400,000    | None        | Requires $100+ monthly spend
OpenAI    | GPT-5.4 (Tier 1)       | 500    | 30,000     | None        | Based on payment history
OpenAI    | GPT-5.4 (Tier 5)       | 10,000 | 12,000,000 | None        | Requires $1,000+ spend
Google    | Gemini 3.1 Pro (Free)  | 2      | 32,000     | 50 req      | Strict daily caps
Google    | Gemini 3.1 Pro (Paid)  | 1,000  | 4,000,000  | None        | Pay-as-you-go
Shared keys

All workers using the same API key share one rate limit pool. Twenty workers each sending 5 RPM = 100 RPM total against a 50 RPM limit.
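
A minimal sketch of a shared limiter follows. For simplicity it uses per-minute fixed-window counters in Redis rather than the full token bucket mentioned in the Quick Reference; the key names and the `acquire_shared` helper are illustrative, and a Lua-scripted token bucket would smooth bursts better:

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def acquire_shared(rpm: int, tpm: int, estimated_tokens: int) -> bool:
    """Fixed-window limiter shared by every worker on the same API key.
    All workers INCR the same per-minute Redis keys, so the fleet as a
    whole stays under the provider's limit."""
    window = int(time.time() // 60)  # current minute, same for all workers
    req_key = f"ratelimit:req:{window}"
    tok_key = f"ratelimit:tok:{window}"

    pipe = r.pipeline()
    pipe.incr(req_key)
    pipe.incrby(tok_key, estimated_tokens)
    pipe.expire(req_key, 120)  # old windows expire on their own
    pipe.expire(tok_key, 120)
    requests_used, tokens_used, _, _ = pipe.execute()

    if requests_used > rpm or tokens_used > tpm:
        # Over budget: undo our reservation and tell the caller to queue.
        undo = r.pipeline()
        undo.decr(req_key)
        undo.decrby(tok_key, estimated_tokens)
        undo.execute()
        return False
    return True
```

When `acquire_shared` returns False, the worker should enqueue the request rather than drop it, which is where the priority queuing from the Quick Reference comes in.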