Advanced · 10 min
Rate Limiting & Quota Management
Managing LLM API rate limits across a fleet of agents: request queuing, token bucket algorithms, graceful degradation, and model fallback chains.
Quick Reference
- LLM providers enforce rate limits on requests per minute (RPM) and tokens per minute (TPM) — you need to manage both
- Use a centralized rate limiter (Redis-based token bucket) shared across all worker instances to avoid exceeding provider limits
- Implement request queuing: when the rate limit is reached, queue requests with priority ordering instead of rejecting them
- Model fallback chains: if the primary model is rate-limited, fall back to a secondary model (e.g., Claude Sonnet 4.6 → Haiku 4.5)
- Set per-user quotas to prevent a single power user from consuming the entire rate limit budget
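A fallback chain like the one above can be reduced to a small loop: try each model in priority order and move on when the provider throttles. This is a minimal sketch — the model IDs, the `RateLimitError` class, and the `call_model` callable are illustrative stand-ins, not a specific provider SDK's API.

```python
# Hypothetical fallback chain: try each model in order, falling through
# to the next one when the current model is rate-limited.
class RateLimitError(Exception):
    """Stand-in for a provider's HTTP 429 error."""

FALLBACK_CHAIN = ["claude-sonnet-4-6", "claude-haiku-4-5"]  # primary → secondary

def complete_with_fallback(prompt, call_model, chain=FALLBACK_CHAIN):
    """Try each model in the chain; re-raise only if every model is throttled."""
    last_err = None
    for model in chain:
        try:
            return call_model(model, prompt)
        except RateLimitError as err:
            last_err = err  # this tier is throttled: try the next model
    raise last_err
```

In practice you would also record which tier served the request, since silently degrading to a cheaper model can change output quality.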
Rate Limit Landscape
Two dimensions
LLM providers enforce limits on both requests per minute (RPM) and tokens per minute (TPM). Hitting either one triggers throttling — you must track both simultaneously.
Rate limits vary dramatically by provider, tier, and model. A fleet of 20 agent workers sharing a single API key will collectively exhaust even generous limits within seconds during traffic spikes.
| Provider | Model Tier | RPM | TPM | Daily Limit | Notes |
|---|---|---|---|---|---|
| Anthropic | Claude Sonnet (Tier 1) | 50 | 40,000 | None | Increases with spend history |
| Anthropic | Claude Sonnet (Tier 4) | 4,000 | 400,000 | None | Requires $100+ monthly spend |
| OpenAI | GPT-5.4 (Tier 1) | 500 | 30,000 | None | Based on payment history |
| OpenAI | GPT-5.4 (Tier 5) | 10,000 | 12,000,000 | None | Requires $1,000+ spend |
| Google | Gemini 3.1 Pro (Free) | 2 | 32,000 | 50 req | Strict daily caps |
| Google | Gemini 3.1 Pro (Paid) | 1,000 | 4,000,000 | None | Pay-as-you-go |
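Tracking both dimensions at once means a request is admitted only when the RPM bucket *and* the TPM bucket have capacity. A minimal in-process sketch (class and method names are illustrative; real deployments would share this state, as discussed below):

```python
import time

class DualBucket:
    """Token buckets for both requests/min and tokens/min.
    A request is admitted only when BOTH buckets have capacity."""

    def __init__(self, rpm, tpm):
        self.caps = {"req": float(rpm), "tok": float(tpm)}
        self.level = dict(self.caps)                              # start full
        self.rate = {k: v / 60.0 for k, v in self.caps.items()}   # refill per second
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        dt, self.last = now - self.last, now
        for k in self.level:
            self.level[k] = min(self.caps[k], self.level[k] + self.rate[k] * dt)

    def try_acquire(self, est_tokens):
        """Reserve one request and est_tokens tokens, or admit nothing."""
        self._refill()
        if self.level["req"] >= 1 and self.level["tok"] >= est_tokens:
            self.level["req"] -= 1
            self.level["tok"] -= est_tokens
            return True
        return False  # caller should queue or back off, not retry in a tight loop
```

Note that `est_tokens` must be estimated before the call (prompt tokens plus an output allowance); reconciling against the provider's reported usage afterward keeps the estimate honest.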
Shared keys
All workers using the same API key share one rate limit pool. Twenty workers each sending 5 RPM = 100 RPM total against a 50 RPM limit.
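The fix is to move the bucket out of the workers and into one shared store, so the whole fleet draws from a single pool. The core of such a limiter is one atomic check-and-take step; in production this would typically run as a Lua script on Redis (via `EVAL`) so concurrent workers cannot race each other. The sketch below uses a plain dict as a stand-in for the Redis hash, and the key name is illustrative.

```python
import time

def take(store, key, rpm, now=None):
    """Atomic check-and-take against a shared token bucket.
    Refills the bucket from elapsed time, then takes one request slot
    if available. Returns True when the caller may send a request."""
    now = time.monotonic() if now is None else now
    bucket = store.setdefault(key, {"level": float(rpm), "ts": now})
    # Refill proportionally to time elapsed since the last call, capped at rpm.
    elapsed = now - bucket["ts"]
    bucket["level"] = min(float(rpm), bucket["level"] + elapsed * rpm / 60.0)
    bucket["ts"] = now
    if bucket["level"] >= 1:
        bucket["level"] -= 1
        return True
    return False
```

Because every worker calls `take` against the same key, twenty workers collectively stay under the provider limit instead of each assuming it has the full budget to itself.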