Rate Limiting & Quota Management
Build a production rate limiter for LLM agent fleets — from the decision of whether to build one at all, through Redis token buckets and priority queuing, to the cache-aware ITPM optimization that multiplies your effective throughput for free.
Quick Reference
- LLM providers enforce three separate dimensions: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). Hitting any one triggers a 429
- Anthropic's cache-aware ITPM: on most current Claude models, input tokens read from the prompt cache do NOT count toward ITPM limits. At an 80% cache hit rate only 20% of your input tokens count, so your effective input throughput is 5× the stated limit
- A Redis token bucket shared across all workers is the only correct architecture: each process sees only its own usage, so per-process tracking always overestimates available capacity
- Priority queues turn 429 errors into delays: interactive requests get served before background jobs, and background jobs wait instead of being dropped
- Model fallback chains degrade gracefully: primary rate-limited → secondary → tertiary, with per-model cooldowns after 429s
- Parse provider response headers (retry-after, anthropic-ratelimit-*) to get exact backoff instead of guessing
- Alert at 80% RPM/ITPM utilization, before you hit the wall rather than after
- Per-user quotas prevent a single runaway script from consuming the entire rate limit budget
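The cache-aware ITPM multiplier and the 80% alert threshold from the list above both reduce to simple arithmetic. The sketch below is illustrative only: the function names and the 100,000 ITPM limit are hypothetical, and it assumes cache reads are fully exempt from ITPM.

```python
def effective_itpm(itpm_limit: int, cache_hit_rate: float) -> float:
    """Total input tokens per minute you can send if cache reads are exempt.

    Only the uncached fraction (1 - cache_hit_rate) counts toward the limit,
    so the total you can push is limit / (1 - cache_hit_rate).
    """
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    return itpm_limit / (1.0 - cache_hit_rate)


def should_alert(used: int, limit: int, threshold: float = 0.8) -> bool:
    """True once utilization of any limit (RPM, ITPM, OTPM) reaches the threshold."""
    return used >= threshold * limit


if __name__ == "__main__":
    limit = 100_000  # hypothetical ITPM limit, not a real tier value
    print(round(effective_itpm(limit, 0.8)))     # about 5x the stated limit
    print(should_alert(used=85_000, limit=limit))
```

With no cache hits the function returns the stated limit unchanged; the multiplier grows quickly as the hit rate approaches 1, which is why stable, cacheable prompt prefixes pay off.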
Should You Build a Custom Rate Limiter?
Most teams reach for a custom Redis rate limiter before they need one. If you have a single provider, three or fewer workers, and no priority ordering requirement, the SDK's built-in retry handles it. Only escalate when you have a genuine multi-worker or multi-provider coordination problem.
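For the simple case, retry with exponential backoff looks roughly like the sketch below. This is not the SDK's actual implementation: `backoff_delay` and its `base` and `cap` defaults are hypothetical, and the header mapping is assumed to use lowercase keys (or to be case-insensitive, as most HTTP client header objects are).

```python
import random
from typing import Mapping, Optional


def backoff_delay(
    attempt: int,
    headers: Optional[Mapping[str, str]] = None,
    base: float = 0.5,
    cap: float = 30.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Prefers the server's retry-after value when it is a plain number of
    seconds; otherwise falls back to capped exponential backoff with jitter.
    """
    if headers is not None:
        retry_after = headers.get("retry-after")
        if retry_after is not None:
            try:
                return max(0.0, float(retry_after))
            except ValueError:
                pass  # HTTP-date form is not handled in this sketch
    # The ceiling doubles each attempt (base, 2*base, 4*base, ...) up to cap.
    ceiling = min(cap, base * (2 ** attempt))
    # "Full jitter": pick uniformly in [0, ceiling] so workers do not retry in lockstep.
    return random.uniform(0.0, ceiling)


if __name__ == "__main__":
    print(backoff_delay(0))                       # somewhere in [0, 0.5]
    print(backoff_delay(3, {"retry-after": "7"}))  # server-provided wait wins
```

The jitter matters more as worker count grows: without it, workers that hit a 429 together retry together and hit it again.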
Build a custom rate limiter only when you need priority queuing across multiple workers or providers
| Scenario | Right solution | Why |
|---|---|---|
| Single provider, 1–3 workers | SDK retry + exponential backoff | Workers are unlikely to collectively exceed limits; coordination overhead isn't worth it |
| Single provider, 4–20 workers, no priorities | aiolimiter (in-process, with the budget split per worker) or a Redis-backed token bucket library built on redis-py | Existing libraries cover this case without custom Lua scripts |
| Multi-provider OR priority queuing needed | Custom Redis rate limiter (this article) | Libraries don't support cross-provider coordination or priority tiers |
| Background-only workloads (no SLA) | Batch API | Asynchronous processing, separate rate limit pool, 50% cost discount on Anthropic |
Anthropic's Message Batches API has its own rate limit pool separate from the Messages API. Background jobs (summarization, indexing, analytics) that can tolerate hours of latency should use the Batch API — they stop competing with interactive traffic entirely.
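The decision table above can be written down as a small function, which is handy as a checklist in a design review. This is a sketch with hypothetical names. Two choices are assumptions rather than something the table states: background-only workloads are checked first, and a single-provider fleet of more than 20 workers falls through to the custom limiter.

```python
def choose_strategy(
    workers: int,
    providers: int = 1,
    needs_priority: bool = False,
    background_only: bool = False,
) -> str:
    """Map a deployment shape to one of the table's four solutions."""
    if workers < 1 or providers < 1:
        raise ValueError("workers and providers must be at least 1")
    if background_only:
        # No SLA: move the work off the interactive rate limit pool entirely.
        return "batch_api"
    if providers > 1 or needs_priority:
        return "custom_redis_limiter"
    if workers <= 3:
        return "sdk_retry_with_backoff"
    if workers <= 20:
        return "library_token_bucket"
    # Assumption: the table stops at 20 workers; treat larger fleets as custom.
    return "custom_redis_limiter"


if __name__ == "__main__":
    print(choose_strategy(workers=2))
    print(choose_strategy(workers=12))
    print(choose_strategy(workers=12, needs_priority=True))
    print(choose_strategy(workers=50, background_only=True))
```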