Production & Scale/Infrastructure
Advanced14 min

Rate Limiting & Quota Management

Build a production rate limiter for LLM agent fleets — from the decision of whether to build one at all, through Redis token buckets and priority queuing, to the cache-aware ITPM optimization that multiplies your effective throughput for free.

Quick Reference

  • LLM providers enforce three separate dimensions: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM) — hitting any one triggers a 429
  • Anthropic's cache-aware ITPM: cached input tokens do NOT count toward ITPM limits. At 80% cache hit rate, your effective throughput is 5× the stated limit
  • A Redis token bucket shared across all workers is the only correct architecture — per-process tracking always overestimates available capacity
  • Priority queues turn 429 errors into delays: interactive requests get served before background jobs, never dropped
  • Model fallback chains degrade gracefully: primary rate-limited → secondary → tertiary, with per-model cooldowns after 429s
  • Parse provider response headers (retry-after, anthropic-ratelimit-*) to get exact backoff instead of guessing
  • Alert at 80% RPM/ITPM utilization — before you hit the wall, not after
  • Per-user quotas prevent a single runaway script from consuming the entire rate limit budget

Should You Build a Custom Rate Limiter?

Don't build this first

Most teams reach for a custom Redis rate limiter before they need one. If you have a single provider, fewer than 5 workers, and no priority ordering requirement, the SDK's built-in retry handles it. Only escalate when you have a genuine multi-worker or multi-provider coordination problem.

Need LLM rate limiting?Single provider +≤ 3 workers?YesSDK retry+ backoffDone ✓NoNeed priorityqueuing?Noaiolimiteror redistoken bucketlib ✓YesCustom Redis rate limitertoken bucket + priority queueMost teams stop hereSDK retrylibrary (simple)

Build a custom rate limiter only when you need priority queuing across multiple workers or providers

ScenarioRight solutionWhy
Single provider, 1–3 workersSDK retry + exponential backoffWorkers are unlikely to collectively exceed limits; coordination overhead isn't worth it
Single provider, 4–20 workers, no prioritiesaiolimiter or redis-py token bucket libExisting libraries handle the distributed case without custom Lua scripts
Multi-provider OR priority queuing neededCustom Redis rate limiter (this article)Libraries don't support cross-provider coordination or priority tiers
Background-only workloads (no SLA)Batch APIRuns off peak, separate rate limit pool, 50% cost discount on Anthropic
Batch API as a pressure valve

Anthropic's Message Batches API has its own rate limit pool separate from the Messages API. Background jobs (summarization, indexing, analytics) that can tolerate hours of latency should use the Batch API — they stop competing with interactive traffic entirely.