Advanced14 min

Rate Limiting & Quota Management

Build a production rate limiter for LLM agent fleets — from the decision of whether to build one at all, through Redis token buckets and priority queuing, to the cache-aware ITPM optimization that multiplies your effective throughput for free.

Quick Reference

→LLM providers enforce three separate dimensions: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM) — hitting any one triggers a 429
→Anthropic's cache-aware ITPM: cached input tokens do NOT count toward ITPM limits. At 80% cache hit rate, your effective throughput is 5× the stated limit
→A Redis token bucket shared across all workers is the only correct architecture — per-process tracking always overestimates available capacity
→Priority queues turn 429 errors into delays: interactive requests get served before background jobs, never dropped
→Model fallback chains degrade gracefully: primary rate-limited → secondary → tertiary, with per-model cooldowns after 429s
→Parse provider response headers (retry-after, anthropic-ratelimit-*) to get exact backoff instead of guessing
→Alert at 80% RPM/ITPM utilization — before you hit the wall, not after
→Per-user quotas prevent a single runaway script from consuming the entire rate limit budget

Should You Build a Custom Rate Limiter?

Don't build this first

Most teams reach for a custom Redis rate limiter before they need one. If you have a single provider, fewer than 5 workers, and no priority ordering requirement, the SDK's built-in retry handles it. Only escalate when you have a genuine multi-worker or multi-provider coordination problem.

Build a custom rate limiter only when you need priority queuing across multiple workers or providers

Scenario	Right solution	Why
Single provider, 1–3 workers	SDK retry + exponential backoff	Workers are unlikely to collectively exceed limits; coordination overhead isn't worth it
Single provider, 4–20 workers, no priorities	aiolimiter or redis-py token bucket lib	Existing libraries handle the distributed case without custom Lua scripts
Multi-provider OR priority queuing needed	Custom Redis rate limiter (this article)	Libraries don't support cross-provider coordination or priority tiers
Background-only workloads (no SLA)	Batch API	Runs off peak, separate rate limit pool, 50% cost discount on Anthropic

Batch API as a pressure valve

Anthropic's Message Batches API has its own rate limit pool separate from the Messages API. Background jobs (summarization, indexing, analytics) that can tolerate hours of latency should use the Batch API — they stop competing with interactive traffic entirely.

How LLM Rate Limits Actually Work

Three dimensions, not one

Providers enforce RPM, ITPM, and OTPM independently. Your system can hit the ITPM wall while RPM has plenty of headroom — a single large prompt can exhaust your token budget in one request.

Centralized Rate Limiter (Redis Token Bucket)

Single source of truth

One Redis key per provider per limit dimension. Every worker reads and writes the same key before calling the LLM. No exceptions — a worker that tracks limits locally will always overestimate what's available.

Sign in to read this article

This is a premium article. Sign in with your Google account to continue.