
Scaling to 1M Users

Architecture for high-scale agent systems: horizontal scaling, queue-based execution, state partitioning, and managing LLM rate limits across a fleet.

Quick Reference

  • Use a queue-based architecture: API server enqueues agent tasks, worker pool dequeues and executes
  • Partition state by user/tenant ID across multiple Redis/Postgres instances to avoid single-node bottlenecks
  • Implement request coalescing: batch similar queries to reduce redundant LLM calls
  • Use connection pooling and rate limit coordination across workers to stay within provider API limits
  • Deploy in multiple regions to keep clients close to workers; LLM API latency is already high, so avoid stacking network latency on top of it
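Rate-limit coordination is the subtlest item above. As a single-process sketch of the token-bucket idea (in a real fleet the bucket would live in shared storage such as Redis so that every worker draws from one provider-wide budget; the `TokenBucket` class and its parameters here are illustrative, not a library API):

```python
import threading
import time

class TokenBucket:
    """Single-process token-bucket limiter (illustrative sketch only).

    In production the refill-and-take step would run atomically in
    shared storage (e.g. a Redis Lua script) so all workers share
    one budget per LLM provider.
    """

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = capacity        # burst ceiling
        self.tokens = float(capacity)   # start full
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self, tokens: int = 1) -> bool:
        """Take `tokens` if available; never blocks."""
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(
                self.capacity,
                self.tokens + (now - self.updated) * self.rate,
            )
            self.updated = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

# Gate each outbound LLM call on the bucket: with capacity 5, the
# first 5 immediate attempts succeed and the rest must wait for refill.
bucket = TokenBucket(rate_per_sec=1, capacity=5)
allowed = [bucket.try_acquire() for _ in range(10)]
```

A denied `try_acquire` is the worker's cue to back off and retry, or to requeue the task rather than burn its slot spinning.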

Queue Architecture

Decouple ingestion from execution

A queue-based architecture separates the API server (which accepts requests quickly) from the worker pool (which executes slow agent runs). This lets you scale each layer independently and absorb traffic spikes without dropping requests.
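The split can be sketched with a stdlib queue standing in for the real broker; `api_server_enqueue`, `worker_loop`, and the `results` dict are illustrative names, not part of any library:

```python
import queue
import threading

# In production the queue is a broker (Redis, RabbitMQ, SQS); a stdlib
# queue stands in here so the enqueue/dequeue split is visible.
task_queue: "queue.Queue" = queue.Queue()
results: dict = {}

def api_server_enqueue(task_id: str, prompt: str) -> None:
    """API layer: validate, enqueue, return immediately. No LLM call here."""
    task_queue.put({"id": task_id, "prompt": prompt})

def worker_loop() -> None:
    """Worker layer: dequeue and run the slow agent step."""
    while True:
        task = task_queue.get()
        if task is None:            # shutdown sentinel
            task_queue.task_done()
            break
        # Stand-in for the actual agent execution (LLM calls, tools).
        results[task["id"]] = f"handled: {task['prompt']}"
        task_queue.task_done()

# The worker pool scales horizontally: add threads (or, in production,
# processes/machines) without touching the API layer.
workers = [threading.Thread(target=worker_loop) for _ in range(3)]
for w in workers:
    w.start()

api_server_enqueue("t1", "summarize report")
api_server_enqueue("t2", "draft email")
task_queue.join()                   # wait until all queued work is done
for _ in workers:
    task_queue.put(None)            # one shutdown sentinel per worker
for w in workers:
    w.join()
```

The API layer returns as soon as `put` succeeds, which is what lets it absorb a spike: the queue depth grows, but no request is dropped.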

[Architecture diagram: API Gateway (load balancer) → Message Queue (async buffering) → Worker Pool (Workers 1–3, scales horizontally) → Rate Limiter (token bucket) → LLM API (external service); all workers read/write a shared Redis state store holding state, checkpoints, and locks]

Scaling arch: queue-backed workers, rate limiting, shared Redis state

Queue-based agent execution with Celery + Redis
  • Use priority queues to differentiate between real-time chat (high) and batch processing (low)
  • Set task timeouts (soft_time_limit) to prevent hung agents from blocking workers
  • Implement result polling or WebSocket push so clients get responses without long-polling
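A minimal Celery sketch of these settings, assuming a local Redis broker; the URLs, priority levels, and the `run_agent` task body are placeholders for your deployment:

```python
from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

# Broker/backend URLs are placeholders for your Redis deployment.
app = Celery(
    "agents",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

# With the Redis broker, Celery emulates priorities via multiple queues;
# priority_steps sets how many levels exist (0 = highest priority).
app.conf.broker_transport_options = {"priority_steps": list(range(10))}

@app.task(soft_time_limit=60, time_limit=90)  # seconds
def run_agent(task_id: str, prompt: str) -> str:
    try:
        # ... actual agent loop (LLM calls, tool use) goes here ...
        return f"done: {task_id}"
    except SoftTimeLimitExceeded:
        # Soft limit hit: checkpoint partial state so another worker
        # can resume, before the hard time_limit kills the process.
        return f"timed out: {task_id}"

# Enqueue from the API layer: low numbers for real-time chat,
# high numbers for batch processing.
# run_agent.apply_async(args=["t1", "hello"], priority=0)    # real-time
# run_agent.apply_async(args=["b1", "nightly"], priority=9)  # batch
```

The gap between `soft_time_limit` and `time_limit` is the window a hung agent gets to checkpoint before Celery terminates it, which keeps one stuck run from blocking a worker indefinitely.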