Advanced · 14 min read
Scaling to 1M Users
Architecture for high-scale agent systems: horizontal scaling, queue-based execution, state partitioning, and managing LLM rate limits across a fleet.
Quick Reference
- Use a queue-based architecture: the API server enqueues agent tasks, a worker pool dequeues and executes them
- Partition state by user/tenant ID across multiple Redis/Postgres instances to avoid single-node bottlenecks
- Implement request coalescing: batch similar queries to reduce redundant LLM calls
- Use connection pooling and rate-limit coordination across workers to stay within provider API limits
- Deploy in multiple regions to reduce latency: LLM API latency is already high, so don't add network latency on top
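Request coalescing can be sketched in-process: concurrent callers asking the same question share one in-flight call instead of each hitting the provider. A minimal thread-safe version, assuming a caller-supplied `fetch` function standing in for the LLM client:

```python
import threading
from concurrent.futures import Future

class Coalescer:
    """Deduplicate concurrent identical requests: the first caller issues the
    real fetch; later callers with the same key wait on the same Future."""

    def __init__(self, fetch):
        self._fetch = fetch            # the expensive call (e.g. an LLM request)
        self._lock = threading.Lock()
        self._inflight = {}            # key -> Future shared by all waiters

    def get(self, key):
        with self._lock:
            fut = self._inflight.get(key)
            if fut is not None:
                return fut.result()    # join the call already in flight
            fut = Future()
            self._inflight[key] = fut
        try:
            fut.set_result(self._fetch(key))
        except Exception as exc:
            fut.set_exception(exc)     # waiters see the same failure
        finally:
            with self._lock:
                self._inflight.pop(key, None)
        return fut.result()
```

In a multi-worker deployment the in-flight map would live in shared storage (e.g. Redis) rather than process memory, but the join-the-existing-call pattern is the same.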
Queue Architecture
Decouple ingestion from execution
A queue-based architecture separates the API server, which accepts requests quickly, from the worker pool, which executes agent runs slowly. This lets you scale each layer independently and absorb traffic spikes without dropping requests.
Figure: scaling architecture with queue-backed workers, rate limiting, and shared Redis state
Queue-based agent execution with Celery + Redis
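A full Celery setup needs a running Redis broker, but the enqueue/dequeue split it implements can be sketched with the standard library alone. Here `run_agent` is a hypothetical stand-in for a slow agent run:

```python
import queue
import threading

# Sketch of the ingestion/execution split: the "API layer" enqueues and returns
# immediately; a pool of workers drains the queue. In production this role is
# played by Celery workers consuming from a Redis broker.
task_queue = queue.Queue()
results = {}

def run_agent(prompt):
    # Stand-in for a slow LLM agent run (assumption, not a real API).
    return f"answer:{prompt}"

def enqueue(task_id, prompt):
    # API layer: fast, never blocks on the LLM.
    task_queue.put((task_id, prompt))
    return {"task_id": task_id, "status": "queued"}

def worker():
    # Worker layer: scale by adding threads/processes/machines.
    while True:
        task_id, prompt = task_queue.get()
        if task_id is None:        # sentinel to shut the worker down
            task_queue.task_done()
            break
        results[task_id] = run_agent(prompt)
        task_queue.task_done()

pool = [threading.Thread(target=worker, daemon=True) for _ in range(2)]
for t in pool:
    t.start()
```

The key property to preserve in the real system is that `enqueue` returns in microseconds regardless of how long `run_agent` takes, so the API tier can absorb a spike while the worker tier catches up.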
- Use priority queues to differentiate real-time chat (high priority) from batch processing (low priority)
- Set task timeouts (soft_time_limit) so hung agents can't block workers indefinitely
- Implement result polling or WebSocket push so clients receive responses without long-polling
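The first two points above can be sketched without a broker: a priority queue where lower numbers dequeue first, and a soft-limit wrapper that stops waiting on a hung task. The priority values and the wrapper are illustrative assumptions, not Celery's actual API:

```python
import queue
import threading

CHAT, BATCH = 0, 10                    # lower value = dequeued first

def run_with_soft_limit(fn, args=(), soft_time_limit=1.0):
    """Run fn in a thread and give up after soft_time_limit seconds:
    a rough analogue of what Celery's soft_time_limit protects against."""
    result = {}
    t = threading.Thread(
        target=lambda: result.setdefault("value", fn(*args)),
        daemon=True,                   # a hung task is abandoned, not killed
    )
    t.start()
    t.join(soft_time_limit)
    if t.is_alive():
        return ("timeout", None)       # worker moves on to the next task
    return ("ok", result.get("value"))

# Priority queue: chat work jumps ahead of batch work regardless of arrival order.
pq = queue.PriorityQueue()
pq.put((BATCH, "nightly-summary"))
pq.put((CHAT, "user-chat-123"))
```

In Celery the equivalents are per-queue routing (separate `chat` and `batch` queues with dedicated workers) and the `soft_time_limit` task option; the sketch only shows why both knobs matter.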