
Deployment Architectures

Choosing between serverless, containerized, and long-running deployment models for AI agents. Load balancing stateful agents, WebSocket vs SSE for streaming, and self-hosted infrastructure patterns.

Quick Reference

  • Serverless (Lambda, Cloud Functions) works for simple, short-lived agents but struggles with long-running graphs and state
  • Containerized (ECS, Cloud Run, Kubernetes) gives you control over memory, concurrency, and persistent connections
  • Long-running workers with a queue (SQS, Redis) decouple request ingestion from agent execution for independent scaling
  • Use SSE for server-to-client streaming (simpler, HTTP-native) and WebSockets only when you need bidirectional real-time communication
  • Stateful agents behind a load balancer require either sticky sessions or externalized state
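The SSE recommendation above can be sketched with nothing but the standard library. This is a minimal illustration, not a production server: `AgentStreamHandler` and the hard-coded token list are stand-ins for a real agent loop, and a production deployment would use a proper framework behind the load balancer.

```python
# Minimal SSE sketch: the agent's tokens are pushed to the client over a
# one-way, HTTP-native stream. Stdlib only; the token list is a stand-in
# for a real agent generating output incrementally.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def sse_event(data: str) -> str:
    # One SSE frame: "data: <payload>" terminated by a blank line.
    return f"data: {data}\n\n"

class AgentStreamHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Cache-Control", "no-cache")
        self.end_headers()
        # Stand-in for an agent loop yielding tokens as they arrive.
        for token in ["Thinking", " about", " your", " question..."]:
            self.wfile.write(sse_event(token).encode())
            self.wfile.flush()  # push each event immediately, no buffering
            time.sleep(0.05)
        self.wfile.write(sse_event("[DONE]").encode())

# To run locally (blocks forever):
# HTTPServer(("", 8000), AgentStreamHandler).serve_forever()
```

Because SSE is plain HTTP, it passes through proxies and load balancers that would need special configuration for WebSocket upgrades.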

Deployment Models Overview

Three models

Production agents generally deploy as serverless functions, containers, or long-running workers — each with distinct tradeoffs around latency, cost, and operational complexity.

  • Serverless (Lambda / Cloud Function): event-driven, cold starts, < 30 s timeouts, auto-scales to zero, no persistent state. Best for simple chatbots and quick Q&A.
  • Containerized (Docker container): pre-warmed instances, runs for minutes to hours, horizontal scaling with health checks. Best for multi-step agents and APIs.
  • Long-running workers (queue + worker): persistent, always-running processes, runs for hours to days, durable execution with checkpointed state. Best for batch jobs and data pipelines.

Complexity and typical run duration increase from serverless to long-running workers.

Deployment models: serverless vs containerized vs long-running workers

Serverless is cheapest at low traffic but hits hard limits on execution time and memory. Containers give you full control but require capacity planning. Long-running workers with queues decouple ingestion from execution and scale each independently.
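The ingestion/execution split can be sketched in a few lines. Here an in-process `queue.Queue` stands in for a durable broker like SQS or Redis, and the worker body stands in for a long-running agent graph; the names (`enqueue`, `worker`) are illustrative, not from any library.

```python
# Sketch of queue-decoupled execution: the ingestion path accepts a
# request and returns a job id immediately; workers pull jobs and run
# them on their own schedule, so each side scales independently.
import queue
import threading
import uuid

jobs = queue.Queue()   # stand-in for SQS / Redis
results = {}           # stand-in for a results store

def enqueue(prompt: str) -> str:
    """Ingestion path: accept the request, hand back a job id."""
    job_id = str(uuid.uuid4())
    jobs.put({"id": job_id, "prompt": prompt})
    return job_id

def worker():
    """Execution path: a persistent process that drains the queue."""
    while True:
        job = jobs.get()
        # Stand-in for a long-running, checkpointed agent run.
        results[job["id"]] = f"answer to: {job['prompt']}"
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = enqueue("summarize the quarterly report")
jobs.join()  # in production the client would poll or subscribe instead
print(results[job_id])  # -> answer to: summarize the quarterly report
```

With a durable broker, a worker crash mid-job just returns the message to the queue, which is what makes this model suitable for hours-long agent runs.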

Model               | Cold Start  | Max Runtime | State Management        | Best For
--------------------|-------------|-------------|-------------------------|------------------------------
Serverless          | 100-500 ms  | 5-15 min    | External only           | Simple, short-lived agents
Containerized       | 0 ms (warm) | Unlimited   | Local or external       | Most production workloads
Long-running worker | 0 ms (warm) | Unlimited   | External (queue-backed) | High-throughput, async agents
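"External only" and "queue-backed" state boil down to the same pattern: every turn loads and saves conversation state by session id, so any replica behind the load balancer can serve any request. A minimal sketch, assuming a key-value store; the dict and the `handle_turn` helper are hypothetical stand-ins for Redis/DynamoDB and a real agent turn.

```python
# Sketch of state externalization: no replica holds conversation state
# in memory between requests, so no sticky sessions are needed.
import json

store = {}  # stand-in for an external KV store (Redis, DynamoDB, ...)

def load_state(session_id: str) -> dict:
    raw = store.get(session_id)
    return json.loads(raw) if raw else {"messages": []}

def save_state(session_id: str, state: dict) -> None:
    # Serialize so the store only ever sees plain strings/bytes.
    store[session_id] = json.dumps(state)

def handle_turn(session_id: str, user_msg: str) -> dict:
    state = load_state(session_id)  # any replica can do this
    state["messages"].append({"role": "user", "content": user_msg})
    # ... agent runs here and appends its reply ...
    save_state(session_id, state)
    return state

state = handle_turn("sess-42", "hello")
print(len(state["messages"]))  # -> 1
```

The tradeoff is a load/save round-trip per turn; sticky sessions avoid that latency but tie a session to one replica, which breaks on scale-down or crash.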