Shipping to Production

The complete checklist for taking an agent from prototype to production: infrastructure, monitoring, rollout strategies, and incident response.

Quick Reference

  • Production readiness checklist: rate limiting, auth, input validation, output guardrails, monitoring, and alerting
  • Use LangGraph Platform for managed deployment or self-host with Docker + Redis for state persistence
  • Implement canary deployments: route 5% of traffic to the new agent version, monitor, then ramp up
  • Set up alerting on key metrics: p95 latency, error rate, token usage per request, and tool failure rate
  • Keep a human escalation path for every agent — production agents must be able to say 'I need a human'
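The canary split in the list above can be sketched as a deterministic, hash-based router. This is a minimal illustration, not a LangGraph feature: the names `bucket_for`, `choose_version`, and `CANARY_PERCENT` are hypothetical, and the code assumes each request carries a stable user ID.

```python
import hashlib

CANARY_PERCENT = 5  # share of traffic routed to the new agent version


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(user_id: str, canary_percent: int = CANARY_PERCENT) -> str:
    """Return "canary" for users inside the canary slice, otherwise "stable"."""
    return "canary" if bucket_for(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(choose_version(u) == "canary" for u in users) / len(users)
    print(f"canary share: {share:.1%}")
```

Hashing the user ID rather than picking at random keeps a user on the same version across turns of a conversation. Ramping up only means raising `canary_percent`: users already in the canary slice stay in it.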

Production Readiness Checklist

Production readiness != feature complete

A production-ready agent is one that fails gracefully, not one that never fails. The checklist ensures you have handled the failure paths.
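One way to picture "fails gracefully" is a thin wrapper that turns any agent failure into a human handoff instead of an unhandled exception. The sketch below is illustrative only: `run_with_fallback`, `FALLBACK_MESSAGE`, and the `agent_enabled` flag are hypothetical names, and the agent is assumed to be a plain callable that takes a string and returns a string.

```python
from typing import Callable, Optional

FALLBACK_MESSAGE = "I need a human to help with this. Routing you to support."


def run_with_fallback(
    agent: Callable[[str], str],
    user_input: str,
    agent_enabled: bool = True,
) -> dict:
    """Run the agent, degrading to a human handoff instead of raising."""
    reason: Optional[str] = None
    if not agent_enabled:
        # Flag turned off by an operator: skip the agent without a redeploy.
        reason = "agent_disabled"
    else:
        try:
            return {"reply": agent(user_input), "escalated": False, "reason": None}
        except Exception as exc:
            # Broad on purpose: any agent failure should degrade, not crash.
            reason = type(exc).__name__
    return {"reply": FALLBACK_MESSAGE, "escalated": True, "reason": reason}


if __name__ == "__main__":
    def flaky_agent(prompt: str) -> str:
        raise TimeoutError("model call timed out")

    result = run_with_fallback(flaky_agent, "Where is my order?")
    print(result["escalated"], result["reason"])
```

Returning the failure reason alongside the reply lets the caller log why the handoff happened without exposing internals to the user.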

Category | Must-Have | Priority
Auth & Access | API key rotation, per-user rate limits, RBAC on agent actions | P0
Input Validation | Schema validation on all tool inputs, prompt injection detection | P0
Output Guardrails | PII filtering, hallucination detection, response length limits | P0
Monitoring | Structured logging with trace IDs, latency percentiles, token usage dashboards | P0
Alerting | Error rate > 5%, p95 latency > 10s, cost per request spike | P0
Kill Switch | Feature flag to disable agent instantly without redeployment | P0
Fallback | Graceful degradation to static response or human handoff | P1
Load Testing | Validated at 2x expected peak traffic with multi-turn conversations | P1
Runbook | Documented incident response for top 5 failure scenarios | P1
Compliance | Audit logging for all LLM inputs/outputs, data retention policy | P2
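The two numeric alerting thresholds in the checklist (error rate above 5%, p95 latency above 10s) can be evaluated over a window of recent requests. The sketch below is a hedged illustration: `RequestRecord`, `p95`, and `check_alerts` are hypothetical names, the percentile uses the nearest-rank method, and a real deployment would more likely express these rules in its monitoring system than in application code.

```python
from dataclasses import dataclass
from typing import List

ERROR_RATE_THRESHOLD = 0.05      # alert when more than 5% of requests fail
P95_LATENCY_THRESHOLD_S = 10.0   # alert when p95 latency exceeds 10 seconds


@dataclass
class RequestRecord:
    latency_s: float
    ok: bool


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check_alerts(window: List[RequestRecord]) -> List[str]:
    """Return one message per threshold breached in the window."""
    if not window:
        return []
    alerts = []
    error_rate = sum(not r.ok for r in window) / len(window)
    if error_rate > ERROR_RATE_THRESHOLD:
        alerts.append(f"error rate {error_rate:.1%} exceeds 5%")
    latency = p95([r.latency_s for r in window])
    if latency > P95_LATENCY_THRESHOLD_S:
        alerts.append(f"p95 latency {latency:.1f}s exceeds 10s")
    return alerts


if __name__ == "__main__":
    window = [RequestRecord(1.0, True)] * 94 + [RequestRecord(12.0, False)] * 6
    for message in check_alerts(window):
        print(message)
```

The cost-per-request spike from the same row is left out because the checklist gives no numeric threshold for it.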

Complete every P0 item before serving any production traffic, and every P1 item before ramping past 10% of users. P2 items are required in regulated industries and recommended for everyone else.
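These gating rules can be encoded as a small function that a rollout script consults before raising the traffic share. This is a sketch under stated assumptions: `CHECKLIST`, `tier_complete`, and `max_rollout_percent` are hypothetical names, and treating an incomplete P2 tier as launch-blocking for regulated deployments is one interpretation of "required", not something the checklist spells out.

```python
from typing import Set

# Checklist categories mapped to their priority tier.
CHECKLIST = {
    "Auth & Access": "P0",
    "Input Validation": "P0",
    "Output Guardrails": "P0",
    "Monitoring": "P0",
    "Alerting": "P0",
    "Kill Switch": "P0",
    "Fallback": "P1",
    "Load Testing": "P1",
    "Runbook": "P1",
    "Compliance": "P2",
}


def tier_complete(priority: str, done: Set[str]) -> bool:
    """True when every checklist item in the given tier is done."""
    return all(item in done for item, tier in CHECKLIST.items() if tier == priority)


def max_rollout_percent(done: Set[str], regulated: bool = False) -> int:
    """Highest share of users the checklist allows, as a percentage."""
    if not tier_complete("P0", done):
        return 0  # no production traffic until every P0 item is done
    if regulated and not tier_complete("P2", done):
        return 0  # assumption: P2 blocks launch for regulated deployments
    if not tier_complete("P1", done):
        return 10  # P1 must be complete before ramping past 10%
    return 100


if __name__ == "__main__":
    done = {item for item, tier in CHECKLIST.items() if tier == "P0"}
    print(max_rollout_percent(done))
```

Keeping the gate in code means a canary ramp cannot exceed the cap by accident, whatever the rollout tooling requests.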