
Shipping to Production

The risk management framework for taking an agent from prototype to production: eval gates, cost estimation, safe rollout, observability, and incident response.

Quick Reference

  • Eval gate first: ≥70% task success on 100-query holdout, zero P0 failures — no exceptions
  • Estimate cost before shipping: turns × tokens × price; set a hard token ceiling in code
  • Shadow mode before canary: run agent on real traffic, discard responses, compare offline
  • Use LangSmith Deployment for managed agent hosting or self-host with Docker + Redis for state
  • Track agent-specific metrics: tokens/request, tool failure rate, escalation rate
  • Version prompts like code — prompt regressions are the most common silent production failure
  • Write the incident runbook and test the kill switch before the first incident
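
The cost bullet above (turns × tokens × price, plus a hard token ceiling) can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: `CostEstimate`, `TokenBudget`, and `TokenBudgetExceeded` are hypothetical names, and the price used in the usage note is a placeholder rather than any provider's real rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    """Back-of-envelope per-conversation cost. All inputs are assumptions."""
    turns: int                 # assumed average turns per conversation
    tokens_per_turn: int       # assumed average tokens (input + output) per turn
    usd_per_1k_tokens: float   # placeholder blended price, not a real quote

    def per_conversation_usd(self) -> float:
        # turns x tokens x price, with price expressed per 1,000 tokens
        return self.turns * self.tokens_per_turn * self.usd_per_1k_tokens / 1000


class TokenBudgetExceeded(RuntimeError):
    """Raised when a charge would push usage past the hard ceiling."""


class TokenBudget:
    """Hard per-conversation token ceiling, enforced in code."""

    def __init__(self, ceiling: int) -> None:
        self.ceiling = ceiling
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Reject the charge before recording it, so `used` never exceeds the ceiling.
        if self.used + tokens > self.ceiling:
            raise TokenBudgetExceeded(
                f"{self.used + tokens} tokens would exceed the ceiling of {self.ceiling}"
            )
        self.used += tokens
```

With assumed inputs of 8 turns, 2,500 tokens per turn, and a placeholder price of $0.01 per 1,000 tokens, `CostEstimate(8, 2500, 0.01).per_conversation_usd()` comes out to about $0.20 per conversation. Calling `TokenBudget.charge` before each model call is one way to make the ceiling a hard stop rather than an alert.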

Pre-Launch Gates: What Must Pass Before Production

The eval gate is the step most teams skip

Ship without an eval suite and you have no baseline. No baseline means no regression detection. No regression detection means you'll find out about quality problems from users, not metrics.

Three gates must pass before any production traffic. The order is not arbitrary — each gate builds on evidence the previous one produced.

Gate | Pass criteria | Why it's non-negotiable
Eval suite | ≥70% task success on 100-query holdout; zero P0 failures | Sets the regression baseline: you can't detect a decline without a floor to compare against
Cost estimate | Per-conversation cost calculated and accepted by stakeholders | Cost surprises arrive on invoices, not in logs; see next section
Infra checklist | Rate limiting, auth, kill switch, and fallback deployed and smoke-tested | These fail silently in staging and loudly in production

The 70% threshold is a starting point, not a universal standard. A customer support agent might need 90%+ on intent classification; an internal productivity tool might ship at 75%. What matters is that you define the threshold before shipping and treat any regression below it as a rollback trigger — not a warning to monitor.
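
One way to make the threshold a hard gate rather than a dashboard number is to encode it as a pass/fail check. The sketch below is illustrative only: `EvalResult`, its `p0_failure` flag, and `passes_eval_gate` are hypothetical names, and how a P0 failure is detected is left to your own eval suite.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalResult:
    query_id: str
    passed: bool
    p0_failure: bool = False  # hypothetical flag set by your own severity rules


def passes_eval_gate(results: Iterable[EvalResult], threshold: float = 0.70) -> bool:
    """Return True only if the success rate meets the threshold with zero P0 failures."""
    results = list(results)
    if not results:
        return False  # no evidence is treated as a failed gate
    if any(r.p0_failure for r in results):
        return False  # a single P0 failure blocks the release
    success_rate = sum(r.passed for r in results) / len(results)
    return success_rate >= threshold
```

The same function can serve as the rollback trigger: run it against the holdout set after every prompt or model change, with `threshold` set to whatever value you committed to before shipping.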

LangSmith evaluations or pytest — both work

LangSmith's evaluation harness integrates directly with LangGraph and supports online scoring in production. If you prefer open source, a pytest test that runs 100 golden queries and asserts a minimum pass rate covers the pre-launch gate just as well. The tool matters less than having one before you ship.
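
A minimal sketch of the pytest route might look like the following. Everything here is a stand-in: `GOLDEN_QUERIES` holds three made-up cases instead of 100, `run_agent` returns canned answers in place of a real agent call, and substring matching is only one possible grading rule. It is written as a plain `test_` function with no pytest import, which pytest collects by name; a real suite would likely load the golden set through a fixture.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical golden set: (query, substring the answer must contain).
GOLDEN_QUERIES = [
    ("What is the refund window?", "30 days"),
    ("How do I reset my password?", "reset link"),
    ("Which plan includes SSO?", "Enterprise"),
]

PASS_THRESHOLD = 0.70  # the threshold you committed to before shipping


def run_agent(query: str) -> str:
    """Stand-in for the real agent call; replace with your own invocation."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "How do I reset my password?": "We will email you a reset link.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    }
    return canned.get(query, "")


def pass_rate(agent: Callable[[str], str], cases: Sequence[Tuple[str, str]]) -> float:
    """Fraction of cases whose answer contains the expected substring."""
    passed = sum(1 for query, expected in cases if expected in agent(query))
    return passed / len(cases)


def test_golden_pass_rate() -> None:
    rate = pass_rate(run_agent, GOLDEN_QUERIES)
    assert rate >= PASS_THRESHOLD, (
        f"pass rate {rate:.0%} is below the {PASS_THRESHOLD:.0%} gate"
    )
```

Running this in CI on every prompt or model change turns the pass rate into a blocking check instead of a number someone has to remember to look at.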