Shipping to Production
The risk management framework for taking an agent from prototype to production: eval gates, cost estimation, safe rollout, observability, and incident response.
Quick Reference
- Eval gate first: ≥70% task success on 100-query holdout, zero P0 failures, no exceptions
- Estimate cost before shipping: turns × tokens × price; set a hard token ceiling in code
- Shadow mode before canary: run agent on real traffic, discard responses, compare offline
- Use LangSmith Deployment for managed agent hosting or self-host with Docker + Redis for state
- Track agent-specific metrics: tokens/request, tool failure rate, escalation rate
- Version prompts like code; prompt regressions are the most common silent production failure
- Write the incident runbook and test the kill switch before the first incident
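The cost bullet above comes down to simple arithmetic plus a guard in code. The following is a minimal sketch, not a provider API: the function, the `TokenBudget` class, and the split into input and output prices are all hypothetical, and the per-token figures are per-turn averages you would measure yourself.

```python
from dataclasses import dataclass


def estimate_conversation_cost(
    turns: int,
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate dollars per conversation as turns x tokens x price.

    Prices are dollars per million tokens. Token counts are assumed
    per-turn averages; real agents often grow their context each turn.
    """
    input_cost = turns * input_tokens_per_turn * input_price_per_mtok / 1_000_000
    output_cost = turns * output_tokens_per_turn * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


class TokenBudgetExceeded(RuntimeError):
    """Raised when a conversation would exceed its hard token ceiling."""


@dataclass
class TokenBudget:
    """Hypothetical per-conversation ceiling, checked before each model call."""

    ceiling: int
    used: int = 0

    def charge(self, tokens: int) -> None:
        # Refuse the charge rather than silently overspending.
        if self.used + tokens > self.ceiling:
            raise TokenBudgetExceeded(
                f"{self.used + tokens} tokens would exceed ceiling of {self.ceiling}"
            )
        self.used += tokens
```

With placeholder prices of 3.0 and 15.0 dollars per million input and output tokens, `estimate_conversation_cost(8, 2000, 500, 3.0, 15.0)` comes to roughly 0.108 dollars per conversation. Calling `charge` before each model call turns the ceiling into a hard stop instead of a dashboard alert.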
Pre-Launch Gates: What Must Pass Before Production
Ship without an eval suite and you have no baseline. No baseline means no regression detection. No regression detection means you'll find out about quality problems from users, not metrics.
Three gates must pass before any production traffic. The order is not arbitrary — each gate builds on evidence the previous one produced.
| Gate | Pass criteria | Why it's non-negotiable |
|---|---|---|
| Eval suite | ≥70% task success on 100-query holdout; zero P0 failures | Sets the regression baseline — you can't detect a decline without a floor to compare against |
| Cost estimate | Per-conversation cost calculated and accepted by stakeholders | Cost surprises arrive on invoices, not in logs; see next section |
| Infra checklist | Rate limiting, auth, kill switch, and fallback deployed and smoke-tested | These fail silently in staging and loudly in production |
The 70% threshold is a starting point, not a universal standard. A customer support agent might need 90%+ on intent classification; an internal productivity tool might ship at 75%. What matters is that you define the threshold before shipping and treat any regression below it as a rollback trigger — not a warning to monitor.
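One way to make the threshold binding is to encode it as a function that CI or a deploy script calls. This is a sketch under assumptions: `EvalResult`, `GateDecision`, and `evaluate_gate` are hypothetical names, and how a result gets marked as a P0 failure is left to your own grading logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    query_id: str
    success: bool
    p0_failure: bool = False  # set by your own grader; criteria are an assumption


@dataclass
class GateDecision:
    success_rate: float
    p0_failures: int
    passed: bool


def evaluate_gate(results: List[EvalResult], threshold: float = 0.70) -> GateDecision:
    """Pass only if the success rate meets the threshold and there are no P0 failures."""
    if not results:
        raise ValueError("eval gate needs at least one result")
    successes = sum(1 for r in results if r.success)
    p0_failures = sum(1 for r in results if r.p0_failure)
    success_rate = successes / len(results)
    passed = success_rate >= threshold and p0_failures == 0
    return GateDecision(success_rate, p0_failures, passed)
```

The same function can serve as the rollback trigger after launch: run it on the holdout set for each new prompt or model version, and treat `passed == False` as a reason to roll back rather than a metric to watch.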
LangSmith's evaluation harness integrates directly with LangGraph and supports online scoring in production. If you prefer an open-source route, a pytest test that runs 100 golden queries and asserts on the pass rate works equally well as a gate. The tool matters less than having one before you ship.