Shipping to Production

The risk management framework for taking an agent from prototype to production: eval gates, cost estimation, safe rollout, observability, and incident response.

Quick Reference

→Eval gate first: ≥70% task success on 100-query holdout, zero P0 failures — no exceptions
→Estimate cost before shipping: turns × tokens × price; set a hard token ceiling in code
→Shadow mode before canary: run agent on real traffic, discard responses, compare offline
→Use LangSmith Deployment for managed agent hosting or self-host with Docker + Redis for state
→Track agent-specific metrics: tokens/request, tool failure rate, escalation rate
→Version prompts like code — prompt regressions are the most common silent production failure
→Write the incident runbook and test the kill switch before the first incident

Pre-Launch Gates: What Must Pass Before Production

The eval gate is the step most teams skip

Ship without an eval suite and you have no baseline. No baseline means no regression detection. No regression detection means you'll find out about quality problems from users, not metrics.

Three gates must pass before any production traffic. The order is not arbitrary — each gate builds on evidence the previous one produced.

Gate	Pass criteria	Why it's non-negotiable
Eval suite	≥70% task success on 100-query holdout; zero P0 failures	Sets the regression baseline — you can't detect a decline without a floor to compare against
Cost estimate	Per-conversation cost calculated and accepted by stakeholders	Cost surprises arrive on invoices, not in logs; see next section
Infra checklist	Rate limiting, auth, kill switch, and fallback deployed and smoke-tested	These fail silently in staging and loudly in production

The 70% threshold is a starting point, not a universal standard. A customer support agent might need 90%+ on intent classification; an internal productivity tool might ship at 75%. What matters is that you define the threshold before shipping and treat any regression below it as a rollback trigger — not a warning to monitor.

LangSmith evaluations or pytest — both work

LangSmith's evaluation harness integrates directly with LangGraph and supports online scoring in production. If you prefer open-source, a pytest fixture that runs 100 golden queries and asserts pass rate works equally well. The tool matters less than having one before you ship.
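
As a rough illustration of the pytest route, the gate can be reduced to one function that scores a golden set and one test that asserts on the result. This is a minimal sketch, not a LangSmith or LangGraph API: `GoldenQuery`, `GateReport`, `run_eval_gate`, and the exact-match grading rule are all assumptions made for the example, and a real suite would load its own 100-query holdout and call its own agent.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GoldenQuery:
    """One holdout example (hypothetical shape, for illustration only)."""
    prompt: str
    expected: str
    p0: bool = False  # True if failing this query should block the release


@dataclass(frozen=True)
class GateReport:
    total: int
    passed: int
    p0_failures: int
    threshold: float

    @property
    def success_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0

    @property
    def ok(self) -> bool:
        # Both conditions must hold: success rate at or above the
        # threshold, and no P0 failures at all.
        return (
            self.total > 0
            and self.success_rate >= self.threshold
            and self.p0_failures == 0
        )


def run_eval_gate(
    agent: Callable[[str], str],
    queries: Iterable[GoldenQuery],
    threshold: float = 0.70,
) -> GateReport:
    """Run every golden query and tally passes and P0 failures."""
    total = passed = p0_failures = 0
    for query in queries:
        total += 1
        try:
            # Assumption: exact string match is the grading rule.
            correct = agent(query.prompt).strip() == query.expected
        except Exception:
            correct = False  # a crash counts as a failed query
        if correct:
            passed += 1
        elif query.p0:
            p0_failures += 1
    return GateReport(total, passed, p0_failures, threshold)


def test_eval_gate():
    """pytest entry point; the stub agent stands in for the real one."""
    golden = [GoldenQuery(f"q{i}", "ok", p0=(i == 0)) for i in range(100)]
    stub_agent = lambda prompt: "ok"
    report = run_eval_gate(stub_agent, golden)
    assert report.ok, f"gate failed: {report}"
```

Keeping the threshold as a parameter makes it easy to raise it for a stricter use case without touching the scoring logic.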

Cost Estimation: Know What You'll Spend

The formula is simple: turns × tokens per turn × price per token. The surprise is in the numbers. Most teams underestimate by 2-3× because they base estimates on a single-turn demo, not a multi-turn production conversation with retries, tool calls, and a full system prompt repeated on every request.
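
The turns × tokens × price estimate and the hard token ceiling from the quick reference might look something like the sketch below. Everything here is an assumption for illustration: the class names, the split into input and output tokens, the `retry_multiplier`, and especially the prices, which are placeholders rather than any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversationCost:
    """Per-conversation cost estimate (hypothetical model)."""
    turns: int
    input_tokens_per_turn: int   # system prompt + history + tool results
    output_tokens_per_turn: int
    usd_per_million_input: float   # placeholder price, not a real rate
    usd_per_million_output: float  # placeholder price, not a real rate
    retry_multiplier: float = 1.0  # e.g. 1.2 if roughly 20% of calls retry

    def usd(self) -> float:
        input_cost = (
            self.turns * self.input_tokens_per_turn * self.usd_per_million_input
        )
        output_cost = (
            self.turns * self.output_tokens_per_turn * self.usd_per_million_output
        )
        return (input_cost + output_cost) * self.retry_multiplier / 1_000_000


class TokenCeilingExceeded(RuntimeError):
    """Raised when a conversation would exceed its token budget."""


class TokenBudget:
    """Hard per-conversation ceiling, enforced in code rather than by alert."""

    def __init__(self, ceiling: int) -> None:
        self.ceiling = ceiling
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Refuse the call before spending, so the ceiling is never crossed.
        if self.used + tokens > self.ceiling:
            raise TokenCeilingExceeded(
                f"{self.used} + {tokens} exceeds ceiling of {self.ceiling}"
            )
        self.used += tokens


if __name__ == "__main__":
    demo = ConversationCost(1, 3_000, 500, 3.0, 15.0)
    production = ConversationCost(8, 3_000, 500, 3.0, 15.0, retry_multiplier=1.2)
    print(f"single-turn demo: ${demo.usd():.4f}")
    print(f"8-turn production: ${production.usd():.4f}")
```

Calling `budget.charge(n)` before each model call, and catching `TokenCeilingExceeded` to fall back or escalate, is one way to make the ceiling a control rather than a dashboard number.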

Safe Rollout: Shadow → Canary → GA

Rollout gates — each stage must pass before advancing; any failure requires investigation, not a retry
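
The shadow stage described in the quick reference (run the agent on real traffic, discard its responses, compare offline) could be sketched roughly as follows. The function names, the in-memory log, and the exact-match comparison are assumptions for the example. A production setup would more likely run the shadow call asynchronously so it adds no user-facing latency, and write to durable storage.

```python
from typing import Callable, Dict, List, Optional

ShadowRecord = Dict[str, Optional[str]]


def handle_with_shadow(
    query: str,
    live_handler: Callable[[str], str],
    shadow_agent: Callable[[str], str],
    log: List[ShadowRecord],
) -> str:
    """Serve the live response; run the candidate agent only for logging."""
    live_response = live_handler(query)
    shadow_response: Optional[str] = None
    error: Optional[str] = None
    try:
        shadow_response = shadow_agent(query)
    except Exception as exc:
        # A shadow failure must never affect the user-facing response.
        error = repr(exc)
    log.append(
        {
            "query": query,
            "live": live_response,
            "shadow": shadow_response,
            "error": error,
        }
    )
    # The shadow response is discarded here; only the live one is returned.
    return live_response


def agreement_rate(log: List[ShadowRecord]) -> float:
    """Offline comparison: share of requests where shadow matched live."""
    if not log:
        return 0.0
    matches = sum(
        1 for rec in log if rec["error"] is None and rec["shadow"] == rec["live"]
    )
    return matches / len(log)
```

Exact match is rarely the right comparison for free-text agent output; it stands in here for whatever offline scorer the eval suite already uses.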
