Advanced · 16 min

Prompt Management

Decide whether to build or buy a prompt management system, version prompts without blowing your cache budget, deploy changes through eval-gated CI/CD, A/B test with statistical rigor, and monitor for prompt drift before users notice.

Quick Reference

  • Build vs. buy: most teams should start with a platform (LangSmith Hub, Promptfoo, Braintrust) — build custom only for hard compliance requirements
  • Prompt versioning: every version needs content hash, author, change description, linked eval results, and a status lifecycle (draft → testing → active → deprecated)
  • Cache invalidation cost: changing a prompt invalidates Anthropic's 5-minute cache — at 10K req/hr on Sonnet 4.6, each prompt change costs ~$16 in cache misses during the refill window
  • Eval-gated CI/CD: run a 50-case smoke eval on every prompt PR; block merge if quality regresses below baseline
  • A/B testing: use a proportions z-test (p < 0.05), not a fixed improvement threshold — naive thresholds produce false positives ~30% of the time at typical sample sizes
  • Template variables need max_length limits and type validation — an unvalidated template variable is a prompt injection surface
  • Prompt drift: re-run your full eval suite weekly against the active prompt; prompts degrade as user patterns shift even when the text doesn't change
  • Rollback speed: the rollback path must be faster than the deployment path — under 10 seconds for registry-backed prompts
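The A/B testing bullet above can be made concrete with a small sketch of a two-sided, two-proportion z-test using only the Python standard library. The function name and the sample counts are illustrative assumptions, not part of any platform's API; a statistics library would normally be used instead of hand-rolling this.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two pass rates.

    Returns (z, p_value). Variant A is the control prompt and B the candidate.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both arms are all-pass or all-fail: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative numbers only: an 80% vs. 84% pass rate at two sample sizes.
z_small, p_small = two_proportion_z_test(400, 500, 420, 500)
z_large, p_large = two_proportion_z_test(4000, 5000, 4200, 5000)
print(f"500 per arm:  z={z_small:.2f}, p={p_small:.3f}")
print(f"5000 per arm: z={z_large:.2f}, p={p_large:.2g}")
```

With these made-up counts, the same four-point lift is not significant at 500 samples per arm (p is roughly 0.10) but is clearly significant at 5,000 per arm, which is the kind of case a fixed "ship if it improves by N points" rule gets wrong.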

When You Need Prompt Management (and When You Don't)

Not every agent needs a prompt management system. A string constant in your codebase is fine when you have one developer, one prompt, and fewer than a few hundred daily users. The question is whether prompt iteration speed is bottlenecked by your deployment cycle — and whether a prompt change gone wrong can cause real damage before you catch it.

Scenario | Prompt Count | Change Frequency | Recommendation
Solo dev, prototype, internal tool | 1–2 | Infrequent | Hardcode in source; don't over-engineer
Small team, single production agent | 3–8 | Weekly | Config file or env vars: simple, no infrastructure
Multi-prompt production agent, multiple devs | 8–20 | Daily | Managed platform (LangSmith Hub, Braintrust, PromptLayer)
Multi-team, shared prompts, compliance requirements | 20+ | Continuous | Custom registry: you need audit trails the platforms may not provide

The real trigger: needing a 2-minute fix without a 2-hour deploy

The moment you need to change a prompt because of a live production issue — and your only option is a full CI/CD cycle — is the moment you needed prompt management yesterday. It's not about frequency; it's about blast radius when iteration is slow.

Iteration loop (↺ retry):

  1. Identify Problem: failing case / user report
  2. Add Test Case: to golden suite
  3. Modify Prompt: on a branch
  4. Run Eval: temperature=0
  5. Compare & Deploy: if better, ship it
  6. Monitor: latency + accuracy

The prompt iteration loop — starts with a failing case, ends with monitoring
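Steps 2 through 5 of the loop can be sketched in a few lines, under several assumptions: `GoldenCase`, `pass_rate`, `should_ship`, and the substring-based grader are hypothetical names invented for illustration, and `run_model` stands in for whatever client calls your model at temperature=0. The strict "candidate beats active" comparison is a simplification for a small golden suite; the Quick Reference recommends a significance test for real A/B decisions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One entry in the golden suite (hypothetical shape)."""
    input_text: str
    expected_substring: str  # simplistic grader; real suites use richer checks


def pass_rate(prompt: str, cases: list[GoldenCase],
              run_model: Callable[[str, str], str]) -> float:
    """Fraction of golden cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("golden suite is empty")
    passed = sum(
        1 for case in cases
        if case.expected_substring in run_model(prompt, case.input_text)
    )
    return passed / len(cases)


def should_ship(active_prompt: str, candidate_prompt: str,
                cases: list[GoldenCase],
                run_model: Callable[[str, str], str]) -> bool:
    """Ship only if the candidate strictly beats the active prompt on the same suite."""
    return (pass_rate(candidate_prompt, cases, run_model)
            > pass_rate(active_prompt, cases, run_model))


def fake_model(prompt: str, text: str) -> str:
    """Deterministic stand-in for a real model call made at temperature=0."""
    return text.upper() if "uppercase" in prompt else text


suite = [GoldenCase("hello", "HELLO"), GoldenCase("ok", "OK")]
print(should_ship("Echo the input.", "Reply in uppercase.", suite, fake_model))
```

Running both prompts against the same suite with a deterministic model call is what makes the comparison meaningful: any difference in pass rate comes from the prompt text rather than sampling noise.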