Prompt Management
Decide whether to build or buy a prompt management system, version prompts without blowing your cache budget, deploy changes through eval-gated CI/CD, A/B test with statistical rigor, and monitor for prompt drift before users notice.
Quick Reference
- Build vs. buy: most teams should start with a platform (LangSmith Hub, Promptfoo, Braintrust); build custom only for hard compliance requirements
- Prompt versioning: every version needs a content hash, author, change description, linked eval results, and a status lifecycle (draft → testing → active → deprecated)
- Cache invalidation cost: changing a prompt invalidates whatever was cached under Anthropic's 5-minute prompt cache; at 10K req/hr on Sonnet 4.6, each prompt change costs ~$16 in cache misses during the refill window
- Eval-gated CI/CD: run a 50-case smoke eval on every prompt PR; block the merge if quality falls below baseline
- A/B testing: use a two-proportion z-test (p < 0.05), not a fixed improvement threshold; naive thresholds produce false positives ~30% of the time at typical sample sizes
- Template variables need max_length limits and type validation; an unvalidated template variable is a prompt injection surface
- Prompt drift: re-run your full eval suite weekly against the active prompt; prompts degrade as user patterns shift even when the text doesn't change
- Rollback speed: the rollback path must be faster than the deployment path, which means under 10 seconds for registry-backed prompts
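To make the A/B testing point concrete, here is a minimal sketch of a two-sided, two-proportion z-test using only the Python standard library. The function name, the pooled-variance formulation, and the sample counts in the usage note are illustrative assumptions, not the output of any particular platform.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value). Uses the pooled standard error, which is the
    usual choice under the null hypothesis that both rates are equal.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both sample sizes must be positive")

    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    if std_err == 0:
        # Every trial succeeded or every trial failed: no evidence of a difference.
        return 0.0, 1.0

    z = (p_b - p_a) / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def is_significant(success_a, n_a, success_b, n_b, alpha=0.05):
    """True when the observed difference clears the chosen significance level."""
    _, p_value = two_proportion_z_test(success_a, n_a, success_b, n_b)
    return p_value < alpha
```

With hypothetical counts of 400/500 passes for the control prompt and 420/500 for the variant, a four-point lift does not reach p < 0.05, so a fixed "ship if it is 3 points better" rule would have promoted the variant on noise. The same rates at 2,500 samples per arm do reach significance.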
When You Need Prompt Management (and When You Don't)
Not every agent needs a prompt management system. A string constant in your codebase is fine when you have one developer, one prompt, and no more than a few hundred daily users. The question is whether prompt iteration speed is bottlenecked by your deployment cycle, and whether a prompt change gone wrong can cause real damage before you catch it.
| Scenario | Prompt Count | Change Frequency | Recommendation |
|---|---|---|---|
| Solo dev, prototype, internal tool | 1–2 | Infrequent | Hardcode in source — don't over-engineer |
| Small team, single production agent | 3–8 | Weekly | Config file or env vars — simple, no infrastructure |
| Multi-prompt production agent, multiple devs | 8–20 | Daily | Managed platform (LangSmith Hub, Braintrust, PromptLayer) |
| Multi-team, shared prompts, compliance requirements | 20+ | Continuous | Custom registry — you need audit trails the platforms may not provide |
The moment you need to change a prompt because of a live production issue — and your only option is a full CI/CD cycle — is the moment you needed prompt management yesterday. It's not about frequency; it's about blast radius when iteration is slow.
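For the "config file or env vars" tier in the table, the following is a rough sketch of how a prompt might be resolved at call time so that a fix does not have to wait for a full deploy. The `load_prompt` helper, the `PROMPT_<NAME>` environment variable convention, and the JSON file layout are all hypothetical choices made for illustration.

```python
import json
import os
from pathlib import Path


def load_prompt(name, default, config_path=None, env=None):
    """Resolve a prompt by name: env var first, then JSON config file, then default.

    Hypothetical conventions assumed here:
      - env var PROMPT_<NAME> overrides everything
      - the config file is a flat JSON object mapping prompt names to text
    """
    env = os.environ if env is None else env

    # 1. Environment variable override (changing it usually needs a process restart).
    override = env.get(f"PROMPT_{name.upper()}")
    if override:
        return override

    # 2. Config file, re-read on every call so an edit takes effect without a deploy.
    if config_path is not None:
        path = Path(config_path)
        if path.is_file():
            try:
                prompts = json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError:
                prompts = {}  # a broken file must not take the agent down
            value = prompts.get(name) if isinstance(prompts, dict) else None
            if isinstance(value, str) and value.strip():
                return value

    # 3. Hardcoded fallback that ships with the code.
    return default
```

Falling back to the hardcoded default on a missing or malformed file keeps the blast radius of a bad config edit small. Note that this sketch gives you faster iteration only; it has no version history, eval gating, or audit trail, which is where the managed-platform and custom-registry tiers come in.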
Figure: The prompt iteration loop, which starts with a failing case and ends with monitoring.