Prompt Versioning & A/B Testing
A decision-first guide to managing prompts in production: when to build a registry, how to choose between LangSmith, Langfuse, and LaunchDarkly, how to gate promotions with an eval suite, and how to run statistically rigorous A/B tests instead of guessing.
Quick Reference
- Skip the registry if you have one agent, one prompt, and monthly changes: inline strings plus git are enough
- Langfuse (open-source, self-hostable), LangSmith (closed-source, LangChain-native), and LaunchDarkly AI Configs (feature-flag-native) each fit a different team profile
- Every promoted prompt must pass an eval gate: 20–50 golden test cases, an automated check, and CI that blocks the merge on regression
- A/B test sample size depends on your baseline rate and minimum detectable effect; calculate it before starting, not after
- Use hash(user_id) % 100 with a stable, unsalted hash function for consistent assignment: the same user must always see the same variant
- Model drift is real: run your eval suite weekly even when you haven't changed the prompt, because provider-side model updates can change behavior silently
- Automated rollback requires a monitoring job watching completion rate and cost, not a human watching a dashboard
- Tag every LangSmith trace with prompt_version from day 1; retrofitting this later is painful
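The assignment and sample-size bullets above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names `assign_bucket`, `assign_variant`, and `sample_size_per_arm`, the experiment-name salt, and the choice of the standard two-proportion normal approximation are all introduced here for illustration. SHA-256 stands in for `hash()` because Python's built-in `hash()` is randomized per process for strings, so it would not keep a user in the same variant across restarts.

```python
import hashlib
import math
from statistics import NormalDist


def assign_bucket(user_id: str, experiment: str = "prompt-ab") -> int:
    """Map a user to a stable bucket in [0, 100).

    The experiment name (a hypothetical salt) keeps bucket assignments
    independent across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def assign_variant(user_id: str, treatment_pct: int = 50,
                   experiment: str = "prompt-ab") -> str:
    """Send the first `treatment_pct` buckets to the new prompt."""
    if assign_bucket(user_id, experiment) < treatment_pct:
        return "treatment"
    return "control"


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per variant to detect an absolute lift of `mde`.

    Assumes a two-sided test on two proportions (normal approximation).
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

With a 70% baseline completion rate and a 5-point absolute minimum detectable effect, `sample_size_per_arm(0.70, 0.05)` comes out to roughly 1,250 users per variant; halving the detectable effect roughly quadruples that number, which is why the calculation belongs before the test starts.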
Should I Manage Prompts Separately at All?
The overhead of a prompt registry — deployment pipeline, environment promotion, eval gate, monitoring — is real. Before building it, answer three questions: How often does the prompt change? Who needs to change it? What happens if a bad change reaches production?
| Signal | Appropriate strategy |
|---|---|
| One agent, one system prompt, changes monthly | Inline string in code. Version controlled with the code. A deploy is fine. |
| Multiple prompts, changes weekly, all by engineers | YAML/JSON files in the repo, loaded at startup. Still requires a deploy but separates concerns. |
| Non-engineers (PM, ops) need to edit prompts without a deploy | External registry: LangSmith, Langfuse, or LaunchDarkly AI Configs. Hot-swap on next request. |
| Multiple agents, separate environments, canary rollout required | External registry with environment tags (dev/staging/prod) and an eval gate in CI. |
| Regulated environment, full audit trail required | External registry with immutable commits, promotion approvals, and event webhooks for compliance. |
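As a rough sketch of the second row (prompt files in the repo, loaded at startup), the loader below reads one JSON file per prompt. The file layout, the `name`/`version`/`template` keys, and the `Prompt` and `load_prompts` names are assumptions made for this example, not a format required by any tool named in this guide. JSON is used so the sketch needs only the standard library; YAML would work the same way with a YAML parser installed.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Prompt:
    name: str
    version: str
    template: str


def load_prompts(directory: str) -> dict[str, Prompt]:
    """Load every *.json prompt file in `directory`, keyed by prompt name.

    Each file is assumed to look like:
    {"name": "support", "version": "3", "template": "You are ... {product}"}
    """
    prompts: dict[str, Prompt] = {}
    for path in sorted(Path(directory).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        prompt = Prompt(
            name=data["name"],
            version=data["version"],
            template=data["template"],
        )
        # Fail fast at startup rather than silently overwriting a prompt.
        if prompt.name in prompts:
            raise ValueError(f"duplicate prompt name: {prompt.name}")
        prompts[prompt.name] = prompt
    return prompts
```

Calling `load_prompts("prompts/")` once at startup gives the application an in-memory dictionary, and every prompt change still goes through code review and a deploy, which is the trade-off that row describes.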
An external registry lets non-engineers update prompts without a deploy — which means without code review, without CI, and without an eval gate unless you build one explicitly. The freedom to change quickly is also the freedom to break quickly. Design your access controls and eval gates before you give anyone a 'push to prod' button.
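One way to make the eval gate explicit is a small comparison between the candidate prompt and the one currently in production. The sketch below is illustrative only: `pass_rate`, `eval_gate`, and the `generate` callables are hypothetical names, and a real gate would call your model with each prompt version instead of the plain functions assumed here.

```python
from typing import Callable

# A golden case pairs an input with a check on the model's output.
GoldenCase = tuple[str, Callable[[str], bool]]


def pass_rate(generate: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases whose output passes its check."""
    if not cases:
        raise ValueError("the golden set must not be empty")
    passed = sum(1 for text, check in cases if check(generate(text)))
    return passed / len(cases)


def eval_gate(candidate: Callable[[str], str],
              baseline: Callable[[str], str],
              cases: list[GoldenCase],
              tolerance: float = 0.0) -> bool:
    """Return True only if the candidate does not regress against the baseline.

    `tolerance` is an assumed knob for allowing a small drop in pass rate.
    """
    return pass_rate(candidate, cases) + tolerance >= pass_rate(baseline, cases)
```

In CI, the promotion step would exit non-zero when `eval_gate(...)` returns `False`, so a regression blocks the change whether it arrives through a pull request or through a registry's promotion hook.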