Model Management

Once you run more than one model in production, you need a registry, a promotion pipeline, and automatic rollback — or you will lose track of what is serving traffic and why quality changed. This article builds those three things from scratch, with cost math and threshold derivation included.

Quick Reference

→You need model management when: two or more models serve production, or prompt + model change independently
→Pipeline: STAGING → SHADOW → CANARY (1–25%) → ACTIVE → DEPRECATED
→Total spend during shadow = (requests/day × avg_cost_per_call) × 2, assuming all traffic is mirrored to a candidate with a similar per-call cost; budget it before starting
→A/B experiments on LLMs need 10K+ sessions per arm; non-determinism inflates variance
→Set rollback thresholds relative to your baseline, not from a table of magic numbers
→Version bundle = model + prompt hash + config — never change two variables in the same bundle
→Rollback must complete in under 60 seconds — automate it, never rely on manual action
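
The shadow-cost and relative-threshold items above are plain arithmetic, so a small sketch may help when budgeting. The function names, the `mirror_fraction` parameter, and the sample numbers are illustrative assumptions, not part of any library. The sketch also assumes the candidate costs roughly the same per call as the incumbent.

```python
def shadow_daily_spend(requests_per_day: int, avg_cost_per_call: float,
                       mirror_fraction: float = 1.0) -> float:
    """Total daily inference spend while a shadow model runs.

    Live traffic is paid for once; the mirrored fraction is paid for again.
    With mirror_fraction=1.0 this reduces to (requests/day x cost) x 2.
    Assumption: the shadow model has a similar per-call cost.
    """
    baseline = requests_per_day * avg_cost_per_call
    return baseline * (1.0 + mirror_fraction)


def breaches_relative_threshold(baseline: float, observed: float,
                                max_relative_drop: float) -> bool:
    """True when a higher-is-better metric fell more than the allowed
    fraction below its registered baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - observed) / baseline > max_relative_drop


if __name__ == "__main__":
    # Hypothetical numbers: 50,000 requests/day at $0.004 per call.
    print(f"daily spend during shadow: ${shadow_daily_spend(50_000, 0.004):.2f}")
    # Hypothetical quality score: baseline 0.90, 5% relative drop allowed.
    print(breaches_relative_threshold(0.90, 0.85, 0.05))
```

Expressing the rollback trigger as a fraction of the baseline means the same rule can be reused across bundles whose absolute scores differ.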

When You Need Model Management (and When You Don't)

If you run one model with one prompt and you change them together whenever you feel like it, you do not need a model registry. You need one when you can no longer answer these questions from memory: which model is serving traffic right now, what prompt version it is using, and what baseline metrics it was registered with. That threshold is usually two models in production — one for cheap/fast tasks, one for quality-critical tasks — or the first time your prompt team and your model team start shipping changes independently.

Situation	What you need
Single model, prompt changes reviewed as code	Version control is enough — no registry
Two models (e.g., cheap vs. quality tier)	Registry + bundle versioning
Prompt team and model team ship independently	Registry + promotion pipeline
More than 5% of user complaints are 'it used to work'	Registry + automatic rollback
Multiple providers (OpenAI + Anthropic + self-hosted)	Registry + provider failover callout

Start with version bundles, add the registry later

You can get most of the reproducibility benefit by just naming your bundles (e.g., 'support-agent-v12') and storing the model ID, prompt hash, and config together — before you build any registry infrastructure. The registry is just a queryable store on top of that.
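
A named bundle can be as small as one immutable record. The sketch below is one possible shape, not a prescribed schema: the `Bundle` class, the helper names, the 12-character hash prefix, and the model ID `model-a` are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List


def prompt_hash(prompt_text: str) -> str:
    """Short, stable fingerprint of the prompt text (SHA-256, first 12 hex chars)."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class Bundle:
    name: str          # e.g. "support-agent-v12"
    model_id: str      # provider model identifier
    prompt_hash: str   # fingerprint of the prompt text
    config_json: str   # canonical JSON so key order does not matter


def make_bundle(name: str, model_id: str, prompt_text: str, config: Dict) -> Bundle:
    """Store model ID, prompt hash, and config together under one name."""
    return Bundle(name, model_id, prompt_hash(prompt_text),
                  json.dumps(config, sort_keys=True))


def changed_fields(old: Bundle, new: Bundle) -> List[str]:
    """List which of the three variables differ between two bundles."""
    fields = ("model_id", "prompt_hash", "config_json")
    return [f for f in fields if getattr(old, f) != getattr(new, f)]


if __name__ == "__main__":
    v12 = make_bundle("support-agent-v12", "model-a",
                      "You are a support agent.", {"temperature": 0.2})
    v13 = make_bundle("support-agent-v13", "model-a",
                      "You are a concise support agent.", {"temperature": 0.2})
    diff = changed_fields(v12, v13)
    # More than one changed variable makes a regression hard to attribute.
    print(diff, "ok" if len(diff) <= 1 else "too many changes")
```

A check like `changed_fields` can run in CI to reject a new bundle that changes the model and the prompt at once.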

The Deployment Pipeline: Shadow → Canary → GA

Every new model bundle moves through four stages on its way to production: staging, shadow, canary, and GA (the ACTIVE state in the registry). Skipping stages is how teams end up with a production incident at 2 a.m. Shadow carries zero user risk. Canary carries controlled risk. GA is full commitment. The registry tracks which stage each bundle is in and enforces the transitions.
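
Enforcing transitions can be as simple as a lookup table. The sketch below uses the stage names from the quick reference. The exact set of allowed moves is an assumption, in particular that a candidate can be retired to DEPRECATED from any stage. `Stage`, `ALLOWED`, and `promote` are hypothetical names.

```python
from enum import Enum


class Stage(Enum):
    STAGING = "staging"
    SHADOW = "shadow"
    CANARY = "canary"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


# Assumed policy: forward one stage at a time, or retire the bundle.
ALLOWED = {
    Stage.STAGING: {Stage.SHADOW, Stage.DEPRECATED},
    Stage.SHADOW: {Stage.CANARY, Stage.DEPRECATED},
    Stage.CANARY: {Stage.ACTIVE, Stage.DEPRECATED},
    Stage.ACTIVE: {Stage.DEPRECATED},
    Stage.DEPRECATED: set(),
}


def promote(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the move skips a stage."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"illegal transition {current.name} -> {target.name}"
        )
    return target


if __name__ == "__main__":
    stage = Stage.STAGING
    for nxt in (Stage.SHADOW, Stage.CANARY, Stage.ACTIVE):
        stage = promote(stage, nxt)
    print(stage.name)
    try:
        promote(Stage.STAGING, Stage.ACTIVE)  # skips shadow and canary
    except ValueError as err:
        print(err)
```

Keeping the policy in one table makes it easy to review, and any attempt to skip shadow or canary fails loudly instead of silently reaching production.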

Building the Model Registry

Model registry with versioned bundles
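
As a rough illustration of the idea, and not the article's own implementation, a registry can start as an in-memory store that answers the three questions from earlier: which bundle is serving, which prompt version it uses, and what baseline it was registered with. Every name below (`Record`, `InMemoryRegistry`, the metric keys, the model ID) is a hypothetical choice for this sketch. A real deployment would back this with a database.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    name: str                          # bundle name, e.g. "support-agent-v12"
    model_id: str
    prompt_hash: str
    baseline_metrics: Dict[str, float]  # metrics captured at registration


class InMemoryRegistry:
    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}
        self._active_history: List[str] = []  # most recent activation last

    def register(self, record: Record) -> None:
        """Add a bundle; names are never reused, so bundles stay immutable."""
        if record.name in self._records:
            raise ValueError(f"bundle {record.name!r} is already registered")
        self._records[record.name] = record

    def activate(self, name: str) -> None:
        """Mark a registered bundle as the one serving traffic."""
        if name not in self._records:
            raise KeyError(name)
        self._active_history.append(name)

    def active(self) -> Optional[Record]:
        """Return the bundle serving traffic, or None if nothing is active."""
        if not self._active_history:
            return None
        return self._records[self._active_history[-1]]

    def rollback(self) -> Record:
        """Drop the current activation and return the previous bundle."""
        if len(self._active_history) < 2:
            raise RuntimeError("no previous bundle to roll back to")
        self._active_history.pop()
        return self._records[self._active_history[-1]]


if __name__ == "__main__":
    reg = InMemoryRegistry()
    reg.register(Record("support-agent-v12", "model-a", "ab12cd34ef56", {"quality": 0.90}))
    reg.register(Record("support-agent-v13", "model-a", "0f9e8d7c6b5a", {"quality": 0.91}))
    reg.activate("support-agent-v12")
    reg.activate("support-agent-v13")
    print(reg.rollback().name)
```

Because rollback only pops an entry from the activation history, it needs no human decision about which bundle to restore, which suits an automated trigger.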
