Advanced · 18 min

Migration & Graph Versioning

Most teams don't need graph versioning — this article starts there. For those who do: how to version state schemas, write safe migration functions with error handling and testing, ship blue-green agent deployments, monitor migrations in production, and recover when they go wrong.

Quick Reference

  • Most agents don't need graph versioning — short-lived threads (minutes to hours) can just redeploy with no migration
  • Graph versioning is only necessary when threads span multiple deployments and carry state that will break under schema changes
  • The best migration is the one you never write: use Optional fields with defaults to avoid breaking changes entirely
  • Lazy migration transforms checkpoints on read — not in bulk — avoiding downtime and spreading cost over time
  • Always include a schema_version field; without it you cannot determine which migrations to apply
  • Test migrations in CI against real production checkpoint snapshots — synthetic data misses the edge cases that crash production
  • Monitor migration success/failure rate on every read — a migrating checkpointer that silently corrupts state is worse than no migration
  • Never delete an old graph version while active threads still reference it — check active thread count first, every time
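As a rough illustration of how the schema_version and lazy-migration points fit together, here is a minimal sketch in plain Python. It assumes checkpoints are plain dicts; the field names (msgs, messages, summary), the version numbers, and the helper names are all hypothetical, and no particular checkpointer API is implied.

```python
from typing import Any, Callable

CURRENT_VERSION = 3  # hypothetical: the schema version the running code expects


def _v1_to_v2(state: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical breaking change: "msgs" was renamed to "messages".
    new = dict(state)
    new["messages"] = new.pop("msgs", [])
    return new


def _v2_to_v3(state: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical additive change: a new optional field with a default.
    new = dict(state)
    new.setdefault("summary", None)
    return new


# Each entry upgrades a checkpoint from version N to version N + 1.
MIGRATIONS: dict[int, Callable[[dict[str, Any]], dict[str, Any]]] = {
    1: _v1_to_v2,
    2: _v2_to_v3,
}


def migrate_on_read(state: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of one checkpoint upgraded to CURRENT_VERSION."""
    if "schema_version" not in state:
        # Without a version there is no way to know which steps to apply.
        raise ValueError("checkpoint has no schema_version")
    version = state["schema_version"]
    if version > CURRENT_VERSION:
        raise ValueError(f"checkpoint version {version} is newer than the code")

    migrated = dict(state)  # shallow copy; the caller's dict is left untouched
    while version < CURRENT_VERSION:
        step = MIGRATIONS.get(version)
        if step is None:
            raise ValueError(f"no migration registered from version {version}")
        migrated = step(migrated)
        version += 1
        migrated["schema_version"] = version
    return migrated
```

Calling migrate_on_read only when a thread is loaded means untouched checkpoints cost nothing, and a checkpoint already at the current version passes through unchanged.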

When You Need Graph Versioning (and When You Don't)

Most teams don't need this

If your agent threads are short-lived — minutes to a few hours — you don't need graph versioning. Deploy a new version and active threads finish on the old container before it spins down. Long-lived threads are where versioning earns its complexity.

The decision hinges on thread lifetime relative to your deploy frequency. A customer support bot handles turns that complete in seconds, so you can deploy whenever you want. A research agent that runs for hours across user sessions needs versioning from day one. Before building any migration infrastructure, work out which of three outcomes applies to you: no migration, an additive change, or a required migration.

  • Are threads short-lived? Yes: just redeploy (no migration).
  • If not, are you adding optional fields or new nodes? Yes: backward-compatible additive change, no migration.
  • If not, is it a rename, removal, or type change? Yes: migration required, and test it.
  • Otherwise: make it Optional with a default. This is the preferred path, since no migration is needed.

Is my change breaking? → choose the path with least migration work
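The decision flow above can be sketched as a small function. This is an illustration only: the Change and Path enums and the choose_path name are hypothetical, and ADD_REQUIRED_FIELD is an assumed example of a change that falls through to the last branch.

```python
from enum import Enum


class Change(Enum):
    ADD_OPTIONAL_FIELD = "add optional field"
    ADD_NODE = "add node"
    RENAME_FIELD = "rename field"
    REMOVE_FIELD = "remove field"
    CHANGE_TYPE = "change type"
    ADD_REQUIRED_FIELD = "add required field"  # assumed "anything else" case


class Path(Enum):
    JUST_REDEPLOY = "just redeploy"
    BACKWARD_COMPATIBLE = "backward-compatible, no migration"
    MIGRATION_REQUIRED = "migration required, test it"
    MAKE_OPTIONAL = "make it Optional with a default"


ADDITIVE = {Change.ADD_OPTIONAL_FIELD, Change.ADD_NODE}
BREAKING = {Change.RENAME_FIELD, Change.REMOVE_FIELD, Change.CHANGE_TYPE}


def choose_path(threads_short_lived: bool, change: Change) -> Path:
    """Walk the decision flow in order and return the first matching path."""
    if threads_short_lived:
        return Path.JUST_REDEPLOY
    if change in ADDITIVE:
        return Path.BACKWARD_COMPATIBLE
    if change in BREAKING:
        return Path.MIGRATION_REQUIRED
    # Everything else: reshape the change so it becomes additive.
    return Path.MAKE_OPTIONAL
```

The order of the checks matters: thread lifetime is tested first, so a team with short-lived threads never reaches the migration branches at all.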

| Deployment Pattern | Thread Lifetime | Versioning Strategy | Migration Needed? |
|---|---|---|---|
| Stateless API wrapper | Single request | Just redeploy | Never |
| Chat assistant | Minutes to hours | Blue-green with short drain window | Rarely (additive changes only) |
| Task/research agent | Hours to days | Blue-green + lazy migration for schema changes | Sometimes |
| Long-running workflow | Days to weeks | Full versioning pipeline required | Yes (always plan for it) |
Start with the simplest strategy that fits your thread lifetime

You can always upgrade from 'just redeploy' to 'lazy migration' as your threads grow longer. You cannot easily undo a corrupted checkpoint. Match your versioning infrastructure to your actual thread lifetime, not to a theoretical worst case.
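One way to make that matching explicit is to encode the table as a lookup keyed on expected thread lifetime. The sketch below is illustrative only: the table gives overlapping ranges ("minutes to hours", "hours to days"), so the cutoffs used here (one minute, six hours, three days) are assumptions, and the function name is hypothetical.

```python
from datetime import timedelta

# (upper bound on expected thread lifetime, strategy from the table).
# The cutoffs are assumed for illustration; tune them to your own traffic.
STRATEGY_BY_LIFETIME = [
    (timedelta(minutes=1), "Just redeploy"),
    (timedelta(hours=6), "Blue-green with short drain window"),
    (timedelta(days=3), "Blue-green + lazy migration for schema changes"),
]
LONGEST_LIVED_STRATEGY = "Full versioning pipeline"


def versioning_strategy(expected_lifetime: timedelta) -> str:
    """Return the simplest strategy whose lifetime bound covers the thread."""
    for upper_bound, strategy in STRATEGY_BY_LIFETIME:
        if expected_lifetime <= upper_bound:
            return strategy
    return LONGEST_LIVED_STRATEGY
```

Feeding this a measured figure, such as a high percentile of observed thread lifetimes, keeps the choice tied to actual behavior instead of a theoretical worst case.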