Advanced · 18 min

Testing Agents in CI

The complete testing strategy for LLM-powered agents: when to invest in each layer, tool unit tests, node tests, graph integration tests, LangSmith eval pipelines, regression detection for schema drift and routing shifts, and a CI pipeline that gates on quality.

Quick Reference

  • Unit test tool functions directly: mock external APIs with patch(), call tool.ainvoke(), assert output shape and error propagation
  • Test graph nodes independently via graph.nodes['node_name'].invoke(state) — no full agent invocation needed
  • Integration test the full graph with FakeMessagesListChatModel + MemorySaver: deterministic, free, no API keys
  • Use langsmith.evaluate() with a golden dataset on PRs; fail CI if avg score drops below threshold
  • Detect schema drift by asserting that tool.args_schema.model_json_schema() matches the expected schema in tests; a silently changed tool signature is caught before it reaches prod
  • Detect cost regression by counting ToolMessages in result['messages'] and asserting len <= expected
  • Unit and integration tests need zero API keys; if a test requires ANTHROPIC_API_KEY, you are not mocking correctly
  • Run eval tests only on pull requests — not every push — to control cost without sacrificing quality gates
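
As a minimal sketch of the mocking pattern in the first bullet, the example below uses only the standard library. `WeatherClient`, `get_weather`, and `WeatherAPIError` are hypothetical names, and `get_weather` is a plain coroutine standing in for a tool; with a real LangChain tool, the call under test would be `tool.ainvoke(...)` rather than a direct call.

```python
import asyncio
from unittest.mock import patch


class WeatherAPIError(Exception):
    """Raised when the (hypothetical) upstream weather API is unreachable."""


class WeatherClient:
    """Hypothetical external client; a real one would hit the network."""

    def fetch(self, city: str) -> dict:
        raise RuntimeError("network access is not allowed in unit tests")


client = WeatherClient()


async def get_weather(city: str) -> dict:
    """Stand-in for a tool function: calls the client and shapes the output."""
    try:
        raw = client.fetch(city)
    except ConnectionError as exc:
        # Translate transport errors into a domain error the agent can handle.
        raise WeatherAPIError(f"weather lookup failed for {city}") from exc
    return {"city": city, "temp_c": raw["temp_c"]}


def test_output_shape() -> dict:
    # Patch the external call so the test is deterministic and needs no API key.
    fake_payload = {"temp_c": 21.5, "unused_field": "ignored"}
    with patch.object(WeatherClient, "fetch", return_value=fake_payload):
        result = asyncio.run(get_weather("Oslo"))
    assert set(result) == {"city", "temp_c"}
    return result


def test_error_propagation() -> bool:
    # Simulate an outage and check that the domain error surfaces to the caller.
    with patch.object(WeatherClient, "fetch", side_effect=ConnectionError("down")):
        try:
            asyncio.run(get_weather("Oslo"))
        except WeatherAPIError:
            return True
    return False
```

Neither test touches the network or reads an API key, which is the property the list above asks of every unit test.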

When NOT to Build an Agent Test Suite

Don't build a test suite before you have something worth testing

The most common mistake is investing in eval infrastructure for an agent that changes every week. A golden dataset collected against last week's prompt is worthless this week. Build tests incrementally as each layer stabilizes.

  • Do you have tools in your agent? No: smoke tests only (basic invoke + assert response). Yes: continue.
  • Multi-node graph with routing? No: tool unit tests, done (test each tool independently). Yes: continue.
  • Live in production? No: + integration tests, done (full graph with mocked LLM). Yes: continue.
  • Recurring quality regressions? No: + eval CI gate, done (LangSmith evaluate() on PRs). Yes: full regression suite (tool · node · graph · eval · regression).

Build your test suite incrementally — each level earns its cost before adding the next
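
One way to read the decision flow above is as a function from four yes/no answers to a list of test layers. The sketch below is an interpretation of the diagram, not part of any library; the function name and layer labels are hypothetical.

```python
def recommended_test_layers(
    has_tools: bool,
    multi_node_routing: bool,
    in_production: bool,
    recurring_regressions: bool,
) -> list[str]:
    """Walk the decision flow top to bottom, stopping at the first 'No'."""
    if not has_tools:
        return ["smoke"]  # basic invoke + assert response

    layers = ["tool_unit"]  # test each tool independently
    if not multi_node_routing:
        return layers

    layers.append("graph_integration")  # full graph with a mocked LLM
    if not in_production:
        return layers

    layers.append("eval_ci_gate")  # LangSmith evaluate() on PRs
    if not recurring_regressions:
        return layers

    # Recurring regressions justify the full suite.
    return ["tool_unit", "node", "graph_integration", "eval_ci_gate", "regression"]
```

For example, an agent with tools and routing that is not yet live gets `["tool_unit", "graph_integration"]`, and nothing more until it reaches production.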

Agent Maturity | What to Build | What to Skip
Prototype (< 1 week old) | Tool unit tests for any stable tool functions | Eval datasets, routing tests, CI stages
Early production (real users, changing prompt) | Tool + integration tests; occasional manual eval | Automated eval pipelines (prompt changes too fast)
Stable production (prompt frozen ≥ 2 weeks) | Full suite: unit + integration + LangSmith evals + regression gates | Nothing: this is when the full pipeline pays off
Post-regression (quality dropped after model update) | Regression tests for the specific failure; add to CI | Rewriting the whole suite; fix the specific gap first

The investment decision is about signal stability. A unit test for a tool function is worth it from day one — tool logic is deterministic and doesn't change with prompt iterations. An eval dataset is worth it once your prompt has been stable for at least two weeks, because dataset examples become stale the moment the task definition shifts.
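
The stability rule in the paragraph above can be written down as a small check. This is a sketch under stated assumptions: the function name and layer labels are hypothetical, and the 14-day default simply encodes "at least two weeks".

```python
from datetime import date


def layers_worth_building(
    prompt_last_changed: date,
    today: date,
    min_stable_days: int = 14,  # "at least two weeks" of prompt stability
) -> list[str]:
    """Tool unit tests always pay off; eval datasets only once the prompt is stable."""
    layers = ["tool_unit_tests"]  # deterministic, unaffected by prompt iteration
    if (today - prompt_last_changed).days >= min_stable_days:
        layers.append("eval_dataset")
    return layers
```

A prompt edited nine days ago yields only `["tool_unit_tests"]`; once fourteen days pass without a change, `"eval_dataset"` is added.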