Testing Agents in CI
The complete testing strategy for LLM-powered agents: when to invest in each layer, tool unit tests, node tests, graph integration tests, LangSmith eval pipelines, regression detection for schema drift and routing shifts, and a CI pipeline that gates on quality.
Quick Reference
- Unit test tool functions directly: mock external APIs with patch(), call tool.ainvoke(), and assert on output shape and error propagation
- Test graph nodes independently via graph.nodes['node_name'].invoke(state); no full agent invocation is needed
- Integration test the full graph with FakeMessagesListChatModel + MemorySaver: deterministic, free, no API keys
- Use langsmith.evaluate() with a golden dataset on PRs; fail CI if the average score drops below a threshold
- Detect schema drift by asserting on tool.args_schema.model_json_schema() in tests, which catches silent failures before they reach prod
- Detect cost regression by counting ToolMessages in result['messages'] and asserting the count is <= an expected maximum
- Unit and integration tests need zero API keys; if a test requires ANTHROPIC_API_KEY, you are not mocking correctly
- Run eval tests only on pull requests, not on every push, to control cost without sacrificing quality gates
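As a rough illustration of the first point, here is a minimal sketch of a mocked tool test that needs no network and no API keys. It uses only the standard library; `WeatherClient`, `client`, and `get_weather` are hypothetical stand-ins, and the tool is a plain async function rather than a LangChain tool. With a real tool object you would presumably call `tool.ainvoke({...})` instead of the function itself.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class WeatherClient:
    """Hypothetical external API client; a real one would make an HTTP call."""

    async def get(self, city: str) -> dict:
        raise RuntimeError("network access is not allowed in unit tests")


client = WeatherClient()


async def get_weather(city: str) -> dict:
    """Hypothetical tool function: calls the API and normalizes the response."""
    raw = await client.get(city)
    return {"city": city, "temp_c": float(raw["temperature"])}


def test_output_shape():
    # Replace the network call with a canned response.
    fake_get = AsyncMock(return_value={"temperature": "21.5"})
    with patch.object(client, "get", new=fake_get):
        result = asyncio.run(get_weather("Oslo"))
    fake_get.assert_awaited_once_with("Oslo")
    assert result == {"city": "Oslo", "temp_c": 21.5}


def test_error_propagates():
    # The tool should surface upstream failures instead of swallowing them.
    fake_get = AsyncMock(side_effect=TimeoutError("upstream timed out"))
    with patch.object(client, "get", new=fake_get):
        try:
            asyncio.run(get_weather("Oslo"))
        except TimeoutError:
            pass  # expected: the error reached the caller
        else:
            raise AssertionError("tool swallowed the upstream error")
```

Because the only external dependency is patched, both tests are deterministic and run in any CI job without secrets.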
When NOT to Build an Agent Test Suite
The most common mistake is investing in eval infrastructure for an agent that changes every week. A golden dataset collected against last week's prompt is worthless this week. Build tests incrementally as each layer stabilizes.
Build your test suite incrementally — each level earns its cost before adding the next
| Agent Maturity | What to Build | What to Skip |
|---|---|---|
| Prototype (< 1 week old) | Tool unit tests for any stable tool functions | Eval datasets, routing tests, CI stages |
| Early production (real users, changing prompt) | Tool + integration tests; occasional manual eval | Automated eval pipelines (prompt changes too fast) |
| Stable production (prompt frozen ≥ 2 weeks) | Full suite: unit + integration + LangSmith evals + regression gates | Nothing — this is when the full pipeline pays off |
| Post-regression (quality dropped after model update) | Regression tests for the specific failure; add to CI | Rewriting the whole suite — fix the specific gap first |
The investment decision is about signal stability. A unit test for a tool function is worth it from day one — tool logic is deterministic and doesn't change with prompt iterations. An eval dataset is worth it once your prompt has been stable for at least two weeks, because dataset examples become stale the moment the task definition shifts.
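Once the prompt is stable enough for evals to pay off, the quality gate itself can be very small. The sketch below assumes the eval run has already produced a list of per-example scores between 0 and 1, and that CI runs on GitHub Actions, which sets `GITHUB_EVENT_NAME` to `pull_request` for PR builds. The function names and the threshold are illustrative, not part of any library.

```python
from collections.abc import Mapping
from statistics import mean


def should_run_evals(env: Mapping[str, str]) -> bool:
    """Run evals only for pull requests.

    Assumes GitHub Actions, which exposes the trigger in GITHUB_EVENT_NAME.
    """
    return env.get("GITHUB_EVENT_NAME") == "pull_request"


def gate(scores: list[float], threshold: float) -> tuple[bool, float]:
    """Return (passed, average) for a finished eval run."""
    if not scores:
        # An empty run should never count as a pass.
        raise ValueError("no eval scores: refusing to pass an empty run")
    avg = mean(scores)
    return avg >= threshold, avg
```

A CI step could call `should_run_evals(os.environ)` first, then exit non-zero when `gate(scores, threshold)` reports a failure. Raising on an empty score list is a deliberate choice here: a misconfigured dataset should break the build rather than silently pass it.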