Testing Agents in CI

A testing strategy for LLM-powered agents: when to invest in each layer, tool unit tests, node tests, graph integration tests, LangSmith eval pipelines, regression detection for schema drift and routing shifts, and a CI pipeline that gates on quality.

Quick Reference

→Unit test tool functions directly: mock external APIs with patch(), call tool.ainvoke(), assert output shape and error propagation
→Test graph nodes independently via graph.nodes['node_name'].invoke(state) — no full agent invocation needed
→Integration test the full graph with FakeMessagesListChatModel + MemorySaver: deterministic, free, no API keys
→Use langsmith.evaluate() with a golden dataset on PRs; fail CI if avg score drops below threshold
→Detect schema drift by asserting tool.args_schema.model_json_schema() in tests — catches silent failures before they reach prod
→Detect cost regression by counting ToolMessages in result['messages'] and asserting the count stays at or below an expected budget
→Unit and integration tests need zero API keys; if a test requires ANTHROPIC_API_KEY, you are not mocking correctly
→Run eval tests only on pull requests — not every push — to control cost without sacrificing quality gates
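
Two of the checks above, the tool-call budget and the average-score gate, reduce to a few lines of plain Python. The sketch below is illustrative only: `count_tool_messages`, `assert_tool_budget`, and `gate_on_average` are hypothetical helper names, not LangChain or LangSmith APIs. It assumes each message exposes a `type` attribute that equals `"tool"` for tool results (as LangChain's `ToolMessage` does) and that eval scores have already been collected into a list of floats.

```python
from statistics import mean
from types import SimpleNamespace


def count_tool_messages(messages):
    """Count tool results in an agent transcript.

    Assumes each message has a `type` attribute and that tool results
    use type == "tool" (the value LangChain's ToolMessage reports).
    """
    return sum(1 for m in messages if getattr(m, "type", None) == "tool")


def assert_tool_budget(messages, max_calls):
    """Fail the test if the agent used more tool calls than budgeted."""
    used = count_tool_messages(messages)
    assert used <= max_calls, (
        f"cost regression: {used} tool calls exceeds budget of {max_calls}"
    )


def gate_on_average(scores, threshold):
    """Return True when the mean eval score meets the threshold."""
    if not scores:
        raise ValueError("no eval scores collected")
    return mean(scores) >= threshold


# Stand-in transcript; in a real test this would be result["messages"].
transcript = [
    SimpleNamespace(type="human", content="What's the weather in Oslo?"),
    SimpleNamespace(type="ai", content=""),
    SimpleNamespace(type="tool", content="4C, cloudy"),
    SimpleNamespace(type="ai", content="It is 4C and cloudy in Oslo."),
]

# One tool call was made, so a budget of one passes.
assert_tool_budget(transcript, max_calls=1)
```

In CI, a gate like `gate_on_average` would typically decide the job's exit code, so a drop below the threshold blocks the pull request rather than only logging a warning.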

When NOT to Build an Agent Test Suite

Don't build a test suite before you have something worth testing

The most common mistake is investing in eval infrastructure for an agent that changes every week. A golden dataset collected against last week's prompt is worthless this week. Build tests incrementally as each layer stabilizes.

Build your test suite incrementally — each level earns its cost before adding the next

Agent Maturity	What to Build	What to Skip
Prototype (< 1 week old)	Tool unit tests for any stable tool functions	Eval datasets, routing tests, CI stages
Early production (real users, changing prompt)	Tool + integration tests; occasional manual eval	Automated eval pipelines (prompt changes too fast)
Stable production (prompt frozen ≥ 2 weeks)	Full suite: unit + integration + LangSmith evals + regression gates	Nothing — this is when the full pipeline pays off
Post-regression (quality dropped after model update)	Regression tests for the specific failure; add to CI	Rewriting the whole suite — fix the specific gap first

The investment decision is about signal stability. A unit test for a tool function is worth it from day one — tool logic is deterministic and doesn't change with prompt iterations. An eval dataset is worth it once your prompt has been stable for at least two weeks, because dataset examples become stale the moment the task definition shifts.

The Agent Testing Pyramid

Start at the base — tool unit tests catch the most bugs per hour of effort

Tool Unit Tests

Tool tests are pure function tests: given specific inputs, assert specific outputs or error behavior. Mock all external dependencies to keep tests fast and deterministic. The most important non-obvious thing to test is the tool schema — the LLM calls your tool based on its schema, and a wrong schema produces silent failures.
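
As a rough illustration of the three checks described above (output shape, error propagation, and schema stability), here is a minimal sketch that uses only the standard library. Everything in it is hypothetical: `WeatherAPI`, `get_weather`, and `arg_types` are stand-ins invented for this example. With a real LangChain tool you would call `tool.ainvoke(...)` instead of the bare function and compare `tool.args_schema.model_json_schema()` against a stored snapshot instead of using `arg_types`.

```python
import asyncio
import inspect
from unittest.mock import AsyncMock, patch


class WeatherAPI:
    """Hypothetical external client; a real one would make an HTTP call."""

    async def current(self, city: str) -> dict:
        raise RuntimeError("network access is not allowed in unit tests")


api = WeatherAPI()


async def get_weather(city: str) -> dict:
    """Tool body (assumed). In LangChain this would be wrapped as a tool."""
    if not city.strip():
        raise ValueError("city must be non-empty")
    raw = await api.current(city)
    # Return only the fields the agent is meant to see.
    return {"city": city, "temp_c": raw["temp_c"]}


def arg_types(func):
    """Stand-in for a schema snapshot: argument names mapped to type names."""
    params = inspect.signature(func).parameters
    return {name: p.annotation.__name__ for name, p in params.items()}


def test_output_shape():
    fake = AsyncMock(return_value={"temp_c": 4.0, "humidity": 80})
    with patch.object(api, "current", new=fake):
        result = asyncio.run(get_weather("Oslo"))
    assert result == {"city": "Oslo", "temp_c": 4.0}
    fake.assert_awaited_once_with("Oslo")


def test_error_propagates():
    fake = AsyncMock(side_effect=TimeoutError("upstream timeout"))
    with patch.object(api, "current", new=fake):
        try:
            asyncio.run(get_weather("Oslo"))
        except TimeoutError:
            return  # the tool did not swallow the upstream failure
    raise AssertionError("expected TimeoutError to propagate")


def test_schema_is_stable():
    # Renaming or retyping an argument changes what the LLM is told to send,
    # so this assertion fails loudly instead of the agent failing silently.
    assert arg_types(get_weather) == {"city": "str"}
```

Patching the client object rather than the tool keeps the tool's own validation and mapping logic under test, which is where most tool bugs tend to live.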
