Advanced · 11 min
Testing Agents in CI
Unit testing tools, integration testing full graphs, snapshot testing outputs, mocking LLM responses, and building CI pipelines for agent systems.
Quick Reference
- Unit test tools independently: mock inputs, call the tool function, assert the output shape and content
- Integration test full graphs: create a graph with MemorySaver, invoke with test inputs, assert the final state
- Mock LLM responses with FakeListChatModel or recorded traces to make tests deterministic and fast
- Snapshot test agent outputs: record a baseline response, then assert that future runs produce equivalent (not identical) output
- Run agent tests in CI with a timeout: a hanging agent blocks the entire pipeline
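The mocking point above is the foundation of the deterministic tiers. As a minimal sketch of the idea behind a fake chat model (mimicking the behavior of LangChain's FakeListChatModel without depending on it), the class below replays a scripted list of responses in order; the class name and scripted content are hypothetical:

```python
# Deterministic stand-in for an LLM: replays scripted responses in order,
# so node and graph tests are fast and 100% reproducible. This is a
# hand-rolled sketch, not the real FakeListChatModel API.
class ScriptedChatModel:
    def __init__(self, responses):
        self.responses = responses
        self.calls = []          # record every prompt for later assertions
        self._i = 0

    def invoke(self, prompt):
        self.calls.append(prompt)
        response = self.responses[self._i % len(self.responses)]
        self._i += 1
        return response

# Script the two turns we expect: a tool call, then a final answer.
llm = ScriptedChatModel([
    '{"tool": "search_docs", "args": {"query": "rate limits"}}',
    "The rate limit is 100 requests per minute.",
])

first = llm.invoke("What is the rate limit?")
second = llm.invoke("Tool result: 100 req/min")
```

Because responses are scripted, a test can assert exactly which prompts the agent sent (`llm.calls`) and exercise routing logic without any network traffic.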
Testing Pyramid for Agents
The agent testing pyramid inverts the traditional one
Traditional software has many unit tests and few E2E tests. Agent systems need many integration tests because bugs emerge from LLM + tool + state interactions, not from individual functions.
| Level | What You Test | Speed | Determinism | Coverage | Mocking |
|---|---|---|---|---|---|
| Tool Unit Tests | Individual tool functions with mocked inputs | Fast (<1s) | 100% deterministic | Tool logic, input validation, error handling | Mock external APIs |
| Node Unit Tests | Single graph nodes with mocked LLM | Fast (<1s) | 100% deterministic | State transitions, node logic | Mock LLM with FakeListChatModel |
| Graph Integration Tests | Full graph with mocked LLM + real checkpointer | Medium (1-5s) | Deterministic | Routing, state flow, multi-turn behavior | Mock LLM, real MemorySaver |
| Eval Tests | Full graph with real LLM on eval dataset | Slow (30-120s) | Non-deterministic | Quality, correctness, tool selection | Real LLM, may mock external tools |
| E2E / Smoke Tests | Full stack including API, auth, and streaming | Slow (10-60s) | Non-deterministic | Deployment health, integration with infra | Nothing mocked |
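The "equivalent, not identical" snapshot idea from the table's non-deterministic tiers can be sketched with plain normalization plus fact checks; the helper names and sample strings below are hypothetical:

```python
import re

def normalize(text):
    """Collapse whitespace and case so cosmetic LLM variation doesn't fail the test."""
    return re.sub(r"\s+", " ", text.strip().lower())

def assert_equivalent(current, required_facts):
    # Snapshot equivalence: instead of byte-for-byte equality against the
    # baseline, assert that every key fact from the baseline survives.
    norm = normalize(current)
    missing = [f for f in required_facts if normalize(f) not in norm]
    assert not missing, f"snapshot drift, missing facts: {missing}"

# Baseline run recorded earlier:
#   "The API rate limit is 100 requests per minute per key."
# A later run with different phrasing should still pass:
current = "Each key may make up to 100 requests  per minute (the API rate limit)."
assert_equivalent(current, ["100 requests per minute", "rate limit"])
```

Exact-string snapshots break on every harmless rewording; extracting the facts that matter into `required_facts` keeps the test strict about correctness but tolerant of phrasing.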
Start with tool unit tests -- they catch the most bugs with the least effort. A broken tool causes the agent to produce wrong results, retry unnecessarily, or hallucinate alternative approaches. Every tool should have tests for the happy path, error cases, and malformed inputs.
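As a sketch of those three cases, here is a hypothetical `get_weather` tool with its external API client injected as a parameter so tests can stub it out; the function, field names, and stubs are illustrative assumptions, not a specific library's API:

```python
# Hypothetical tool: the api_client dependency is injected so tests can
# replace the real HTTP call with a stub.
def get_weather(city, api_client):
    if not isinstance(city, str) or not city.strip():
        raise ValueError("city must be a non-empty string")
    try:
        data = api_client(city)
    except ConnectionError:
        # Report upstream failure in the output shape instead of raising,
        # so the agent sees a structured error rather than a crash.
        return {"ok": False, "error": "weather service unavailable"}
    return {"ok": True, "city": city, "temp_c": data["temp_c"]}

# Happy path: stubbed API returns a record, tool shapes it correctly.
result = get_weather("Oslo", lambda c: {"temp_c": 4})
assert result == {"ok": True, "city": "Oslo", "temp_c": 4}

# Error case: upstream failure is caught and surfaced, not raised.
def failing_client(c):
    raise ConnectionError
assert get_weather("Oslo", failing_client)["ok"] is False

# Malformed input: empty city is rejected before any API call happens.
try:
    get_weather("", lambda c: {"temp_c": 4})
    raised = False
except ValueError:
    raised = True
assert raised
```

Injecting the client also keeps these tests in the fast, fully deterministic tier of the pyramid: no network, no flakiness.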