Testing Agents in CI

Unit testing tools, integration testing full graphs, snapshot testing outputs, mocking LLM responses, and building CI pipelines for agent systems.

Quick Reference

  • Unit test tools independently: mock inputs, call the tool function, assert the output shape and content
  • Integration test full graphs: create a graph with MemorySaver, invoke with test inputs, assert final state
  • Mock LLM responses with FakeListChatModel or recorded traces to make tests deterministic and fast
  • Snapshot test agent outputs: record a baseline response, assert future runs produce equivalent (not identical) output
  • Run agent tests in CI with a timeout — hanging agents in CI block the entire pipeline
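The snapshot bullet above asks for "equivalent (not identical)" output. One possible reading is sketched below, assuming each recorded run is a dict with a `tool_calls` list and an `answer` string. That record shape, the `is_equivalent` helper, and the 0.6 overlap threshold are all illustrative assumptions, not part of any library. The check requires the same tool sequence and a sufficiently large word overlap between answers, so harmless rewording does not fail the test.

```python
import re


def _tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_equivalent(baseline, current, min_overlap=0.6):
    """Hypothetical snapshot check: same tools in the same order, similar answer.

    `min_overlap` is the minimum Jaccard similarity between the word sets
    of the two answers. The default of 0.6 is an arbitrary starting point.
    """
    if baseline["tool_calls"] != current["tool_calls"]:
        return False
    a, b = _tokens(baseline["answer"]), _tokens(current["answer"])
    if not a and not b:
        return True  # two empty answers count as equivalent
    return len(a & b) / len(a | b) >= min_overlap


# Recorded baseline (in a real suite this would be loaded from a file).
baseline = {
    "tool_calls": ["lookup_order"],
    "answer": "Your order 123 shipped on Monday.",
}

# Reworded but equivalent: passes.
reworded = {"tool_calls": ["lookup_order"], "answer": "Order 123 shipped on Monday!"}

# Different content: fails even though the tool sequence matches.
different = {"tool_calls": ["lookup_order"], "answer": "I could not find that order."}

assert is_equivalent(baseline, reworded)
assert not is_equivalent(baseline, different)
```

Word overlap is a crude measure. Teams that need stricter guarantees may prefer to compare structured fields extracted from the answer, or to use an embedding or LLM-based judge in the slower eval tier.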

Testing Pyramid for Agents

The agent testing pyramid inverts the traditional one

Traditional software has many unit tests and few E2E tests. Agent systems need many integration tests because bugs emerge from LLM + tool + state interactions, not from individual functions.

| Level | What You Test | Speed | Determinism | Coverage | Mocking |
| --- | --- | --- | --- | --- | --- |
| Tool Unit Tests | Individual tool functions with mocked inputs | Fast (<1s) | 100% deterministic | Tool logic, input validation, error handling | Mock external APIs |
| Node Unit Tests | Single graph nodes with mocked LLM | Fast (<1s) | 100% deterministic | State transitions, node logic | Mock LLM with FakeListChatModel |
| Graph Integration Tests | Full graph with mocked LLM + real checkpointer | Medium (1-5s) | Deterministic | Routing, state flow, multi-turn behavior | Mock LLM, real MemorySaver |
| Eval Tests | Full graph with real LLM on eval dataset | Slow (30-120s) | Non-deterministic | Quality, correctness, tool selection | Real LLM, may mock external tools |
| E2E / Smoke Tests | Full stack including API, auth, and streaming | Slow (10-60s) | Non-deterministic | Deployment health, integration with infra | Nothing mocked |
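To make the node unit test level concrete, here is a minimal sketch of testing a single node against a scripted model. To keep it dependency-free, it uses a hand-written `ScriptedLLM` class as a stand-in for a fake chat model such as FakeListChatModel. `ScriptedLLM`, `route_node`, and the state keys are hypothetical names chosen for illustration, and the stand-in returns plain strings rather than message objects.

```python
class ScriptedLLM:
    """Hypothetical stand-in for a fake chat model: returns canned replies in order."""

    def __init__(self, responses):
        self._responses = list(responses)
        self.calls = []  # every prompt the node sent, for later assertions

    def invoke(self, prompt):
        self.calls.append(prompt)
        if not self._responses:
            raise AssertionError("LLM was called more times than scripted")
        return self._responses.pop(0)


def route_node(state, llm):
    """Hypothetical node: ask the LLM to classify the question, then record a route."""
    reply = llm.invoke(f"Classify as 'search' or 'answer': {state['question']}")
    label = reply.strip().lower()
    # Fall back to 'answer' if the model replies with anything unexpected.
    route = label if label in {"search", "answer"} else "answer"
    return {**state, "route": route}


def test_routes_to_search():
    llm = ScriptedLLM(["Search\n"])
    result = route_node({"question": "What changed in the latest release?"}, llm)
    assert result["route"] == "search"
    assert len(llm.calls) == 1  # the node should call the model exactly once


def test_unexpected_reply_falls_back():
    llm = ScriptedLLM(["banana"])
    result = route_node({"question": "hello"}, llm)
    assert result["route"] == "answer"


test_routes_to_search()
test_unexpected_reply_falls_back()
```

Because the model's replies are scripted, the test is deterministic and runs in milliseconds. The same pattern extends to graph integration tests by scripting one reply per expected model call across the whole run.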

Start with tool unit tests: they catch the most bugs for the least effort. A broken tool causes the agent to produce wrong results, retry unnecessarily, or hallucinate alternative approaches. Every tool should have tests for the happy path, error cases, and malformed inputs.
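The three cases named above (happy path, error cases, malformed inputs) might look like the following sketch. The `get_weather` tool, its return shape, and the `weather.example` URL are hypothetical and exist only for illustration. The external API is injected as `http_get`, so the tests can replace it with `unittest.mock.Mock`.

```python
import json
from unittest.mock import Mock
from urllib.parse import quote


def get_weather(city, http_get):
    """Hypothetical tool: look up the temperature for a city.

    `http_get` is injected so tests can replace the external API.
    The tool always returns a dict with an `ok` flag, so the agent
    receives a predictable shape even on failure.
    """
    if not isinstance(city, str) or not city.strip():
        return {"ok": False, "error": "city must be a non-empty string"}
    city = city.strip()
    try:
        raw = http_get(f"https://weather.example/api?city={quote(city)}")
        payload = json.loads(raw)
        return {"ok": True, "city": city, "temp_c": float(payload["temp_c"])}
    except (KeyError, ValueError, TypeError):
        # Covers invalid JSON, a missing field, and a non-numeric temperature.
        return {"ok": False, "error": "unexpected response from weather API"}
    except ConnectionError as exc:
        return {"ok": False, "error": f"weather API unavailable: {exc}"}


def test_happy_path():
    http_get = Mock(return_value='{"temp_c": 21.5}')
    result = get_weather("Oslo", http_get)
    assert result == {"ok": True, "city": "Oslo", "temp_c": 21.5}
    http_get.assert_called_once()


def test_api_error():
    http_get = Mock(side_effect=ConnectionError("timeout"))
    result = get_weather("Oslo", http_get)
    assert result["ok"] is False
    assert "unavailable" in result["error"]


def test_malformed_input():
    http_get = Mock()
    for bad in ["", "   ", None, 42]:
        assert get_weather(bad, http_get)["ok"] is False
    http_get.assert_not_called()  # invalid input must never reach the API


def test_malformed_response():
    http_get = Mock(return_value="not json")
    assert get_weather("Oslo", http_get)["ok"] is False


for test in (test_happy_path, test_api_error, test_malformed_input, test_malformed_response):
    test()
```

Returning a structured error instead of raising is one design choice among several. It lets the agent see the failure and decide how to react, rather than having an exception interrupt the run.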