Intermediate · 9 min
Logging, Metrics & Alerting
Building production dashboards for AI agents: structured logging, custom metrics (latency, cost, completion rate), and alerting on anomalies.
Quick Reference
- Structured logging: emit JSON logs with correlation_id, user_id, agent_version, node_name, and duration for every graph step
- Key agent metrics: p50/p95 latency per node, tokens per conversation, cost per conversation, task completion rate, tool error rate
- Use metric cardinality wisely: label by agent_version and node_name, not by user_id or conversation_id (too many labels)
- Alert on rate-of-change: a sudden spike in tool errors or token usage per request often signals a prompt or model regression
- Build a single-pane dashboard showing agent health: request rate, error rate, latency distribution, and cost trend
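The metrics above can be derived from raw per-step records. A minimal pure-Python sketch (the `StepRecord` shape and function names are illustrative, not from this article; a production system would use Prometheus or Datadog histograms instead):

```python
# Illustrative sketch: p50/p95 latency per node and tool error rate
# computed from per-step records. StepRecord and both functions are
# hypothetical names; real systems would emit these as histogram/counter
# metrics labeled by node_name and agent_version (low cardinality).
import statistics
from dataclasses import dataclass

@dataclass
class StepRecord:
    node_name: str      # safe, low-cardinality label
    latency_ms: float
    tool_error: bool

def node_latency_percentiles(records):
    """p50/p95 latency per node_name (never label by user_id)."""
    by_node = {}
    for r in records:
        by_node.setdefault(r.node_name, []).append(r.latency_ms)
    out = {}
    for node, vals in by_node.items():
        # statistics.quantiles with n=20 yields 19 cut points:
        # index 9 is the 50th percentile, index 18 the 95th
        q = statistics.quantiles(sorted(vals), n=20)
        out[node] = {"p50": q[9], "p95": q[18]}
    return out

def tool_error_rate(records):
    """Fraction of steps whose tool call failed."""
    if not records:
        return 0.0
    return sum(r.tool_error for r in records) / len(records)
```

Keeping labels to `node_name` and `agent_version` keeps time-series cardinality bounded; per-user breakdowns belong in logs, not metrics.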
Structured Logging for Agents
JSON logs, not print statements
Emit structured JSON logs with correlation_id, user_id, node_name, and duration at every graph step. Structured logs are queryable in log aggregation tools (CloudWatch, Datadog Logs, Loki). Unstructured logs are not.
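As a minimal standard-library sketch of that pattern (the field names follow the text; the `log_step` helper is a hypothetical name, and structlog produces the same JSON shape with less wiring):

```python
# Hypothetical sketch: one JSON log line per graph step, stdlib only.
# Field names (correlation_id, user_id, agent_version, node_name,
# duration) come from the text; log_step itself is illustrative.
import json
import time
import uuid
from contextlib import contextmanager

@contextmanager
def log_step(node_name, correlation_id, user_id, agent_version, sink=print):
    """Wrap a graph-node invocation and emit a queryable JSON record."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        sink(json.dumps({
            "event": "node_completed",
            "correlation_id": correlation_id,
            "user_id": user_id,
            "agent_version": agent_version,
            "node_name": node_name,
            "duration_ms": round(duration_ms, 2),
        }))

# Usage: wrap each node body; sink=print would go to stdout in practice.
lines = []
with log_step("retrieve_docs", correlation_id=str(uuid.uuid4()),
              user_id="u-123", agent_version="v42", sink=lines.append):
    pass  # node body runs here
```

Because every record is valid JSON with a shared key set, a single `correlation_id` filter in CloudWatch, Datadog, or Loki reconstructs the full path of one request through the graph.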
Structured logging with structlog at every graph node boundary