Intermediate · 13 min

Logging, Metrics & Alerting

Learn when to build custom observability versus use a managed platform, then build it right: structured logging with correlation IDs, Prometheus metrics with cardinality discipline, and rate-of-change alerts that catch regressions before your users do.

Quick Reference

→Build custom observability (Prometheus + structlog) when you have existing Grafana infrastructure or need custom business metrics; use LangSmith or Langfuse for LLM-native observability with near-zero config
→Structured logging: emit JSON logs with correlation_id, node_name, agent_version, and duration_ms at every graph boundary — never log full prompt text in production
→The six metrics that matter: node p50/p95 latency, tokens per conversation, cost per conversation, task completion rate, tool error rate, retry count
→Cardinality rule: only label Prometheus metrics by values with fewer than 100 possible options. node_name and agent_version are safe; user_id and conversation_id create a new time series for every distinct value and will exhaust Prometheus memory at scale
→Alert on rate-of-change: a spike from 1% to 5% error rate in 5 minutes is an incident; a steady 3% error rate is normal non-determinism — alerting on the level trains your team to ignore pages
→Cost-per-conversation is the metric that catches prompt regressions invisible to error rate and latency: an extra reasoning loop doubles cost before it degrades user-visible quality
→Write a runbook for every alert before it fires in production — an alert without next steps is noise that erodes on-call trust
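
The cardinality rule above can be enforced in code rather than by convention. The following is a minimal sketch using only the Python standard library, not the Prometheus client API: `BoundedCounter`, `ALLOWED_LABELS`, and the "other" overflow bucket are hypothetical names chosen for illustration. It rejects label names that are not on an allowlist and collapses any label that exceeds 100 distinct values.

```python
from collections import Counter

# Assumption: only these label names are considered bounded enough to use.
ALLOWED_LABELS = {"node_name", "agent_version"}
# The article's rule of thumb: fewer than 100 possible values per label.
MAX_VALUES_PER_LABEL = 100


class BoundedCounter:
    """Toy in-memory counter that refuses unbounded label sets (hypothetical helper)."""

    def __init__(self, name):
        self.name = name
        self.series = Counter()  # one entry per unique label combination
        self._seen = {}          # label name -> set of values observed so far

    def inc(self, amount=1, **labels):
        unknown = set(labels) - ALLOWED_LABELS
        if unknown:
            # user_id, conversation_id and similar labels are rejected outright.
            raise ValueError(f"high-cardinality labels not allowed: {sorted(unknown)}")
        safe = {}
        for key, value in labels.items():
            seen = self._seen.setdefault(key, set())
            if value not in seen and len(seen) >= MAX_VALUES_PER_LABEL:
                value = "other"  # collapse overflow into a single series
            else:
                seen.add(value)
            safe[key] = value
        self.series[tuple(sorted(safe.items()))] += amount


tool_errors = BoundedCounter("tool_errors_total")
tool_errors.inc(node_name="retrieve", agent_version="1.4.0")
```

The same guard could wrap a real metrics client; the point of the sketch is that the check happens at the call site, before a new time series is created.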

Should You Build Custom Observability?

Start with a managed platform

For LangChain and LangGraph applications, LangSmith gives you full traces with two environment variables and zero code changes. Langfuse is open-source and self-hostable. Braintrust adds eval-first observability on top. These platforms are LLM-native: they understand token counts, prompt inspection, and run comparison out of the box. Build custom Prometheus instrumentation only when you have a specific reason.

Logs answer "what happened," metrics answer "how much," traces answer "why"

The three questions that determine your stack: (1) Do you already have Prometheus and Grafana in production for your other services? If yes, building custom metrics means agents appear in the same dashboards as your APIs and databases — no new vendor, no new context-switch. (2) Do you need custom business metrics tied to agent behavior — revenue per conversation, lead quality score, or document processing cost? Managed platforms don't expose hooks for arbitrary business logic. (3) Do you need data sovereignty — all telemetry on your own infrastructure with no data leaving your VPC? If none of these apply, start with LangSmith and come back to this article when you outgrow it.

Structured Logging for Agents

JSON logs, not print statements

Emit structured JSON logs with correlation_id, node_name, agent_version, and duration_ms at every graph node boundary. Structured logs are queryable in log aggregation tools (CloudWatch Insights, Datadog Logs, Loki). Free-text logs are grep-only — they don't scale past one engineer searching manually.
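
As an illustration of the fields listed above, here is a minimal sketch built on the standard library `logging` module rather than structlog. `JsonFormatter`, `node_boundary`, and the `AGENT_VERSION` value are hypothetical names introduced for this example; the sketch assumes each graph node can be wrapped in a context manager.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

AGENT_VERSION = "1.4.0"  # assumption: set from your build or deploy metadata


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            **getattr(record, "fields", {}),  # structured fields passed via `extra`
        }
        return json.dumps(payload)


logger = logging.getLogger("agent")
_handler = logging.StreamHandler()
_handler.setFormatter(JsonFormatter())
logger.addHandler(_handler)
logger.setLevel(logging.INFO)


@contextmanager
def node_boundary(node_name, correlation_id):
    """Log one structured line when a graph node finishes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000, 2)
        # Only metadata is logged; prompt text is deliberately left out.
        logger.info(
            "node_finished",
            extra={"fields": {
                "correlation_id": correlation_id,
                "node_name": node_name,
                "agent_version": AGENT_VERSION,
                "duration_ms": duration_ms,
            }},
        )


# Usage: one correlation_id per conversation, reused across every node.
correlation_id = str(uuid.uuid4())
with node_boundary("retrieve", correlation_id):
    pass  # node work goes here
```

Because every line carries the same `correlation_id`, a single query in the log aggregation tool can reconstruct the full path of one conversation through the graph.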

The Six Metrics That Matter

Metric	What It Diagnoses	Alert Threshold
P50/P95 latency per node	Bottleneck nodes; slow model responses	P95 > 2× baseline for 10 min
Tokens per conversation	Context growth; prompt bloat regressions	Rate > 2× 24h average
Cost per conversation	Prompt regressions; extra reasoning loops	Rate > 2× 24h average
Task completion rate	Broken terminal states; logic regressions	Drop > 10% vs baseline
Tool error rate	Broken tool integrations; API changes upstream	Rate > 3× baseline in 5 min
Retry count	Inference failures; agent stuck in loops	Mean retries > 2 per conversation
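
Most thresholds in the table are relative to a baseline rather than fixed levels. The sketch below shows one way to evaluate the tool error rate rule ("Rate > 3× baseline in 5 min") in plain Python. The function names and the `min_events` guard are assumptions added for illustration; in a Prometheus setup the same comparison would normally live in an alerting rule rather than in application code.

```python
def error_rate(errors, total):
    """Fraction of failed calls; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def should_alert(window_errors, window_total, baseline_rate,
                 multiplier=3.0, min_events=20):
    """Fire only when the recent window exceeds `multiplier` times the baseline.

    window_*      : counts from the last 5 minutes
    baseline_rate : longer-term error rate (for example, the trailing 24h)
    min_events    : assumption, skips windows too small to be meaningful
    """
    if window_total < min_events:
        return False
    return error_rate(window_errors, window_total) > multiplier * baseline_rate


# A jump from a 1% baseline to 5% in the window is an incident.
print(should_alert(window_errors=5, window_total=100, baseline_rate=0.01))  # True

# A steady 3% against a 3% baseline is normal and stays quiet.
print(should_alert(window_errors=3, window_total=100, baseline_rate=0.03))  # False
```

Comparing against the baseline is what keeps a steady, expected error level from paging anyone while still catching a sudden change.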
