Logging, Metrics & Alerting
Learn when to build custom observability versus use a managed platform, then build it right: structured logging with correlation IDs, Prometheus metrics with cardinality discipline, and rate-of-change alerts that catch regressions before your users do.
Quick Reference
- →Build custom observability (Prometheus + structlog) when you have existing Grafana infrastructure or need custom business metrics; use LangSmith or Langfuse for LLM-native observability with near-zero config
- →Structured logging: emit JSON logs with correlation_id, node_name, agent_version, and duration_ms at every graph boundary — never log full prompt text in production
- →The six metrics that matter: node p50/p95 latency, tokens per conversation, cost per conversation, task completion rate, tool error rate, retry count
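- →Cardinality rule: only label Prometheus metrics by values with fewer than 100 possible options. node_name and agent_version are safe; user_id and conversation_id are not, because every distinct label value creates its own time series and unbounded labels can exhaust Prometheus memory at scale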
- →Alert on rate-of-change: a spike from 1% to 5% error rate in 5 minutes is an incident; a steady 3% error rate is normal non-determinism — alerting on the level trains your team to ignore pages
- →Cost-per-conversation is the metric that catches prompt regressions invisible to error rate and latency: an extra reasoning loop doubles cost before it degrades user-visible quality
- →Write a runbook for every alert before it fires in production — an alert without next steps is noise that erodes on-call trust
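As a rough illustration of the structured-logging bullet, the sketch below emits one JSON log line per node call with correlation_id, node_name, agent_version, and duration_ms. It uses only the Python standard library rather than structlog, and the names `correlation_id`, `JsonFormatter`, `build_logger`, and `run_node` are hypothetical helpers invented for this example, not part of any framework.

```python
import io
import json
import logging
import time
from contextvars import ContextVar

# Hypothetical context variable: set once per conversation, read on every log call.
correlation_id = ContextVar("correlation_id", default="unset")


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object carrying the boundary fields."""

    BOUNDARY_FIELDS = ("node_name", "agent_version", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Copy only the whitelisted fields passed through `extra=`.
        for field in self.BOUNDARY_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("agent")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def run_node(logger, node_name, fn, state, agent_version="v1"):
    """Call one graph node and log its duration; the state itself is never logged."""
    start = time.perf_counter()
    try:
        return fn(state)
    finally:
        logger.info(
            "node_finished",
            extra={
                "node_name": node_name,
                "agent_version": agent_version,
                "duration_ms": round((time.perf_counter() - start) * 1000, 2),
            },
        )


# Usage: one conversation, one node call, one JSON line.
buffer = io.StringIO()
logger = build_logger(buffer)
correlation_id.set("conv-123")
result = run_node(logger, "retrieve", lambda s: {**s, "docs": 2}, {"prompt": "secret text"})
record = json.loads(buffer.getvalue())
```

Because the formatter copies only a fixed set of fields, prompt text held in the node state cannot leak into the log line by accident. The same wrapper is a natural place to record a latency metric labeled by node_name and agent_version.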
Should You Build Custom Observability?
For LangChain and LangGraph applications, LangSmith gives you full traces with two environment variables and zero code changes. Langfuse is open-source and self-hostable. Braintrust adds eval-first observability on top. These platforms are LLM-native: they understand token counts, prompt inspection, and run comparison out of the box. Build custom Prometheus instrumentation only when you have a specific reason.
Logs answer "what happened," metrics answer "how much," traces answer "why"
Three questions determine your stack. (1) Do you already run Prometheus and Grafana in production for your other services? If yes, custom metrics put your agents in the same dashboards as your APIs and databases, with no new vendor and no extra context switching. (2) Do you need custom business metrics tied to agent behavior, such as revenue per conversation, lead quality score, or document processing cost? Managed platforms are built around LLM traces and typically offer limited hooks for arbitrary business logic. (3) Do you need data sovereignty, with all telemetry on your own infrastructure and no data leaving your VPC? If none of these apply, start with LangSmith and come back to this article when you outgrow it.
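The three questions can be read as a simple decision rule. The sketch below is one possible encoding; `StackContext`, `recommend_stack`, and the labels "start-managed" and "consider-custom" are hypothetical names chosen for illustration, not an official checklist.

```python
from dataclasses import dataclass


@dataclass
class StackContext:
    has_prometheus_grafana: bool   # question 1: existing Prometheus and Grafana
    needs_business_metrics: bool   # question 2: custom business metrics
    needs_data_sovereignty: bool   # question 3: telemetry must stay in your VPC


def recommend_stack(ctx: StackContext):
    """Return a recommendation label plus the reasons that triggered it."""
    reasons = []
    if ctx.has_prometheus_grafana:
        reasons.append("existing Prometheus and Grafana")
    if ctx.needs_business_metrics:
        reasons.append("custom business metrics")
    if ctx.needs_data_sovereignty:
        reasons.append("data sovereignty")
    # Any single "yes" is a specific reason to consider custom instrumentation.
    if reasons:
        return "consider-custom", reasons
    return "start-managed", reasons


greenfield = recommend_stack(StackContext(False, False, False))
platform_team = recommend_stack(StackContext(True, False, True))
```

Here `greenfield` comes back as "start-managed" with no reasons, while `platform_team` comes back as "consider-custom" with two reasons attached. Returning the reasons alongside the label keeps the trade-off visible when the decision is revisited later.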