Debugging Production Agents
A decision-ordered guide to debugging production AI agents: choose the right observability tooling before writing custom code, build an investigation workflow that goes from user complaint to root cause in under 10 minutes, use LangGraph time-travel for deterministic replay, handle PII compliance in trace storage, and convert every incident into a regression test.
Quick Reference
- Decision: LangSmith for LangChain/LangGraph stacks, OTel for existing APM infra, or both via the LangSmith OTLP endpoint
- Investigation: complaint → trace ID → failing span → root cause in under 10 minutes via the LangSmith SDK or Fetch CLI
- Classify root causes: RETRIEVAL_MISS | STALE_DATA | HALLUCINATION | TOOL_FAILURE | CONTEXT_OVERFLOW
- Replay: LangGraph get_state_history() + update_state(); add idempotency guards to side-effecting tools before replaying
- Sampling: 100% of errors + 10–20% of successes; estimate trace storage cost before shipping to production
- PII: hide_inputs/hide_outputs (LangSmith) or the attributes/redaction processors (OTel Collector); never store raw PII in trace metadata
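The sampling rule and the storage estimate can be sketched in a few lines. This is a minimal illustration, not a LangSmith or OTel API: the function names, the hash-based bucketing, and the example traffic numbers are all assumptions made for the sketch.

```python
import hashlib

SUCCESS_RATE = 0.10  # assumed default; the guidance above suggests 10-20%


def should_keep_trace(trace_id: str, had_error: bool,
                      success_rate: float = SUCCESS_RATE) -> bool:
    """Keep every errored trace and a fixed fraction of successful ones."""
    if had_error:
        return True
    # Hash the trace ID so the keep/drop decision is deterministic per trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < success_rate


def estimate_monthly_storage_gb(traces_per_day: int, error_fraction: float,
                                avg_trace_kb: float,
                                success_rate: float = SUCCESS_RATE,
                                days: int = 30) -> float:
    """Rough storage estimate for the traces that survive sampling."""
    kept_fraction = error_fraction + (1 - error_fraction) * success_rate
    kept_per_day = traces_per_day * kept_fraction
    return kept_per_day * avg_trace_kb * days / 1_000_000  # KB -> GB (decimal)


# Hypothetical workload: 100,000 traces/day, 2% errors, 50 KB per trace.
monthly_gb = estimate_monthly_storage_gb(100_000, 0.02, 50)
```

With these hypothetical inputs the estimate works out to about 17.7 GB per month; multiply by your backend's per-GB price and retention period to get a cost figure.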
Observability: Build, Buy, or Both?
Figure: Observability stack, with data flowing from raw execution up to human-readable dashboards.
Before writing a single line of tracing code, answer this question: do you already have observability infrastructure? Three paths exist for AI agent observability in 2026, and the right choice depends on your existing stack, not your agent framework. LangSmith and OpenTelemetry both have mature AI agent support with standardized span hierarchies — building a custom tracing system from scratch is almost never justified.
| Signal | LangSmith | OTel + Your APM | Both (LangSmith OTLP) |
|---|---|---|---|
| You use LangChain or LangGraph | ✓ Best choice — auto-instrumented, trace trees built in | Possible but requires manual setup | Good for large teams needing central APM |
| You already have Datadog, Grafana, or Jaeger | Use alongside — LangSmith for AI-specific views | ✓ Best choice — extend existing infra | ✓ Best choice — feed LangSmith data into your APM |
| Data must stay on-prem or single-region | Not on the SaaS plans (self-hosting is an Enterprise option) | ✓ Self-host Jaeger, Grafana Tempo, or Langfuse | Self-host the OTel backend; LangSmith SaaS not eligible |
| Team owns ML/AI workloads only | ✓ Simplest path — one tool for traces and evals | Requires DevOps coordination | Overkill unless org scale demands it |
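The table can be read as a small decision procedure. The sketch below is one possible encoding, not a rule from any vendor: the function name, the `Path` enum, and the order in which signals are checked (data residency first, then existing APM, then framework) are assumptions.

```python
from enum import Enum


class Path(Enum):
    LANGSMITH = "LangSmith"
    OTEL_APM = "OTel + existing APM"
    BOTH = "Both via LangSmith OTLP endpoint"
    OTEL_SELF_HOSTED = "Self-hosted OTel backend"


def choose_observability_path(uses_langchain: bool,
                              has_existing_apm: bool,
                              data_must_stay_on_prem: bool) -> Path:
    # Assumption: data residency is a hard constraint, so it is checked first.
    if data_must_stay_on_prem:
        return Path.OTEL_SELF_HOSTED
    # Existing APM: extend it, and add LangSmith views for LangChain stacks.
    if has_existing_apm:
        return Path.BOTH if uses_langchain else Path.OTEL_APM
    # No existing infrastructure: one tool for traces and evals is simplest.
    # (Assumption: this also applies to non-LangChain stacks, which the
    # table does not cover explicitly.)
    return Path.LANGSMITH
```

Treat the ordering as a starting point; a team with strong platform support might reasonably prefer the OTel path even without an existing APM.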
If data sovereignty requirements prevent any third-party service from seeing your traces, build custom trace collection — but use the OpenTelemetry wire format and self-hosted backends (Jaeger, Grafana Tempo, Langfuse). Don't invent a new schema. The OTel GenAI semantic conventions (stable 2026) define standard span attributes for LLM calls, tool executions, and agent invocations — use gen_ai.operation.name, gen_ai.agent.name, gen_ai.conversation.id.
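As a rough illustration of reusing the standard schema, the helper below builds a span attribute dictionary from the three attribute keys named above. The helper name, the example values, and the `app.tenant` key are hypothetical; only the `gen_ai.*` keys come from the conventions.

```python
from typing import Dict, Optional


def genai_span_attributes(operation: str, agent_name: str,
                          conversation_id: str,
                          extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build span attributes using OTel GenAI attribute keys."""
    attrs = {
        "gen_ai.operation.name": operation,         # e.g. "invoke_agent"
        "gen_ai.agent.name": agent_name,            # hypothetical agent name
        "gen_ai.conversation.id": conversation_id,  # ties spans to one session
    }
    if extra:
        # Application-specific keys should live outside the gen_ai namespace.
        attrs.update(extra)
    return attrs


attrs = genai_span_attributes(
    operation="invoke_agent",
    agent_name="support-bot",
    conversation_id="conv-123",
    extra={"app.tenant": "acme"},
)
```

A dictionary like this can be passed to a span's `set_attributes()` call in the OpenTelemetry Python SDK, so any OTel-compatible backend sees the same field names whether the spans come from custom collection or an instrumented framework.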
LangSmith setup and auto-instrumentation are covered in the LangSmith article. OpenTelemetry agent instrumentation is covered in the distributed tracing article. This article assumes traces already exist and focuses on how to use them when something goes wrong.