Debugging Production Agents

A decision-ordered guide to debugging production AI agents: choose the right observability tooling before writing custom code, build an investigation workflow that goes from user complaint to root cause in under 10 minutes, use LangGraph time-travel for deterministic replay, handle PII compliance in trace storage, and convert every incident into a regression test.

Quick Reference

→Decision: LangSmith for LangChain/LangGraph stacks, OTel for existing APM infra, both via LangSmith OTLP endpoint
→Investigation: complaint → trace ID → failing span → root cause in under 10 minutes via LangSmith SDK or Fetch CLI
→Classify root causes: RETRIEVAL_MISS | STALE_DATA | HALLUCINATION | TOOL_FAILURE | CONTEXT_OVERFLOW
→Replay: LangGraph get_state_history() + update_state(); add idempotency guards to side-effecting tools before you replay anything
→Sampling: 100% of errors + 10–20% of successes; estimate trace storage cost before shipping to production
→PII: hide_inputs/hide_outputs (LangSmith) or the attributes and redaction processors (OTel Collector); never store raw PII in trace metadata
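
The sampling rule in the quick reference can be sketched in a few lines of plain Python. This is a minimal illustration, not a LangSmith or OTel API: the function names, the 10% default, and the 50 KB average trace size are assumptions chosen for the example. Hashing the trace ID keeps the decision deterministic, so every service that sees the same trace makes the same keep/drop choice. Because the error status is only known once the trace has finished, a rule like this has to run after the trace completes (tail-based), not at the first span.

```python
import hashlib


def should_keep_trace(trace_id: str, is_error: bool, success_rate: float = 0.10) -> bool:
    """Keep every errored trace; keep a deterministic fraction of successful ones."""
    if is_error:
        return True
    # Map the trace ID to a stable number in [0, 1) so the decision is repeatable.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


def estimate_monthly_storage_gb(
    traces_per_day: int,
    error_rate: float,
    success_rate: float,
    avg_trace_kb: float,
    days: int = 30,
) -> float:
    """Rough stored volume in decimal GB for the sampling policy above."""
    # All errors are kept, plus the sampled share of the non-error traffic.
    kept_fraction = error_rate + (1.0 - error_rate) * success_rate
    kept_kb = traces_per_day * kept_fraction * avg_trace_kb * days
    return kept_kb / 1_000_000  # KB -> GB
```

With the assumed inputs of 1,000,000 traces per day, a 2% error rate, 10% success sampling, and 50 KB per trace, estimate_monthly_storage_gb(1_000_000, 0.02, 0.10, 50.0) comes to roughly 177 GB per month. Plug in your own measured trace size before trusting the number.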

Observability: Build, Buy, or Both?

Observability stack — data flows from raw execution up to human-readable dashboards

Before writing a single line of tracing code, answer this question: do you already have observability infrastructure? Three paths exist for AI agent observability in 2026, and the right choice depends on your existing stack, not your agent framework. LangSmith and OpenTelemetry both have mature AI agent support with standardized span hierarchies — building a custom tracing system from scratch is almost never justified.

Signal	LangSmith	OTel + Your APM	Both (LangSmith OTLP)
You use LangChain or LangGraph	✓ Best choice — auto-instrumented, trace trees built in	Possible but requires manual setup	Good for large teams needing central APM
You already have Datadog, Grafana, or Jaeger	Use alongside — LangSmith for AI-specific views	✓ Best choice — extend existing infra	✓ Best choice — feed LangSmith data into your APM
Data must stay on-prem or single-region	SaaS plans not eligible; self-hosted LangSmith is an Enterprise option	✓ Self-host Jaeger, Grafana Tempo, or Langfuse	Self-host the OTel backend; LangSmith only if self-hosted
Team owns ML/AI workloads only	✓ Simplest path — one tool for traces and evals	Requires DevOps coordination	Overkill unless org scale demands it

When custom tracing makes sense

If data sovereignty requirements prevent any third-party service from seeing your traces, build custom trace collection — but use the OpenTelemetry wire format and self-hosted backends (Jaeger, Grafana Tempo, Langfuse). Don't invent a new schema. The OTel GenAI semantic conventions (stable 2026) define standard span attributes for LLM calls, tool executions, and agent invocations — use gen_ai.operation.name, gen_ai.agent.name, gen_ai.conversation.id.
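
To make "don't invent a new schema" concrete, here is a small sketch of a helper that builds the three attributes named above. The helper name, the agent name "support-agent", and the thread ID are hypothetical; "invoke_agent" is used as the operation value on the assumption that it matches the current GenAI conventions, so check the published list before relying on it. The code is deliberately plain Python with no OTel dependency.

```python
from typing import Dict, Optional


def genai_span_attributes(
    operation: str,
    agent_name: str,
    conversation_id: str,
    extra: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build span attributes keyed by the OTel GenAI convention names."""
    # Start from caller-supplied extras so they can never overwrite the
    # standard keys that dashboards and queries depend on.
    attrs: Dict[str, str] = dict(extra or {})
    attrs.update(
        {
            "gen_ai.operation.name": operation,
            "gen_ai.agent.name": agent_name,
            "gen_ai.conversation.id": conversation_id,
        }
    )
    return attrs


# Hypothetical values for illustration only.
example = genai_span_attributes("invoke_agent", "support-agent", "thread-123")
```

In an OpenTelemetry Python setup, a dictionary like this would typically be passed as the attributes argument when starting a span, which keeps the attribute names in one place instead of scattered across call sites.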

This article covers debugging workflow, not tool setup

LangSmith setup and auto-instrumentation is covered in the LangSmith article. OpenTelemetry agent instrumentation is covered in the distributed tracing article. This article assumes traces exist and focuses on how to use them when something goes wrong.

From Complaint to Root Cause in 10 Minutes

When a user reports 'the agent gave me wrong information about X', you need a fast, deterministic path from complaint to root cause. The chain is: user complaint → thread ID → trace ID → failing span → exact prompt and retrieved context that produced the wrong answer. If any link in this chain is missing or takes more than a minute to traverse, your instrumentation has gaps.
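
The chain can be sketched against in-memory data to show what each link has to provide. Everything here is a hypothetical stand-in: the Span fields, the two lookup dictionaries, and the function names are assumptions for illustration, and in a real system the lookups would be queries against your trace backend. The sketch picks the errored span that has no errored child, on the assumption that errors propagate upward and the deepest one is closest to the cause.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    error: Optional[str] = None  # None means the span finished without error


def find_failing_span(spans: List[Span]) -> Optional[Span]:
    """Return the earliest errored span that has no errored child."""
    errored = [s for s in spans if s.error is not None]
    # Parents of errored spans probably just re-raised the child's failure.
    parents_of_errored = {s.parent_id for s in errored}
    origins = [s for s in errored if s.span_id not in parents_of_errored]
    return min(origins, key=lambda s: s.start_ms, default=None)


def investigate(
    thread_id: str,
    trace_ids_by_thread: Dict[str, List[str]],
    spans_by_trace: Dict[str, List[Span]],
) -> Optional[Tuple[str, Span]]:
    """Follow thread ID -> trace IDs -> failing span; None if nothing errored."""
    for trace_id in trace_ids_by_thread.get(thread_id, []):
        span = find_failing_span(spans_by_trace.get(trace_id, []))
        if span is not None:
            return trace_id, span
    return None
```

A None result is itself informative. A "wrong information" complaint often has no errored span at all, in which case the next step is to read the retrieval and LLM spans' recorded inputs and outputs rather than look for an exception.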

Production Failure Patterns

Agent incident triage — classify first, then execute the matching response playbook
