Debugging Production Agents

A decision-ordered guide to debugging production AI agents: choose the right observability tooling before writing custom code, build an investigation workflow that goes from user complaint to root cause in under 10 minutes, use LangGraph time-travel for deterministic replay, handle PII compliance in trace storage, and convert every incident into a regression test.

Quick Reference

  • Decision: LangSmith for LangChain/LangGraph stacks, OTel for existing APM infra, both via LangSmith OTLP endpoint
  • Investigation: complaint → trace ID → failing span → root cause in under 10 minutes via LangSmith SDK or Fetch CLI
  • Classify root causes: RETRIEVAL_MISS | STALE_DATA | HALLUCINATION | TOOL_FAILURE | CONTEXT_OVERFLOW
  • Replay: LangGraph get_state_history() + update_state() — always use idempotency guards on side-effecting tools first
  • Sampling: 100% of errors + 10–20% of successes; estimate trace storage cost before shipping to production
  • PII: hide_inputs/hide_outputs (LangSmith) or the attributes/redaction processors (OTel Collector); never store raw PII in trace metadata
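The sampling rule above can be sketched in a few lines. This is a minimal illustration, not a LangSmith or OTel API: the function names, the 15% success rate (an assumed midpoint of the 10–20% range), and the traffic figures in the usage note are all hypothetical. It keeps every error trace, samples successes deterministically by hashing the trace ID, and gives a rough storage estimate you can run before shipping.

```python
import hashlib

ERROR_SAMPLE_RATE = 1.0     # keep every failed run
SUCCESS_SAMPLE_RATE = 0.15  # assumption: midpoint of the 10-20% range


def should_keep_trace(trace_id: str, is_error: bool,
                      success_rate: float = SUCCESS_SAMPLE_RATE) -> bool:
    """Decide whether to store a trace, deterministically per trace ID."""
    if is_error:
        return True
    # Hash the trace ID into [0, 1) so every service reaches the same decision
    # for the same trace without coordinating.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


def estimate_monthly_storage_gb(requests_per_day: int, error_fraction: float,
                                avg_trace_kb: float,
                                success_rate: float = SUCCESS_SAMPLE_RATE,
                                days: int = 30) -> float:
    """Rough stored-trace volume in decimal GB for the sampling policy above."""
    errors = requests_per_day * error_fraction
    successes = requests_per_day - errors
    kept_per_day = errors * ERROR_SAMPLE_RATE + successes * success_rate
    return kept_per_day * avg_trace_kb * days / 1_000_000  # KB -> GB
```

With hypothetical inputs of 100,000 requests per day, a 2% error rate, and 50 KB per trace, `estimate_monthly_storage_gb(100_000, 0.02, 50)` comes to roughly 25 GB per month. Multiply by your backend's per-GB price and retention period to get a cost figure.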

Observability: Build, Buy, or Both?

  • Agent Execution: LLM calls · tool invocations · state transitions
  • Traces (OpenTelemetry): span every decision · trace_id links request → LLM → tools
  • Metrics: p95 latency · error rate · tokens/request · cost/conversation
  • Alert Rules: fire when metrics cross team-defined thresholds
  • Dashboards + LangSmith: human-visible · thread explorer · insights agent

Observability stack — data flows from raw execution up to human-readable dashboards
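To make the metrics and alert-rule layers of this stack concrete, here is a minimal sketch that derives p95 latency, error rate, and tokens per request from per-request records and checks them against thresholds. The `RequestRecord` shape, the metric names, and the threshold values are assumptions for illustration. In practice your APM or LangSmith computes these from trace data rather than application code.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    """One finished agent request, as it might be summarized from a trace."""
    latency_ms: float
    tokens: int
    is_error: bool


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate request records into the metrics named in the stack."""
    latencies = [r.latency_ms for r in records]
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    # Requires at least two records.
    p95 = quantiles(latencies, n=100, method="inclusive")[94]
    return {
        "p95_latency_ms": p95,
        "error_rate": sum(r.is_error for r in records) / len(records),
        "tokens_per_request": sum(r.tokens for r in records) / len(records),
    }


def fired_alerts(metrics: dict[str, float],
                 thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their team-defined threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if metrics[name] > limit)
```

A team might call `fired_alerts(summarize(records), {"p95_latency_ms": 5000, "error_rate": 0.05})` on a rolling window. The thresholds are yours to choose. Cost per conversation follows the same pattern once token counts are multiplied by your model pricing.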

Before writing a single line of tracing code, answer this question: do you already have observability infrastructure? Three paths exist for AI agent observability in 2026, and the right choice depends on your existing stack, not your agent framework. LangSmith and OpenTelemetry both have mature AI agent support with standardized span hierarchies — building a custom tracing system from scratch is almost never justified.

| Signal | LangSmith | OTel + Your APM | Both (LangSmith OTLP) |
| --- | --- | --- | --- |
| You use LangChain or LangGraph | ✓ Best choice: auto-instrumented, trace trees built in | Possible but requires manual setup | Good for large teams needing central APM |
| You already have Datadog, Grafana, or Jaeger | Use alongside: LangSmith for AI-specific views | ✓ Best choice: extend existing infra | ✓ Best choice: feed LangSmith data into your APM |
| Data must stay on-prem or single-region | Not supported (SaaS only) | ✓ Self-host Jaeger, Grafana Tempo, or Langfuse | Self-host OTel backend; LangSmith not eligible |
| Team owns ML/AI workloads only | ✓ Simplest path: one tool for traces and evals | Requires DevOps coordination | Overkill unless org scale demands it |

When custom tracing makes sense

If data sovereignty requirements prevent any third-party service from seeing your traces, build custom trace collection — but use the OpenTelemetry wire format and self-hosted backends (Jaeger, Grafana Tempo, Langfuse). Don't invent a new schema. The OTel GenAI semantic conventions (stable 2026) define standard span attributes for LLM calls, tool executions, and agent invocations — use gen_ai.operation.name, gen_ai.agent.name, gen_ai.conversation.id.
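As a sketch of what "don't invent a new schema" looks like in code, the helper below builds a span name and attribute dictionary using the three attribute keys named above. The helper itself, the agent name, and the conversation ID are hypothetical. The operation value `invoke_agent` and the "operation, then target" span-name pattern reflect the GenAI conventions as commonly documented, so check them against the semantic-conventions version you pin.

```python
from typing import Optional


def genai_span(operation: str, agent_name: str, conversation_id: str,
               target: Optional[str] = None) -> tuple[str, dict[str, str]]:
    """Return (span_name, attributes) for one agent-related span.

    Only the attribute keys come from the OTel GenAI conventions;
    this function is an illustrative helper, not part of any SDK.
    """
    attributes = {
        "gen_ai.operation.name": operation,
        "gen_ai.agent.name": agent_name,
        "gen_ai.conversation.id": conversation_id,
    }
    # Assumed naming pattern: "<operation> <target>", e.g. "invoke_agent support-agent".
    span_name = f"{operation} {target}" if target else operation
    return span_name, attributes


name, attrs = genai_span("invoke_agent", "support-agent", "conv-123",
                         target="support-agent")
# With the OpenTelemetry Python API installed, these would be passed as:
#   with tracer.start_as_current_span(name, attributes=attrs): ...
```

Keeping attribute construction in one place makes it easier to stay aligned with the conventions when they change, and any OTel-compatible backend (Jaeger, Grafana Tempo, Langfuse) can then filter on the same keys.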

This article covers debugging workflow, not tool setup

LangSmith setup and auto-instrumentation are covered in the LangSmith article. OpenTelemetry agent instrumentation is covered in the distributed tracing article. This article assumes traces exist and focuses on how to use them when something goes wrong.