Intermediate · 20 min

OpenTelemetry for Agent Tracing

How to instrument LangGraph agents with OpenTelemetry: the Collector architecture you actually need in production, updated GenAI semantic conventions, cost math for sampling decisions, and the failure modes that will bite you before you notice.

Quick Reference

→OTel is the CNCF standard for vendor-neutral distributed tracing — instrument once, export to Datadog, Grafana Tempo, Jaeger, or any OTLP-compatible backend
→Production OTel requires the OpenTelemetry Collector — direct app→backend export bypasses tail-based sampling and lacks the buffering you need when a backend goes down
→Instrument each LangGraph node as a span using gen_ai.operation.name, gen_ai.provider.name, and token-count attributes; wrap tool calls with execute_tool spans
→The GenAI semantic conventions (2026) define 4 span types: inference, embeddings, retrieval, and execute_tool — each with distinct required and recommended attributes
→Head-based sampling lives in your Python code (ParentBasedTraceIdRatio); tail-based sampling (always keep errors, sample N% of success) runs in the Collector's tailsamplingprocessor
→20 spans/invocation × 1,000 req/hr × 24 hr = 480K spans/day. Run this math before picking a sampling ratio; skipping it means surprise bills
→LangSmith now accepts OTel traces natively via LANGSMITH_OTEL_ENABLED=true — configure the Collector to export to both LangSmith and your infra backend in a single pipeline
→Never use conversation_id, user_id, or prompt content as span attributes in high-traffic systems — cardinality explosion will crash your tracing backend's index
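
The span-volume arithmetic in the list above can be sketched as a small calculator. This is a rough illustration rather than a billing model: the function names are hypothetical, traffic is assumed to be steady over 24 hours, and the 10% ratio is only an example value.

```python
def spans_per_day(spans_per_invocation: int, requests_per_hour: int) -> int:
    """Unsampled span volume for one day of steady traffic."""
    return spans_per_invocation * requests_per_hour * 24


def retained_spans_per_day(total_spans: int, sampling_ratio: float) -> int:
    """Expected spans kept when whole traces are sampled at a fixed ratio."""
    if not 0.0 <= sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be between 0.0 and 1.0")
    return round(total_spans * sampling_ratio)


# The example from the list: 20 spans per invocation at 1,000 requests per hour.
total = spans_per_day(spans_per_invocation=20, requests_per_hour=1000)
kept = retained_spans_per_day(total, sampling_ratio=0.10)  # example ratio only
print(f"unsampled: {total:,} spans/day, kept at 10%: {kept:,} spans/day")
```

Multiplying the retained figure by your backend's per-span price gives a first estimate to compare across candidate sampling ratios.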

Should You Use OTel?

Before writing a single span, decide whether OTel is the right tool for this problem. The answer depends on what you already have, not on which standard is more vendor-neutral.

LangSmith now accepts OTel traces natively — "both" costs one Collector config, not two codebases

Signal	Choose OTel	Choose LangSmith	Choose Both
Existing observability stack	Datadog, Grafana, or Jaeger already deployed	No tracing infrastructure yet	Have both or plan to add LLM debugging to existing infra
What you're debugging	Latency across services, database calls, queue processing	Prompt quality, token costs, eval regressions	Infrastructure latency AND prompt/eval issues
Team context	SRE team manages observability; agent is one microservice of many	ML team owns the agent end-to-end	Cross-functional team; both SRE and ML visibility matter
Budget	Pay per span volume (backend pricing)	Pay per trace in LangSmith	Both costs, but the Collector fan-out means one instrumentation layer

LangSmith and OTel converged in 2025

LangSmith now accepts OpenTelemetry traces natively via LANGSMITH_OTEL_ENABLED=true. You no longer have to choose one or instrument twice. Configure the OTel Collector to export to both your infra backend and LangSmith's OTLP endpoint, and you get infrastructure-wide traces plus LLM-specific debugging in the same pipeline.

▸Skip OTel if your agent is a standalone CLI or batch job with no latency SLAs — LangSmith alone is cheaper to set up
▸Skip OTel if the agent has no external service calls — a single-service agent with no databases or APIs won't benefit from distributed context propagation
▸Use OTel if the agent calls other agents, hits external APIs, or reads from databases you already instrument
▸Use OTel if your SRE team needs to see the agent's spans in the same dashboard as the rest of your backend

The Collector Pattern

Direct app→backend export is a production antipattern

Most tutorials show OTLPSpanExporter pointing directly at Datadog or Grafana. That works in development. In production it causes three problems: (1) your app blocks on export retries when the backend is slow, (2) tail-based sampling is impossible because the full trace never lands in one place before the sampling decision, and (3) switching backends requires touching app code. The Collector pattern solves all three.

Instrumenting Agent Nodes as Spans

invoke_agent is the root span — LLM calls, tool executions, and retrievals nest inside it
