OpenTelemetry for Agent Tracing
How to instrument LangGraph agents with OpenTelemetry: the Collector architecture you actually need in production, updated GenAI semantic conventions, cost math for sampling decisions, and the failure modes that will bite you before you notice.
Quick Reference
- →OTel is the CNCF standard for vendor-neutral distributed tracing — instrument once, export to Datadog, Grafana Tempo, Jaeger, or any OTLP-compatible backend
- →Production OTel requires the OpenTelemetry Collector — direct app→backend export bypasses tail-based sampling and lacks the buffering you need when a backend goes down
- →Instrument each LangGraph node as a span using gen_ai.operation.name, gen_ai.provider.name, and token-count attributes; wrap tool calls with execute_tool spans
- →The GenAI semantic conventions (2026) define 4 span types: inference, embeddings, retrieval, and execute_tool — each with distinct required and recommended attributes
- →Head-based sampling lives in your Python code (ParentBasedTraceIdRatio); tail-based sampling (always keep errors, sample N% of successes) runs in the Collector's tail_sampling processor (the tailsamplingprocessor component from collector-contrib)
- →20 spans/invocation × 1,000 req/hr × 24 hr = 480K spans/day; run this math before picking a sampling ratio, or expect surprise bills
- →LangSmith now accepts OTel traces natively via LANGSMITH_OTEL_ENABLED=true — configure the Collector to export to both LangSmith and your infra backend in a single pipeline
- →Never use conversation_id, user_id, or prompt content as span attributes in high-traffic systems — cardinality explosion will crash your tracing backend's index
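The span-volume arithmetic from the list above can be sketched in a few lines of Python. This is a rough estimate only: the function names are illustrative, and `price_per_million_spans` is a hypothetical parameter you would fill in from your own backend's pricing page.

```python
HOURS_PER_DAY = 24


def spans_per_day(spans_per_invocation: int,
                  requests_per_hour: int,
                  sample_ratio: float = 1.0) -> int:
    """Estimate how many spans reach the backend per day.

    sample_ratio is the fraction of traces kept (1.0 = no sampling).
    """
    if not 0.0 <= sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be between 0.0 and 1.0")
    raw = spans_per_invocation * requests_per_hour * HOURS_PER_DAY
    return round(raw * sample_ratio)


def monthly_cost(daily_spans: int,
                 price_per_million_spans: float,
                 days: int = 30) -> float:
    """Rough monthly bill; the price is an assumption you must supply."""
    return daily_spans * days / 1_000_000 * price_per_million_spans


# The figures from the quick reference: 20 spans x 1,000 req/hr x 24 hr.
print(spans_per_day(20, 1000))        # 480000
# The same traffic with a 10% head-based sampling ratio.
print(spans_per_day(20, 1000, 0.1))   # 48000
```

Running the numbers at a few candidate ratios before deployment makes the trade-off between trace coverage and span volume explicit.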
Should You Use OTel?
Before writing a single span, decide whether OTel is the right tool for this problem. The answer depends on what you already have, not on which standard is more vendor-neutral.
LangSmith now accepts OTel traces natively — "both" costs one Collector config, not two codebases
| Signal | Choose OTel | Choose LangSmith | Choose Both |
|---|---|---|---|
| Existing observability stack | Datadog, Grafana, or Jaeger already deployed | No tracing infrastructure yet | Have both or plan to add LLM debugging to existing infra |
| What you're debugging | Latency across services, database calls, queue processing | Prompt quality, token costs, eval regressions | Infrastructure latency AND prompt/eval issues |
| Team context | SRE team manages observability; agent is one microservice of many | ML team owns the agent end-to-end | Cross-functional team; both SRE and ML visibility matter |
| Budget | Pay per span volume (backend pricing) | Pay per trace in LangSmith | Both costs, but the Collector fan-out means one instrumentation layer |
LangSmith now accepts OpenTelemetry traces natively via LANGSMITH_OTEL_ENABLED=true. You no longer have to choose one or instrument twice. Configure the OTel Collector to export to both your infra backend and LangSmith's OTLP endpoint, and you get infrastructure-wide traces plus LLM-specific debugging in the same pipeline.
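As a minimal sketch of that fan-out, the Python below builds a Collector configuration with one traces pipeline and two exporters, then prints it as JSON (which YAML parsers accept, so the output can be saved as a Collector config file). Several values are assumptions to verify against your own setup: the LangSmith OTLP endpoint and `x-api-key` header, the `LANGSMITH_API_KEY` environment variable name, and the hypothetical in-cluster backend address `tempo:4317`.

```python
import json


def build_collector_config(langsmith_endpoint: str, backend_endpoint: str) -> dict:
    """Return a Collector config dict that exports traces to two backends."""
    return {
        # Accept OTLP from the application over gRPC and HTTP (default ports).
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        # Batch spans before export to reduce outbound requests.
        "processors": {"batch": {}},
        "exporters": {
            # Assumed LangSmith OTLP/HTTP endpoint; key is read from the environment.
            "otlphttp/langsmith": {
                "endpoint": langsmith_endpoint,
                "headers": {"x-api-key": "${env:LANGSMITH_API_KEY}"},
            },
            # Hypothetical infra backend reached over plaintext gRPC in-cluster.
            "otlp/backend": {
                "endpoint": backend_endpoint,
                "tls": {"insecure": True},
            },
        },
        "service": {
            "pipelines": {
                "traces": {
                    "receivers": ["otlp"],
                    "processors": ["batch"],
                    # One pipeline, two destinations: this is the fan-out.
                    "exporters": ["otlphttp/langsmith", "otlp/backend"],
                }
            }
        },
    }


config = build_collector_config(
    langsmith_endpoint="https://api.smith.langchain.com/otel",  # assumption: verify in LangSmith docs
    backend_endpoint="tempo:4317",  # hypothetical backend address
)
print(json.dumps(config, indent=2))
```

The application still exports to a single OTLP endpoint (the Collector), so adding or removing a backend is a config change rather than a code change.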
- ▸Skip OTel if your agent is a standalone CLI or batch job with no latency SLAs — LangSmith alone is cheaper to set up
- ▸Skip OTel if the agent has no external service calls — a single-service agent with no databases or APIs won't benefit from distributed context propagation
- ▸Use OTel if the agent calls other agents, hits external APIs, or reads from databases you already instrument
- ▸Use OTel if your SRE team needs to see the agent's spans in the same dashboard as the rest of your backend
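The skip/use rules above can be written down as a small decision helper. This is only an illustration of the reasoning, not a definitive policy: the `AgentProfile` fields and the `recommend` function are hypothetical names, and the "both" branch follows the table's "Infrastructure latency AND prompt/eval issues" column.

```python
from dataclasses import dataclass


@dataclass
class AgentProfile:
    """Hypothetical summary of the signals discussed in this section."""
    calls_external_services: bool       # other agents, external APIs, instrumented databases
    sre_needs_shared_dashboards: bool   # SRE wants agent spans next to the rest of the backend
    needs_prompt_debugging: bool        # prompt quality, token costs, eval regressions


def recommend(profile: AgentProfile) -> str:
    """Return 'otel', 'langsmith', or 'both' based on the rules above."""
    wants_otel = (profile.calls_external_services
                  or profile.sre_needs_shared_dashboards)
    if wants_otel and profile.needs_prompt_debugging:
        return "both"
    if wants_otel:
        return "otel"
    # Standalone agent with nothing to propagate context to: skip OTel.
    return "langsmith"


# A standalone batch agent owned by an ML team.
print(recommend(AgentProfile(False, False, True)))   # langsmith
# A microservice agent that calls APIs and also needs eval debugging.
print(recommend(AgentProfile(True, True, True)))     # both
```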