Advanced · 12 min

Distributed Tracing

Cross-service trace propagation for multi-service agents — decide if you need it, choose between LangSmith-native and OTel approaches, link traces across HTTP boundaries, and control costs with sampling.

Quick Reference

→Single-process agents don't need distributed tracing — LangSmith already has the complete trace
→LangSmith-native: @traceable + to_headers() + TracingMiddleware — simplest path if you're already on LangSmith
→OTel-native: LANGSMITH_OTEL_ENABLED=true emits W3C traceparent headers for existing APM platforms
→Multi-agent systems need explicit callback/config inheritance: don't count on asyncio.create_task() to carry trace context for you, especially on Python versions before 3.11
→Start with 10% head-based sampling + always-trace-errors before adding tail-based complexity
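The sampling rule in the last point can be sketched in a few lines. This is a minimal illustration, not a LangSmith or OTel API: the function names `head_sampled` and `should_export` and the SHA-256 bucketing scheme are assumptions made for the example.

```python
import hashlib

SAMPLE_RATE = 0.10  # 10% head-based sampling, the starting point suggested above


def head_sampled(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Decide up front whether a request is traced.

    Hashing the trace ID (rather than calling random()) means every service
    that sees the same ID reaches the same decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def should_export(trace_id: str, had_error: bool, rate: float = SAMPLE_RATE) -> bool:
    """Keep every failed request; keep roughly `rate` of the successful ones."""
    return had_error or head_sampled(trace_id, rate)
```

Note that the error override can only be evaluated once the request has finished, so a service using this rule has to hold its spans locally until then.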

Do You Need Distributed Tracing?

Before adding distributed tracing infrastructure, answer one question: does your agent make HTTP calls to other services you control? If no — if it runs in a single process — LangSmith already gives you a complete trace of every LLM call, tool invocation, and chain step. Distributed tracing adds complexity with zero observability gain in that case.

[Diagram: decision tree for whether you need distributed tracing (not yet available)]

When distributed tracing is overkill

Single-process agents (even complex ones with 10+ tools), prototype environments, single-service FastAPI apps that call LLMs directly, and systems under ~100 requests/day rarely need cross-service trace propagation. The added infrastructure cost outweighs the visibility gain.

Distributed tracing earns its cost when your agent spans multiple services: an API gateway, an orchestration layer, and a retrieval service, each potentially making its own LLM calls. Without it, you see each service's trace in isolation. You cannot answer "why was this request slow?" when the bottleneck is three services deep.

▸API gateway latency (auth, rate limiting) is invisible to LangSmith — correlated traces make it visible
▸A tool call hitting an external service creates a gap in LangSmith traces — backend spans fill that gap
▸Multi-agent orchestration that spans HTTP services makes cross-agent failure debugging impossible without trace correlation
▸Cost attribution to specific users or workflows requires end-to-end traces that cross service boundaries

LangSmith Native vs OTel: Pick Your Path

Two valid approaches exist. LangSmith-native propagation uses LangSmith's own header format and keeps everything in the LangSmith UI. OTel-native propagation uses W3C standard headers and routes traces to your existing APM platform (Datadog, Grafana Tempo, Jaeger). The choice depends on where your team already looks at traces — not which is objectively better.
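To make the W3C header concrete: a version-00 `traceparent` value is four lowercase hex fields joined by hyphens (version, 32-character trace ID, 16-character parent span ID, 2-character flags). The sketch below parses one and derives the header a downstream call would carry. The helper names `parse_traceparent` and `child_traceparent` are hypothetical; in practice an OTel SDK does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version "00": 32 hex trace-id, 16 hex parent-id, 2 hex trace-flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"
```

The trace ID stays constant across every hop, which is what lets an APM platform stitch spans from different services into one trace.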

Cross-Service Propagation with LangSmith

RunTree manual construction is not recommended

Older tutorials show RunTree with .post()/.patch() and custom header names like x-langsmith-run-id. LangSmith's own docs now say this pattern is 'not recommended for most use cases' — it's error-prone and the header names are non-standard. Use @traceable with to_headers() instead.
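The pattern that @traceable, to_headers() and TracingMiddleware automate can be sketched without any library: the caller serializes the current span's identity into request headers, and the callee restores it before starting its own work. What follows is a conceptual, stdlib-only illustration and not LangSmith's implementation. The header name `x-demo-trace`, the `Span` class, and the helpers `start_span`, `inject_headers` and `extract_headers` are all hypothetical.

```python
import contextvars
import uuid
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

TRACE_HEADER = "x-demo-trace"  # hypothetical header name, not LangSmith's


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str


# Holds the active span for the current execution context.
_current = contextvars.ContextVar("current_span", default=None)


def start_span(name: str) -> Span:
    """Open a span as a child of whatever span is currently active."""
    parent = _current.get()
    span = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        name=name,
    )
    _current.set(span)
    return span


def inject_headers() -> Dict[str, str]:
    """Caller side: serialize the active span into outgoing headers."""
    span = _current.get()
    return {TRACE_HEADER: f"{span.trace_id}:{span.span_id}"} if span else {}


def extract_headers(headers: Mapping[str, str]) -> None:
    """Callee side: make the remote span the parent of local spans."""
    raw = headers.get(TRACE_HEADER)
    if raw:
        trace_id, span_id = raw.split(":", 1)
        _current.set(Span(trace_id, span_id, None, "remote-parent"))


def gateway_request():
    """Simulated upstream service: open a span, then build outgoing headers."""
    root = start_span("gateway")
    return root, inject_headers()


def retrieval_handler(headers: Mapping[str, str]) -> Span:
    """Simulated downstream service: restore the parent, then open a span."""
    extract_headers(headers)
    return start_span("retrieval")
```

Running `gateway_request` and `retrieval_handler` in separate contexts (for example via `contextvars.copy_context().run(...)`) mimics two processes: the downstream span ends up with the upstream trace ID and the upstream span as its parent, which is the link a trace UI needs to draw one tree.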
