Advanced · 14 min

LangSmith Datasets & Experiments

Build versioned eval datasets from production traces, write evaluators that actually measure correctness, run experiments to prove prompt changes work, and gate deploys on regression — the full LangSmith evaluation workflow.

Quick Reference

  • Datasets are versioned collections of input/output examples that your agent is systematically evaluated against
  • Build from production traces first — synthetic examples miss the patterns that actually break in production
  • SDK v0.2 evaluators use (inputs, outputs, reference_outputs) dicts, not (run, example) objects
  • Experiments call evaluate() against a dataset, score each example, and store results for comparison
  • Gate deploys: run experiments in CI and fail the build if any evaluator drops below a defined threshold
  • Annotation queues let domain experts review traces and build high-quality ground-truth datasets
  • Dataset contamination and evaluator gaming are the two most common ways eval scores lie
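
As a rough illustration of the evaluator and deploy-gate bullets above, here is a minimal sketch in plain Python. It deliberately does not import the LangSmith SDK, so it runs anywhere. The evaluator follows the (inputs, outputs, reference_outputs) dict style, and the gate fails when an evaluator's mean score drops below its threshold. The names exact_match, failing_evaluators, run_gate, toy_agent, and EXAMPLES, the "answer" field, and the 0.9 threshold are all hypothetical choices for this example, not part of LangSmith.

```python
from statistics import mean
from typing import Callable

def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Evaluator in the (inputs, outputs, reference_outputs) style.

    Assumes both dicts carry an "answer" field (a hypothetical schema).
    """
    predicted = str(outputs.get("answer", "")).strip().lower()
    expected = str(reference_outputs.get("answer", "")).strip().lower()
    return {"key": "exact_match", "score": float(predicted == expected)}

def failing_evaluators(rows: list[dict], thresholds: dict[str, float]) -> list[str]:
    """Return one message per evaluator whose mean score is below its threshold."""
    scores: dict[str, list[float]] = {}
    for row in rows:
        scores.setdefault(row["key"], []).append(row["score"])
    failures = []
    for key, minimum in thresholds.items():
        # An evaluator that produced no scores counts as 0.0, so it fails the gate.
        avg = mean(scores[key]) if key in scores else 0.0
        if avg < minimum:
            failures.append(f"{key}: {avg:.2f} < {minimum:.2f}")
    return failures

def run_gate(examples: list[dict], target: Callable[[dict], dict],
             thresholds: dict[str, float]) -> int:
    """Score every example and return a process exit code (0 = pass, 1 = fail)."""
    rows = [exact_match(ex["inputs"], target(ex["inputs"]), ex["outputs"])
            for ex in examples]
    failures = failing_evaluators(rows, thresholds)
    for message in failures:
        print(f"REGRESSION {message}")
    return 1 if failures else 0

# Toy stand-ins for a real dataset and agent (both hypothetical).
EXAMPLES = [
    {"inputs": {"question": "capital of France?"}, "outputs": {"answer": "Paris"}},
    {"inputs": {"question": "2 + 2?"}, "outputs": {"answer": "4"}},
]

def toy_agent(inputs: dict) -> dict:
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return {"answer": answers.get(inputs["question"], "")}

# In a CI script you might end with:
#   import sys; sys.exit(run_gate(EXAMPLES, toy_agent, {"exact_match": 0.9}))
```

In a real setup the scoring loop would be replaced by a LangSmith experiment run with evaluate() against a stored dataset, with the gate applied to the scores it reports. Keeping the threshold check as a separate pure function makes the pass/fail rule easy to unit test on its own.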

When to Use Experiments (and When Not To)

LangSmith experiments are worth setting up when prompt changes or model swaps are frequent, when regressions are hard to catch manually, or when you need to prove to stakeholders that a change improves behavior. The investment pays off roughly once you're making more than one prompt change per week or when a regression in production would be costly.

| Situation | Use experiments? | Why |
|---|---|---|
| You change prompts weekly | Yes | Each change can silently regress edge cases |
| You're swapping models or providers | Yes | Model behavior differences need quantification |
| Simple deterministic pipeline, no LLM output variation | No | Unit tests are cheaper and sufficient |
| Early prototype, prompt changes 5x/day | Not yet | Dataset will go stale faster than you update it |
| Regulated output (medical, legal, financial) | Yes (required) | Need reproducible proof of quality over time |

Don't start with eval if your core loop isn't stable

A dataset built against a prompt you'll rewrite in three days is wasted effort. Stabilize your agent's inputs, outputs, and task scope first, then invest in a dataset once the schema is locked.