LangSmith Datasets & Experiments
Build versioned eval datasets from production traces, write evaluators that actually measure correctness, run experiments to prove prompt changes work, and gate deploys on regression — the full LangSmith evaluation workflow.
Quick Reference
- Datasets are versioned collections of input/output examples that your agent is evaluated against systematically
- Build from production traces first; synthetic examples miss the patterns that actually break in production
- SDK v0.2 evaluators can take (inputs, outputs, reference_outputs) dicts instead of (run, example) objects
- Experiments call evaluate() against a dataset, score each example, and store results for comparison
- Gate deploys: run experiments in CI and fail the build if any evaluator's score drops below a defined threshold
- Annotation queues let domain experts review traces and build high-quality ground-truth datasets
- Dataset contamination and evaluator gaming are the two most common ways eval scores lie
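To make the evaluator signature and the deploy gate concrete, here is a minimal sketch in plain Python. It does not call LangSmith: the `exact_match` evaluator, the `gate` helper, the `answer` field, the 0.8 threshold, and the three sample rows are all hypothetical stand-ins, and the `{"key": ..., "score": ...}` return shape is an assumption about how scores are reported. In a real setup the evaluator would be passed to evaluate() and the scores read back from the experiment results.

```python
from statistics import mean


def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Evaluator written in the (inputs, outputs, reference_outputs) dict style.

    Assumes both the agent output and the reference carry an "answer" field
    (a hypothetical schema chosen for this sketch).
    """
    predicted = str(outputs.get("answer", "")).strip().lower()
    expected = str(reference_outputs.get("answer", "")).strip().lower()
    return {"key": "exact_match", "score": float(predicted == expected)}


def gate(results: list[dict], thresholds: dict[str, float]) -> list[str]:
    """Return one failure message per evaluator whose mean score is too low."""
    failures = []
    for key, minimum in thresholds.items():
        scores = [r["score"] for r in results if r["key"] == key]
        if not scores:
            # A missing evaluator should fail the gate, not silently pass.
            failures.append(f"{key}: no scores recorded")
            continue
        average = mean(scores)
        if average < minimum:
            failures.append(f"{key}: mean {average:.2f} < threshold {minimum:.2f}")
    return failures


# Hypothetical rows standing in for dataset examples plus agent outputs:
# (inputs, outputs, reference_outputs)
examples = [
    ({"question": "2+2?"}, {"answer": "4"}, {"answer": "4"}),
    ({"question": "Capital of France?"}, {"answer": "paris "}, {"answer": "Paris"}),
    ({"question": "Largest planet?"}, {"answer": "Saturn"}, {"answer": "Jupiter"}),
]

results = [exact_match(i, o, r) for i, o, r in examples]
failures = gate(results, {"exact_match": 0.8})
print(failures or "gate passed")
```

In a CI job, a non-empty `failures` list would typically be turned into a non-zero exit code so the build fails. Treating an evaluator with no recorded scores as a failure is a deliberate choice here: a renamed or dropped evaluator should not let a regression through.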
When to Use Experiments (and When Not To)
LangSmith experiments are worth setting up when prompt changes or model swaps are frequent, when regressions are hard to catch manually, or when you need to prove to stakeholders that a change improves behavior. The investment pays off roughly once you're making more than one prompt change per week or when a regression in production would be costly.
| Situation | Use experiments? | Why |
|---|---|---|
| You change prompts weekly | Yes | Each change can silently regress edge cases |
| You're swapping models or providers | Yes | Model behavior differences need quantification |
| Simple deterministic pipeline, no LLM output variation | No | Unit tests are cheaper and sufficient |
| Early prototype, prompt changes 5x/day | Not yet | Dataset will go stale faster than you update it |
| Regulated output (medical, legal, financial) | Yes — required | Need reproducible proof of quality over time |
A dataset built against a prompt you'll rewrite in three days is wasted effort. Stabilize your agent's inputs, outputs, and task scope first, then invest in a dataset once that schema is locked.