Advanced · 11 min
LangSmith Datasets & Experiments
Systematic evaluation with versioned datasets — create from traces, CSV, or manual entry, run experiments, compare results across prompt versions, and build annotation workflows.
Quick Reference
- Datasets are collections of input/output examples used to evaluate agent behavior systematically
- Create datasets from production traces, CSV uploads, or manual entry in the LangSmith UI
- Experiments run your agent against a dataset and score each example with custom evaluators
- Compare experiments side-by-side to measure the impact of prompt changes, model swaps, or code updates
- Dataset versioning lets you track how your eval set evolves and pin experiments to specific versions
Creating Datasets
A dataset is a collection of examples, each with an input and an optional expected output (the reference). The input matches what your agent receives; the output is the ground truth you evaluate against. LangSmith supports three creation paths: programmatic creation from production traces, CSV upload, and manual entry in the UI.
Create a dataset from production traces
Upload a dataset from CSV
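A sketch of the CSV path, assuming a file with `question` and `answer` columns that map to input and output keys. The `validate_csv` helper is our own pre-flight check, not part of the SDK; `Client.upload_csv` is the SDK call that performs the import:

```python
import csv
import io
import os

def validate_csv(text, input_keys, output_keys):
    """Return any columns LangSmith would need (as input or output keys)
    that are missing from the CSV header."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    return [k for k in input_keys + output_keys if k not in header]

csv_text = "question,answer\nWhat is the capital of France?,Paris\n"
assert validate_csv(csv_text, ["question"], ["answer"]) == []

# Guarded so the sketch runs without credentials; assumes `pip install langsmith`.
if os.environ.get("LANGSMITH_API_KEY"):
    from langsmith import Client

    with open("examples.csv", "w") as f:
        f.write(csv_text)

    client = Client()
    dataset = client.upload_csv(
        csv_file="examples.csv",
        input_keys=["question"],   # columns mapped to example inputs
        output_keys=["answer"],    # columns mapped to reference outputs
        name="csv-import-demo",    # hypothetical dataset name
        description="Examples imported from CSV",
    )
```

Any CSV column not listed in `input_keys` or `output_keys` is ignored, so validating the header up front catches silent mapping mistakes.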
Start with 20-50 high-quality examples
A small, curated dataset beats a large noisy one. Hand-pick examples that cover your core use cases, known edge cases, and past failures. You can always expand later with automation rules that sample production traces.