Integrations / Observability · Advanced · 11 min

LangSmith Datasets & Experiments

Systematic evaluation with versioned datasets — create from traces, CSV, or manual entry, run experiments, compare results across prompt versions, and build annotation workflows.

Quick Reference

  • Datasets are collections of input/output examples used to evaluate agent behavior systematically
  • Create datasets from production traces, CSV uploads, or manual entry in the LangSmith UI
  • Experiments run your agent against a dataset and score each example with custom evaluators
  • Compare experiments side-by-side to measure the impact of prompt changes, model swaps, or code updates
  • Dataset versioning lets you track how your eval set evolves and pin experiments to specific versions

Creating Datasets

A dataset is a collection of examples, each with an input and an optional expected output (reference). The input matches what your agent receives; the output is the ground truth you evaluate against. LangSmith supports three creation paths: programmatic from traces, CSV upload, and manual entry in the UI.

Create a dataset from production traces
Upload a dataset from CSV
Start with 20-50 high-quality examples

A small, curated dataset beats a large noisy one. Hand-pick examples that cover your core use cases, known edge cases, and past failures. You can always expand later with automation rules that sample production traces.