Advanced · 12 min

Evaluation & Testing

How to evaluate agent quality: LangSmith datasets, LLM-as-judge scoring, regression testing, and CI/CD integration for agents.

Quick Reference

  • Build evaluation datasets from production traces — real user queries are the best test cases (see the dataset sketch after this list)
  • Use LLM-as-judge for subjective quality: a strong model scores agent outputs on defined rubrics
  • Track key metrics over time: task completion rate, tool selection accuracy, average turns to completion
  • Run evals in CI: block merges that degrade task completion rate below a threshold (e.g., 90%)
  • Use LangSmith's comparison view to diff agent versions side-by-side on the same dataset
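
The first point is the least glamorous and the most valuable: pull real queries out of your tracing project and freeze them as examples. A minimal sketch, assuming the LangSmith Python SDK; the project name, dataset name, and filters are illustrative, and the client API changes between releases, so check the current LangSmith docs:

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# Hypothetical dataset name -- pick something versioned so you can diff later.
dataset = client.create_dataset(
    dataset_name="agent-eval-v1",
    description="Curated production queries for regression testing",
)

# Copy recent, error-free root runs from the production project into the dataset.
# Review the reference outputs before trusting them as ground truth.
for run in client.list_runs(
    project_name="my-agent-prod",  # hypothetical project name
    is_root=True,
    error=False,
    limit=200,
):
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_id=dataset.id,
    )
```

Curate before you commit: drop duplicates, redact anything sensitive, and keep the dataset small enough that a full eval run fits comfortably in a CI job.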

Why Eval Matters

Vibes-based testing does not scale

Without quantitative evaluation, every prompt change is a gamble. You cannot tell if your agent improved or regressed until users complain. Evals convert intuition into metrics.

Agent evaluation is fundamentally harder than traditional software testing because outputs are non-deterministic, quality is subjective, and failure modes are emergent. A passing unit test does not mean your agent will handle real user queries well. You need a dedicated evaluation pipeline that runs on every change.

Evaluation pipeline (diagram): production traces → eval dataset (curated samples) → agent run (re-execute tasks) → LLM judge (score outputs) → CI gate (block on regression) → metrics dashboard (trends); a failing eval blocks the deploy.
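
The CI gate at the end of that pipeline is usually just a script: re-run the agent over the eval dataset, score each output, and fail the build when the aggregate metric drops below the agreed threshold (90% here, matching the Quick Reference). A minimal sketch in plain Python; `run_agent` is a hypothetical entry point into your agent, and the completion check shown is a deterministic exact-match stand-in for whatever scorer you actually use:

```python
import json
import sys

COMPLETION_THRESHOLD = 0.90  # block the merge below 90% task completion


def run_agent(query: str) -> dict:
    """Hypothetical entry point: execute the agent once and return its final output."""
    raise NotImplementedError("wire this to your agent")


def task_completed(output: dict, expected: dict) -> bool:
    """Deterministic check: did the agent produce the expected final answer?

    Swap in schema validation or an LLM-as-judge call for open-ended tasks.
    """
    return output.get("answer", "").strip() == expected.get("answer", "").strip()


def main() -> int:
    with open("eval_dataset.jsonl") as f:
        examples = [json.loads(line) for line in f]

    completed = sum(
        task_completed(run_agent(ex["inputs"]["query"]), ex["outputs"])
        for ex in examples
    )
    rate = completed / len(examples)
    print(f"task completion rate: {rate:.1%} ({completed}/{len(examples)})")

    # A non-zero exit code fails the CI job, which blocks the deploy.
    return 0 if rate >= COMPLETION_THRESHOLD else 1


if __name__ == "__main__":
    sys.exit(main())
```

Wire this in as a required check so a regression shows up as a failed build, not a support ticket.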

  • Deterministic evals (exact match, regex, schema validation) for structured outputs
  • LLM-as-judge for open-ended quality assessment with scoring rubrics (see the judge sketch after this list)
  • Human eval as the ground truth for calibrating your automated eval pipeline
  • Production traces as the source of truth for what users actually ask
  • Regression testing to catch quality degradation before it reaches users
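
For the judge itself, the important part is the rubric: state what each score means and force structured output so scores can be aggregated. A minimal sketch, assuming the OpenAI Python SDK; the model name, the 1–5 scale, and the rubric text are illustrative, not prescriptive:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """You are grading an AI agent's answer. Score it from 1 to 5:
5 = fully completes the task, accurate, grounded in the tool results
3 = partially completes the task or contains minor errors
1 = wrong, off-topic, or ignores the user's request
Respond with JSON: {"score": <int>, "reasoning": "<one sentence>"}"""


def judge(question: str, answer: str, model: str = "gpt-4o") -> dict:
    """Score one agent output against the rubric with a stronger model."""
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAgent answer:\n{answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Before the judge gates anything, calibrate it: score a few dozen outputs by hand and check that the judge's scores track yours, which is exactly what the human-eval bullet above is for.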