Advanced · 12 min
Evaluation & Testing
How to evaluate agent quality: LangSmith datasets, LLM-as-judge scoring, regression testing, and CI/CD integration for agents.
Quick Reference
- Build evaluation datasets from production traces — real user queries are the best test cases
- Use LLM-as-judge for subjective quality: a strong model scores agent outputs on defined rubrics
- Track key metrics over time: task completion rate, tool selection accuracy, average turns to completion
- Run evals in CI: block merges that degrade task completion rate below a threshold (e.g., 90%)
- Use LangSmith's comparison view to diff agent versions side-by-side on the same dataset
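The first point — curating a dataset from production traces — can be sketched as a small filtering step before upload. The trace shape below (`input`, `output`, `error`, `timestamp` keys) is a hypothetical export format, not a LangSmith schema; adapt it to whatever your trace export actually contains.

```python
# Sketch: curate eval examples from exported production traces.
# Trace dict shape ("input", "output", "error", "timestamp") is a
# hypothetical assumption; adapt to your export format.

def traces_to_examples(traces, max_examples=50):
    """Dedupe successful traces by normalized input, newest first."""
    seen = set()
    examples = []
    for trace in sorted(traces, key=lambda t: t["timestamp"], reverse=True):
        if trace.get("error"):
            continue  # failed runs make poor reference outputs
        key = trace["input"].strip().lower()
        if key in seen:
            continue  # skip near-duplicate user queries
        seen.add(key)
        examples.append({
            "inputs": {"question": trace["input"]},
            "outputs": {"answer": trace["output"]},
        })
        if len(examples) >= max_examples:
            break
    return examples

# Upload with the LangSmith client (requires LANGSMITH_API_KEY), e.g.:
#   from langsmith import Client
#   client = Client()
#   dataset = client.create_dataset(dataset_name="prod-queries-v1")
#   for ex in traces_to_examples(raw_traces):
#       client.create_example(inputs=ex["inputs"], outputs=ex["outputs"],
#                             dataset_id=dataset.id)
```

Deduplication matters here: production traffic is heavily skewed toward a few query patterns, and an undeduplicated dataset overweights them in every metric you compute later.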
Why Eval Matters
Vibes-based testing does not scale
Without quantitative evaluation, every prompt change is a gamble. You cannot tell if your agent improved or regressed until users complain. Evals convert intuition into metrics.
Agent evaluation is fundamentally harder than traditional software testing because outputs are non-deterministic, quality is subjective, and failure modes are emergent. A passing unit test does not mean your agent will handle real user queries well. You need a dedicated evaluation pipeline that runs on every change.
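The CI gate from the quick reference can be sketched as a small aggregation step at the end of the pipeline. The per-example result shape (`{"completed": bool}`) is an illustrative assumption; adapt it to whatever your eval runner emits.

```python
# Sketch of a CI regression gate: aggregate per-example eval results and
# fail the build when task completion drops below a threshold.
# Result shape {"completed": bool} is a hypothetical assumption.
import sys

COMPLETION_THRESHOLD = 0.90  # the 90% gate suggested above

def completion_rate(results):
    """Fraction of eval examples the agent completed successfully."""
    if not results:
        raise ValueError("no eval results; refusing to pass an empty run")
    completed = sum(1 for r in results if r["completed"])
    return completed / len(results)

def ci_gate(results):
    """Return True if the run clears the threshold; print the rate either way."""
    rate = completion_rate(results)
    print(f"task completion: {rate:.1%} (threshold {COMPLETION_THRESHOLD:.0%})")
    return rate >= COMPLETION_THRESHOLD

# In a CI script: sys.exit(0 if ci_gate(load_results()) else 1)
```

Note the empty-run guard: a misconfigured eval job that produces zero results should fail loudly rather than pass vacuously.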
[Diagram] Eval pipeline: traces → dataset → judge scores → CI gate blocks regressions
- Deterministic evals (exact match, regex, schema validation) for structured outputs
- LLM-as-judge for open-ended quality assessment with scoring rubrics
- Human eval for calibrating your automated eval pipeline: the ground truth
- Production traces as the source of truth for what users actually ask
- Regression testing to catch quality degradation before it reaches users
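The LLM-as-judge layer above can be sketched as a rubric loop. The judge call is injected so the sketch runs offline; the rubric criteria and the PASS/FAIL reply convention are illustrative assumptions, not a LangSmith API.

```python
# Sketch of an LLM-as-judge scorer with a defined rubric. The judge model
# call is injected (call_judge) so this runs offline; swap in a real model
# client. Criteria and the PASS/FAIL parsing convention are assumptions.

RUBRIC = [
    "Answer directly addresses the user's question",
    "No fabricated facts or tool outputs",
    "Response is concise, with no filler or repetition",
]

def judge_output(question, answer, call_judge):
    """Score one agent answer against each criterion; return 0.0-1.0."""
    passed = 0
    for criterion in RUBRIC:
        prompt = (
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            f"Criterion: {criterion}\n"
            f"Reply with exactly PASS or FAIL."
        )
        verdict = call_judge(prompt).strip().upper()
        if verdict == "PASS":  # anything else counts as a fail
            passed += 1
    return passed / len(RUBRIC)
```

Scoring one criterion per call, rather than asking for a single holistic grade, tends to be easier to calibrate against the human evals listed above: each criterion's pass rate can be checked independently against human labels.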