Evaluation & Testing
The senior engineer's operational guide to agent evaluation: when to eval vs when not to, which method for each output type, real cost math, building datasets from production traces, CI/CD integration you own, production monitoring, and the failure modes of eval systems themselves.
Quick Reference
- →Match eval investment to traffic — manual spot-checks are appropriate for under 100 daily users; automated pipelines earn their cost above 1,000
- →Deterministic eval (schema, regex, exact match) covers structured outputs for $0 — start here before adding any LLM judges
- →LLM-as-judge with Sonnet 4.6 costs ~$0.0075/case; with Haiku it drops to ~$0.002/case — pick your judge model based on your weekly eval budget
- →Build golden datasets from production traces filtered by failures, complaints, and edge cases — not random samples
- →Run a 50-case smoke suite on every PR (under 5 min); run the full 500-case suite nightly to amortize judge costs
- →Online eval: code evaluators on 100% of traffic, LLM judge on 5–10%. At 10K req/day with Sonnet (~$0.0075/case), judging a 10% sample costs about $225/month; judging 100% of traffic costs about $2,250/month
- →Eval systems fail silently through judge drift, dataset staleness, and narrow coverage — audit monthly
- →First 30 days: manual review (week 1), deterministic CI (week 2), LLM judge (week 3), production monitoring (week 4)
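The judge-cost figures above are simple multiplication, and it can help to keep the arithmetic in a small helper so budgets are recomputed when traffic or sample rates change. The sketch below is illustrative only: the function names, the 30-day month, and the `JUDGE_COST_PER_CASE` table are assumptions, and the per-case prices are the approximate figures quoted in this list, not authoritative pricing.

```python
# Approximate per-case judge costs quoted in the Quick Reference (USD).
# These are the guide's estimates, not official pricing.
JUDGE_COST_PER_CASE = {"sonnet": 0.0075, "haiku": 0.002}


def monthly_judge_cost(requests_per_day: int, sample_rate: float,
                       cost_per_case: float, days: int = 30) -> float:
    """Estimated monthly spend for LLM-judging a sampled fraction of traffic."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return requests_per_day * sample_rate * cost_per_case * days


def suite_run_cost(cases: int, cost_per_case: float) -> float:
    """Cost of one offline suite run in which every case is judged once."""
    return cases * cost_per_case


if __name__ == "__main__":
    sonnet = JUDGE_COST_PER_CASE["sonnet"]
    # 10K requests/day, judged at a 10% sample vs. 100% of traffic.
    print(round(monthly_judge_cost(10_000, 0.10, sonnet), 2))
    print(round(monthly_judge_cost(10_000, 1.00, sonnet), 2))
    # One nightly run of the 500-case suite.
    print(round(suite_run_cost(500, sonnet), 2))
```

At the quoted Sonnet rate, 10,000 requests per day works out to roughly $225/month at a 10% sample and roughly $2,250/month at 100%, while a single 500-case nightly run costs under $4.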
Should You Even Eval? The Over-Engineering Trap
Without quantitative evaluation, every prompt change is a gamble. You cannot tell if your agent improved or regressed until users complain. But over-engineering is the opposite trap: a full continuous eval pipeline costs engineering time and real money. Match your investment to your actual risk and traffic.
The first decision in agent evaluation is not which framework to use — it is whether to build automated evaluation at all. A production agent serving 50 users with infrequent prompt changes does not need a CI eval pipeline. A customer-facing agent at 10,000 daily users does. The right question is: what is the cost of a regression reaching users versus the cost of building and maintaining eval infrastructure?
| Stage | Daily Users | What to Build | Weekly Time | Weekly Cost |
|---|---|---|---|---|
| Prototype | < 100 | Manual spot-check 10 random traces per day | 30 min | $0 |
| Early Production | 100–1K | 50-case golden set + deterministic CI checks | 2 hours | $0–5 |
| Growth | 1K–10K | LLM-as-judge + CI quality gates + production sampling | 4 hours | $20–100 |
| Scale | > 10K | Continuous pipeline + online eval + alerting | Ongoing | $200+ |
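One way to keep this table actionable is to encode it as a lookup that a team can revisit as traffic grows. This is a minimal sketch, not a prescribed tool: `EvalStage` and `stage_for` are hypothetical names, and the handling of the boundary values (exactly 1,000 and exactly 10,000 daily users both map to Growth) is an assumption, since the table's ranges overlap at their edges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalStage:
    name: str
    what_to_build: str
    weekly_cost_usd: str  # copied verbatim from the table


# (exclusive upper bound on daily users, stage), in ascending order.
# Boundary handling is an assumption: 1,000 and 10,000 both fall in Growth.
_STAGES = [
    (100, EvalStage("Prototype",
                    "Manual spot-check 10 random traces per day", "$0")),
    (1_000, EvalStage("Early Production",
                      "50-case golden set + deterministic CI checks", "$0–5")),
    (10_001, EvalStage("Growth",
                       "LLM-as-judge + CI quality gates + production sampling",
                       "$20–100")),
]
_SCALE = EvalStage("Scale",
                   "Continuous pipeline + online eval + alerting", "$200+")


def stage_for(daily_users: int) -> EvalStage:
    """Return the table row that matches a daily-user count."""
    if daily_users < 0:
        raise ValueError("daily_users must be non-negative")
    for upper_bound, stage in _STAGES:
        if daily_users < upper_bound:
            return stage
    return _SCALE


if __name__ == "__main__":
    for users in (50, 400, 10_000, 25_000):
        stage = stage_for(users)
        print(f"{users:>6} daily users -> {stage.name}: {stage.what_to_build}")
```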
If your agent outputs structured data — JSON, SQL, code, classification labels — deterministic validation catches 80% of failures for $0 in judge costs. Audit your agent's output types before reaching for LLM-as-judge.
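As a rough illustration of what deterministic validation can look like, the sketch below implements the three check types named earlier (schema, regex, exact match) using only the Python standard library. The function names, the `TICKET_SCHEMA` example, and the ticket-ID pattern are hypothetical; a real project might prefer a dedicated schema validator, and the type check here is deliberately shallow (it only inspects top-level keys).

```python
import json
import re


def check_json_schema(output: str, required: dict[str, type]) -> list[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    failures = []
    for key, expected_type in required.items():
        if key not in data:
            failures.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            failures.append(
                f"wrong type for {key}: expected {expected_type.__name__}"
            )
    return failures


def check_regex(output: str, pattern: str) -> bool:
    """True if the whole (stripped) output matches the pattern."""
    return re.fullmatch(pattern, output.strip()) is not None


def check_exact(output: str, expected: str) -> bool:
    """Whitespace- and case-insensitive exact match, e.g. for class labels."""
    return output.strip().casefold() == expected.strip().casefold()


if __name__ == "__main__":
    # Hypothetical schema for a ticket-triage agent.
    TICKET_SCHEMA = {"category": str, "priority": int}
    print(check_json_schema('{"category": "billing", "priority": 2}', TICKET_SCHEMA))
    # -> []
    print(check_json_schema('{"category": "billing"}', TICKET_SCHEMA))
    # -> ['missing key: priority']
    print(check_regex("TICKET-10482", r"TICKET-\d+"))  # -> True
    print(check_exact(" Refund ", "refund"))           # -> True
```

Checks like these run in milliseconds and cost nothing per case, which is why they suit the per-PR smoke suite and 100% production coverage, with LLM judges reserved for outputs that cannot be verified mechanically.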