Evaluation & Testing

The senior engineer's operational guide to agent evaluation: when to eval vs when not to, which method for each output type, real cost math, building datasets from production traces, CI/CD integration you own, production monitoring, and the failure modes of eval systems themselves.

Quick Reference

→Match eval investment to traffic — manual spot-checks are appropriate for under 100 daily users; automated pipelines earn their cost above 1,000
→Deterministic eval (schema, regex, exact match) covers structured outputs for $0 — start here before adding any LLM judges
→LLM-as-judge with Sonnet 4.6 costs ~$0.0075/case; with Haiku it drops to ~$0.002/case — pick your judge model based on your weekly eval budget
→Build golden datasets from production traces filtered by failures, complaints, and edge cases — not random samples
→Run a 50-case smoke suite on every PR (under 5 min); run the full 500-case suite nightly to amortize judge costs
→Online eval: code evaluators on 100% of traffic, LLM judge on 5–10%. At 10K req/day with Sonnet, judging 100% of traffic costs about $2,250/month (10,000 × $0.0075 × 30 days); a 10% sample costs about $225/month
→Eval systems fail silently through judge drift, dataset staleness, and narrow coverage — audit monthly
→First 30 days: manual review (week 1), deterministic CI (week 2), LLM judge (week 3), production monitoring (week 4)

Should You Even Eval? The Over-Engineering Trap

Vibes-based testing does not scale

Without quantitative evaluation, every prompt change is a gamble. You cannot tell if your agent improved or regressed until users complain. But over-engineering is the opposite trap: a full continuous eval pipeline costs engineering time and real money. Match your investment to your actual risk and traffic.

The first decision in agent evaluation is not which framework to use — it is whether to build automated evaluation at all. A production agent serving 50 users with infrequent prompt changes does not need a CI eval pipeline. A customer-facing agent at 10,000 daily users does. The right question is: what is the cost of a regression reaching users versus the cost of building and maintaining eval infrastructure?

Stage	Daily Users	What to Build	Weekly Time	Weekly Cost
Prototype	< 100	Manual spot-check 10 random traces per day	30 min	$0
Early Production	100–1K	50-case golden set + deterministic CI checks	2 hours	$0–5
Growth	1K–10K	LLM-as-judge + CI quality gates + production sampling	4 hours	$20–100
Scale	> 10K	Continuous pipeline + online eval + alerting	Ongoing	$200+

The cheapest eval is the one you do not need

If your agent outputs structured data — JSON, SQL, code, classification labels — deterministic validation catches 80% of failures for $0 in judge costs. Audit your agent's output types before reaching for LLM-as-judge.
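As a rough illustration of what a zero-cost deterministic check can look like, the sketch below validates a JSON output using only the Python standard library. The field names, allowed categories, and priority range are hypothetical and stand in for whatever schema your agent is actually expected to produce.

```python
import json

# Hypothetical expected shape for a ticket-triage agent's JSON output.
REQUIRED_FIELDS = {"category": str, "priority": int, "summary": str}
ALLOWED_CATEGORIES = {"billing", "bug", "feature_request"}
PRIORITY_RANGE = range(1, 5)  # assumed valid priorities: 1 through 4


def check_json_output(raw: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = []
    # Presence and type checks for every required field.
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            failures.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            failures.append(f"wrong type for {field}")

    # Value checks run only when the field has the right type.
    category = data.get("category")
    if isinstance(category, str) and category not in ALLOWED_CATEGORIES:
        failures.append(f"unknown category: {category}")
    priority = data.get("priority")
    if isinstance(priority, int) and priority not in PRIORITY_RANGE:
        failures.append(f"priority out of range: {priority}")
    return failures
```

A check like this can run on every eval case and on every production response, since it makes no model calls. Returning reasons rather than a bare boolean makes failures easier to group when reviewing a run.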

Which Eval Method? A Decision Framework

There are three eval families: deterministic (exact match, regex, schema validation, code execution), LLM-as-judge (rubric-based scoring by a model), and human review (manual annotation). The decision is not which is best in general — it is which is cheapest for your specific output type. Deterministic eval is 100% accurate for what it can measure and costs nothing to run. LLM-as-judge handles semantic quality but costs $0.001–0.02 per case and has its own failure modes. Human review is ground truth but does not scale beyond calibration and novel edge cases.
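One way to make this decision explicit is a small routing function, sketched below. The output-type labels and the `is_novel_edge_case` flag are illustrative assumptions, not part of any framework; the ordering simply encodes "cheapest method that can measure the output first".

```python
from enum import Enum


class EvalMethod(Enum):
    DETERMINISTIC = "deterministic"
    LLM_JUDGE = "llm_judge"
    HUMAN_REVIEW = "human_review"


# Illustrative labels for output types that can be checked without a model.
DETERMINISTIC_TYPES = {"json", "sql", "code", "classification_label"}


def choose_eval_method(output_type: str, is_novel_edge_case: bool = False) -> EvalMethod:
    """Pick the cheapest eval family that can measure this output type."""
    # Structured outputs: schema, regex, exact match, or code execution.
    if output_type in DETERMINISTIC_TYPES:
        return EvalMethod.DETERMINISTIC
    # Free-form outputs with no precedent go to a human for ground truth.
    if is_novel_edge_case:
        return EvalMethod.HUMAN_REVIEW
    # Everything else is semantic quality, scored against a rubric by a model.
    return EvalMethod.LLM_JUDGE
```

For example, `choose_eval_method("sql")` selects deterministic checks, while `choose_eval_method("free_text")` falls through to the LLM judge.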

What Eval Actually Costs

Most eval articles skip the cost calculation. Here is the arithmetic. A typical LLM-as-judge call for an agent eval case sends approximately 1,500 input tokens (system prompt + rubric + agent output) and receives approximately 200 output tokens (JSON scores). Using Claude Sonnet 4.6 at approximately $3/M input and $15/M output: cost per case = (3 × 1,500 + 15 × 200) / 1,000,000 = $0.0075. A 500-case suite therefore costs $0.0075 × 500 = $3.75 per run; run on every PR at 10 PRs/week, that is $37.50/week. Using Haiku at $0.80/M input and $4/M output, the same case costs (0.80 × 1,500 + 4 × 200) / 1,000,000 = $0.002, roughly a 3.75× reduction.
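The same arithmetic can be kept as a small calculator so the numbers are easy to rerun when prices or token counts change. This is a minimal sketch: the function names are made up for illustration, the prices and token counts are the approximate figures quoted above, and the 30-day month is an assumption.

```python
def judge_cost_per_case(
    input_price_per_m: float,
    output_price_per_m: float,
    input_tokens: int = 1_500,
    output_tokens: int = 200,
) -> float:
    """Dollar cost of one LLM-as-judge call, given per-million-token prices."""
    return (input_price_per_m * input_tokens + output_price_per_m * output_tokens) / 1_000_000


def suite_cost(cases: int, cost_per_case: float, runs: int = 1) -> float:
    """Dollar cost of running an eval suite `runs` times."""
    return cases * cost_per_case * runs


def monthly_online_cost(
    requests_per_day: int,
    sample_rate: float,
    cost_per_case: float,
    days: int = 30,  # assumed month length
) -> float:
    """Dollar cost per month of judging a sampled fraction of production traffic."""
    return requests_per_day * sample_rate * cost_per_case * days


# Approximate prices quoted in the text, in dollars per million tokens.
sonnet = judge_cost_per_case(3.00, 15.00)
haiku = judge_cost_per_case(0.80, 4.00)

print(f"Sonnet per case: ${sonnet:.4f}")
print(f"Haiku per case: ${haiku:.4f}")
print(f"Sonnet/Haiku ratio: {sonnet / haiku:.2f}x")
print(f"500-case run on Sonnet: ${suite_cost(500, sonnet):.2f}")
print(f"10 runs per week on Sonnet: ${suite_cost(500, sonnet, runs=10):.2f}")
print(f"10% online sample at 10K req/day: ${monthly_online_cost(10_000, 0.10, sonnet):.2f}/month")
```

Token counts dominate the result, so it is worth measuring your own judge prompts rather than relying on the 1,500/200 defaults.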
