
Evaluation & Testing

The senior engineer's operational guide to agent evaluation: when to eval vs when not to, which method for each output type, real cost math, building datasets from production traces, CI/CD integration you own, production monitoring, and the failure modes of eval systems themselves.

Quick Reference

  • Match eval investment to traffic — manual spot-checks are appropriate for under 100 daily users; automated pipelines earn their cost above 1,000
  • Deterministic eval (schema, regex, exact match) covers structured outputs for $0 — start here before adding any LLM judges
  • LLM-as-judge with Sonnet 4.6 costs ~$0.0075/case; with Haiku it drops to ~$0.002/case — pick your judge model based on your weekly eval budget
  • Build golden datasets from production traces filtered by failures, complaints, and edge cases — not random samples
  • Run a 50-case smoke suite on every PR (under 5 min); run the full 500-case suite nightly to amortize judge costs
  • Online eval: code evaluators on 100% of traffic, LLM judge on 5–10%. At 10K req/day with Sonnet, judging a 10% sample costs about $225/month; judging 100% would cost about $2,250/month
  • Eval systems fail silently through judge drift, dataset staleness, and narrow coverage — audit monthly
  • First 30 days: manual review (week 1), deterministic CI (week 2), LLM judge (week 3), production monitoring (week 4)
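The judge-cost figures in the list above follow from simple multiplication, so a quick calculation is a useful check before committing to a sampling rate. The sketch below is illustrative only: the function name and the 30-day month are assumptions, and the per-case prices are the approximate figures quoted in this section, not measured values.

```python
def monthly_judge_cost(requests_per_day: int, sample_rate: float,
                       cost_per_case: float, days: int = 30) -> float:
    """Estimated monthly LLM-judge spend in dollars (assumes a 30-day month)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return requests_per_day * sample_rate * cost_per_case * days


# Approximate per-case prices quoted in the Quick Reference.
SONNET_PER_CASE = 0.0075
HAIKU_PER_CASE = 0.002

if __name__ == "__main__":
    for rate in (0.05, 0.10, 1.00):
        sonnet = monthly_judge_cost(10_000, rate, SONNET_PER_CASE)
        haiku = monthly_judge_cost(10_000, rate, HAIKU_PER_CASE)
        print(f"{rate:>4.0%} sampled: Sonnet ${sonnet:,.2f}/mo, Haiku ${haiku:,.2f}/mo")
```

Running it prints the monthly spend at 5%, 10%, and 100% sampling for both judge models, which makes the gap between sampled and full-coverage judging easy to see before you set a budget.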

Should You Even Eval? The Over-Engineering Trap

Vibes-based testing does not scale

Without quantitative evaluation, every prompt change is a gamble. You cannot tell if your agent improved or regressed until users complain. But over-engineering is the opposite trap: a full continuous eval pipeline costs engineering time and real money. Match your investment to your actual risk and traffic.

The first decision in agent evaluation is not which framework to use — it is whether to build automated evaluation at all. A production agent serving 50 users with infrequent prompt changes does not need a CI eval pipeline. A customer-facing agent at 10,000 daily users does. The right question is: what is the cost of a regression reaching users versus the cost of building and maintaining eval infrastructure?
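One way to make that question concrete is to put rough numbers on both sides. The sketch below is a back-of-the-envelope comparison, not a validated model: every field name and every input value is a hypothetical placeholder, and it assumes the eval pipeline would catch every regression, which overstates its benefit.

```python
from dataclasses import dataclass


@dataclass
class EvalDecision:
    """All fields are hypothetical estimates you supply yourself."""
    regressions_per_month: float    # regressions expected to ship without evals
    users_hit_per_regression: int   # users exposed before someone notices
    cost_per_affected_user: float   # support, churn, or refund cost in dollars
    eval_hours_per_week: float      # engineering time to run and maintain evals
    hourly_rate: float              # loaded engineering cost in dollars
    judge_spend_per_week: float     # LLM-judge API spend in dollars

    def monthly_regression_cost(self) -> float:
        return (self.regressions_per_month
                * self.users_hit_per_regression
                * self.cost_per_affected_user)

    def monthly_eval_cost(self, weeks_per_month: float = 4.33) -> float:
        weekly = self.eval_hours_per_week * self.hourly_rate + self.judge_spend_per_week
        return weekly * weeks_per_month

    def worth_automating(self) -> bool:
        # Simplification: assumes evals catch every regression.
        return self.monthly_regression_cost() > self.monthly_eval_cost()


if __name__ == "__main__":
    # Made-up inputs for a small internal agent and a large customer-facing one.
    small = EvalDecision(0.5, 50, 2.0, 2, 100, 5)
    large = EvalDecision(2, 3_000, 2.0, 4, 100, 100)
    for name, case in (("small", small), ("large", large)):
        print(name, round(case.monthly_regression_cost()),
              round(case.monthly_eval_cost()), case.worth_automating())
```

The exact inputs matter less than the exercise. If you cannot make the regression side exceed the eval side with plausible estimates, manual review is probably still the right choice.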

Stage            | Daily Users | What to Build                                         | Weekly Time | Weekly Cost
Prototype        | < 100       | Manual spot-check 10 random traces per day            | 30 min      | $0
Early Production | 100–1K      | 50-case golden set + deterministic CI checks          | 2 hours     | $0–5
Growth           | 1K–10K      | LLM-as-judge + CI quality gates + production sampling | 4 hours     | $20–100
Scale            | > 10K       | Continuous pipeline + online eval + alerting          | Ongoing     | $200+

The cheapest eval is the one you do not need

If your agent outputs structured data — JSON, SQL, code, classification labels — deterministic validation catches 80% of failures for $0 in judge costs. Audit your agent's output types before reaching for LLM-as-judge.
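As an illustration of what deterministic validation can look like for a JSON-emitting agent, here is a minimal sketch using only the Python standard library. The output shape (`ticket_id`, `label`, `confidence`), the label set, and the ID format are all hypothetical; substitute your agent's actual contract.

```python
import json
import re

ALLOWED_LABELS = {"billing", "technical", "account"}  # hypothetical label set
TICKET_ID = re.compile(r"TCK-\d{6}")                  # hypothetical ID format
REQUIRED_KEYS = ("ticket_id", "label", "confidence")


def check_output(raw: str) -> list[str]:
    """Return failure reasons; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]

    # Regex check: the ID must match the expected format exactly.
    if "ticket_id" in data:
        ticket_id = data["ticket_id"]
        if not (isinstance(ticket_id, str) and TICKET_ID.fullmatch(ticket_id)):
            failures.append("ticket_id does not match TCK-NNNNNN")

    # Exact-match check: the label must be one of the allowed values.
    if "label" in data:
        label = data["label"]
        if not (isinstance(label, str) and label in ALLOWED_LABELS):
            failures.append(f"unknown label: {label!r}")

    # Range check: confidence must be a real number in [0, 1] (bools excluded).
    if "confidence" in data:
        conf = data["confidence"]
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not (is_number and 0 <= conf <= 1):
            failures.append("confidence is not a number in [0, 1]")

    return failures
```

Because the function returns reasons rather than a bare boolean, the same check can fail a CI run and also feed a per-reason failure count, with no judge model involved.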