
Online Evaluation: Production Monitoring

Run LLM-as-judge and code evaluators on production traces in near real time: catch quality regressions, monitor safety, and score every response without adding user-facing latency.
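As a sketch of how an LLM-as-judge evaluator might be wired up, the snippet below builds a grading prompt and parses the judge model's reply into a normalized score. The prompt template, 1-5 scale, and function names are illustrative assumptions, not LangSmith APIs:

```python
import json

# Hypothetical rubric; real deployments would tune criteria per product.
JUDGE_PROMPT = """You are grading an assistant response.
Criteria: helpfulness, accuracy, tone.
Return JSON: {{"score": <1-5>, "reasoning": "<one sentence>"}}

User input: {input}
Assistant response: {output}"""


def build_judge_prompt(user_input: str, model_output: str) -> str:
    """Fill the grading template with a single production trace."""
    return JUDGE_PROMPT.format(input=user_input, output=model_output)


def parse_judge_reply(reply: str) -> float:
    """Parse the judge model's JSON reply; normalize the 1-5 score to 0-1."""
    data = json.loads(reply)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return (score - 1) / 4  # 1 -> 0.0, 5 -> 1.0


print(parse_judge_reply('{"score": 4, "reasoning": "mostly helpful"}'))  # 0.75
```

Normalizing to 0-1 keeps judge scores comparable with binary code-evaluator checks on the same dashboard.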

Quick Reference

  • Online evaluation runs evaluators on production traces asynchronously — zero user-facing latency
  • LLM-as-judge evaluators score traces for quality, safety, tone, and helpfulness in near real time
  • Code evaluators use deterministic logic (regex, keyword checks, JSON validation) for fast checks
  • Sampling rates control cost: run safety checks on 100% of traces but score quality on only 10%
  • Filter by metadata, tool calls, or feedback to target specific trace types
  • Results appear as feedback scores in LangSmith for dashboards and alerts
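The deterministic code-evaluator checks mentioned above (regex, keyword checks, JSON validation) can be sketched as a plain Python function. The feedback keys and refusal patterns here are hypothetical examples, not fixed LangSmith conventions:

```python
import json
import re


def evaluate_format(output: str) -> dict:
    """Deterministic checks that run in microseconds -- no LLM call needed."""
    scores = {}
    # Keyword check: refusal phrases suggest the agent failed to help.
    scores["no_refusal"] = 0.0 if re.search(
        r"\b(I cannot|I'm unable to)\b", output, re.IGNORECASE) else 1.0
    # JSON validation: if the reply claims to be JSON, confirm it parses.
    if output.lstrip().startswith("{"):
        try:
            json.loads(output)
            scores["valid_json"] = 1.0
        except json.JSONDecodeError:
            scores["valid_json"] = 0.0
    # Length sanity check: empty replies score 0.
    scores["nonempty"] = 1.0 if output.strip() else 0.0
    return scores
```

Because these checks are cheap and deterministic, they can run on every trace even when LLM-as-judge evaluators are sampled.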

Online vs. Offline Evaluation

[Diagram: online evaluation pipeline] Production agent traces flow into LangSmith, where evaluators run asynchronously: an LLM-as-judge scores helpfulness on a 10% sample, code evaluators check format and latency, and a safety check runs at 100% coverage. Scores and trends feed a dashboard, breaching a threshold triggers alerts, and flagged traces route to an annotation queue.
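Per-evaluator sampling (100% safety coverage, 10% helpfulness scoring) is often implemented by hashing the trace id rather than calling `random()`, so the decision is deterministic. A minimal sketch with made-up rates:

```python
import hashlib

# Illustrative per-evaluator rates; tune these to traffic and budget.
SAMPLE_RATES = {"safety": 1.0, "helpfulness": 0.10, "format": 0.50}


def should_evaluate(trace_id: str, evaluator: str) -> bool:
    """Deterministically sample traces per evaluator.

    Hashing the trace id means reruns make the same decision, so a
    backfill job never double-scores a trace it already evaluated.
    """
    rate = SAMPLE_RATES.get(evaluator, 0.0)
    digest = hashlib.sha256(f"{trace_id}:{evaluator}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


# Safety runs on every trace; unknown evaluators are skipped entirely.
assert should_evaluate("trace-123", "safety")
```

Including the evaluator name in the hash decorrelates the samples, so the 10% of traces judged for helpfulness is independent of the 50% checked for format.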

| Aspect | Offline Evaluation | Online Evaluation |
|---|---|---|
| When | Before deployment (CI/CD) | After deployment (production) |
| Data | Curated datasets with expected outputs | Real user traces without reference answers |
| Purpose | Regression testing, benchmarking | Quality monitoring, anomaly detection |
| Latency impact | None (separate pipeline) | None (async, post-response) |
| Cost model | Fixed (dataset size × evaluator cost) | Variable (traffic × sampling rate × evaluator cost) |
| Feedback loop | Fix before deploy | Alert, investigate, hotfix |
Both are essential

Offline eval catches known issues before deployment. Online eval catches unknown issues in production — novel query patterns, edge cases in real data, and gradual quality drift that curated datasets miss.
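The online cost model from the table (traffic × sampling rate × evaluator cost) is straightforward to budget. A small illustrative estimator, where the per-call judge price is an assumption you would derive from token counts and model pricing:

```python
def monthly_eval_cost(traces_per_day: float, sample_rate: float,
                      cost_per_eval: float, days: int = 30) -> float:
    """Online-eval budget: traffic x sampling rate x evaluator cost.

    cost_per_eval is the judge's per-call price (tokens x token price),
    an assumed figure for illustration.
    """
    return traces_per_day * days * sample_rate * cost_per_eval


# e.g. 50k traces/day, 10% sampled, $0.002 per judge call:
# 50_000 * 30 * 0.10 * 0.002 = $300/month
print(round(monthly_eval_cost(50_000, 0.10, 0.002), 2))
```

Running the same arithmetic at 100% sampling ($3,000/month in this example) shows why safety checks are usually cheap code evaluators while expensive LLM judges get sampled.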