Advanced · 10 min
Online Evaluation: Production Monitoring
Run LLM-as-judge and code evaluators on production traces in near real time: catch quality regressions, monitor safety, and score responses without slowing down users.
Quick Reference
- Online evaluation runs evaluators on production traces asynchronously, with zero user-facing latency
- LLM-as-judge evaluators score traces for quality, safety, tone, and helpfulness as traffic arrives
- Code evaluators use deterministic logic (regex, keyword checks, JSON validation) for fast, cheap checks
- Sampling rates control cost: evaluate 100% of traces for safety but only a fraction (say 10%) for quality scoring
- Filter by metadata, tool calls, or feedback to target specific trace types
- Results appear as feedback scores in LangSmith for dashboards and alerts
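A deterministic code evaluator plus a sampling gate can be sketched in plain Python. This is an illustrative sketch, not the LangSmith SDK: the trace shape, the `expects_json` flag, and the regex/keyword rules are all hypothetical stand-ins for whatever checks your application needs.

```python
import json
import random
import re

def code_evaluator(trace: dict) -> dict:
    """Deterministic checks on one trace: regex, keyword, and JSON validity.

    The trace dict shape here is an assumption for illustration.
    """
    output = trace.get("output", "")
    scores = {
        # Regex check: flag anything that looks like a leaked API key
        "no_secrets": not re.search(r"sk-[A-Za-z0-9]{20,}", output),
        # Keyword check: flag canned refusal phrasing
        "no_refusal": "I cannot help with that" not in output,
    }
    # JSON validation only applies when the app promises structured output
    if trace.get("expects_json"):
        try:
            json.loads(output)
            scores["valid_json"] = True
        except json.JSONDecodeError:
            scores["valid_json"] = False
    return scores

def should_sample(rate: float) -> bool:
    """Sampling gate: run the evaluator on only `rate` fraction of traces."""
    return random.random() < rate

# Example: a trace whose output is supposed to be JSON
trace = {"output": '{"answer": "42"}', "expects_json": True}
if should_sample(1.0):  # 100% sampling for cheap deterministic checks
    print(code_evaluator(trace))
```

Because code evaluators are fast and essentially free, they are typically sampled at 100%, while LLM-as-judge evaluators get a lower rate.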
Online vs. Offline Evaluation
Production traces → sampled evaluators (LLM judge + code + safety) → dashboard + alerts
| Aspect | Offline Evaluation | Online Evaluation |
|---|---|---|
| When | Before deployment (CI/CD) | After deployment (production) |
| Data | Curated datasets with expected outputs | Real user traces without reference answers |
| Purpose | Regression testing, benchmarking | Quality monitoring, anomaly detection |
| Latency impact | None (separate pipeline) | None (async, post-response) |
| Cost model | Fixed (dataset size × evaluator cost) | Variable (traffic × sampling rate × evaluator cost) |
| Feedback loop | Fix before deploy | Alert, investigate, hotfix |
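The variable cost row in the table is simple arithmetic: traffic × sampling rate × per-evaluation cost. A quick sketch, with entirely hypothetical numbers, shows why sampling rate is the main cost lever:

```python
def online_eval_cost(daily_traces: int, sampling_rate: float, cost_per_eval: float) -> float:
    """Variable cost model from the table: traffic x sampling rate x evaluator cost."""
    return daily_traces * sampling_rate * cost_per_eval

# Hypothetical numbers: 50k traces/day, 10% sampled, $0.002 per LLM-judge call
daily = online_eval_cost(50_000, 0.10, 0.002)
print(f"${daily:.2f}/day")  # $10.00/day at these assumed rates
```

Doubling traffic doubles cost, but so does doubling the sampling rate, which is why quality scoring is often sampled at 10% while cheap code evaluators run on everything.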
Both are essential
Offline eval catches known issues before deployment. Online eval catches unknown issues in production: novel query patterns, edge cases in real data, and gradual quality drift that curated datasets miss.