Advanced · 10 min
Online Evaluation: Production Monitoring
Run LLM-as-judge and code evaluators on production traces in near real time: catch quality regressions, monitor safety, and score responses without slowing down users.
Quick Reference
- Online evaluation runs evaluators on production traces asynchronously, with zero user-facing latency
- LLM-as-judge evaluators score traces for quality, safety, tone, and helpfulness as traffic arrives
- Code evaluators use deterministic logic (regex, keyword checks, JSON validation) for fast, cheap checks
- Sampling rates control cost: evaluate 100% of traces for safety but only a fraction (say 10%) for quality scoring
- Filter by metadata, tool calls, or feedback to target specific trace types
- Results appear as feedback scores in LangSmith for dashboards and alerts
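A deterministic code evaluator plus a sampling gate can be sketched in plain Python. This is an illustrative sketch, not the LangSmith SDK: the trace shape, the `expects_json` flag, and the regex/keyword rules are all hypothetical stand-ins for whatever checks your application needs.

```python
import json
import random
import re

def code_evaluator(trace: dict) -> dict:
    """Deterministic checks on one trace: regex, keyword, and JSON validity.

    The trace dict shape here is an assumption for illustration.
    """
    output = trace.get("output", "")
    scores = {
        # Regex check: flag anything that looks like a leaked API key
        "no_secrets": not re.search(r"sk-[A-Za-z0-9]{20,}", output),
        # Keyword check: flag canned refusal phrasing
        "no_refusal": "I cannot help with that" not in output,
    }
    # JSON validation only applies when the app promises structured output
    if trace.get("expects_json"):
        try:
            json.loads(output)
            scores["valid_json"] = True
        except json.JSONDecodeError:
            scores["valid_json"] = False
    return scores

def should_sample(rate: float) -> bool:
    """Sampling gate: run the evaluator on only `rate` fraction of traces."""
    return random.random() < rate

# Example: a trace whose output is supposed to be JSON
trace = {"output": '{"answer": "42"}', "expects_json": True}
if should_sample(1.0):  # 100% sampling for cheap deterministic checks
    print(code_evaluator(trace))
```

Because code evaluators are fast and essentially free, they are typically sampled at 100%, while LLM-as-judge evaluators get a lower rate.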
Online vs. Offline Evaluation
Production traces → sampled evaluators (LLM judge + code + safety) → dashboard + alerts
| Aspect | Offline Evaluation | Online Evaluation |
|---|---|---|
| When | Before deployment (CI/CD) | After deployment (production) |
| Data | Curated datasets with expected outputs | Real user traces without reference answers |
| Purpose | Regression testing, benchmarking | Quality monitoring, anomaly detection |
| Latency impact | None (separate pipeline) | None (async, post-response) |
| Cost model | Fixed (dataset size × evaluator cost) | Variable (traffic × sampling rate × evaluator cost) |
| Feedback loop | Fix before deploy | Alert, investigate, hotfix |
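The variable cost row in the table is simple arithmetic: traffic × sampling rate × per-evaluation cost. A quick sketch, with entirely hypothetical numbers, shows why sampling rate is the main cost lever:

```python
def online_eval_cost(daily_traces: int, sampling_rate: float, cost_per_eval: float) -> float:
    """Variable cost model from the table: traffic x sampling rate x evaluator cost."""
    return daily_traces * sampling_rate * cost_per_eval

# Hypothetical numbers: 50k traces/day, 10% sampled, $0.002 per LLM-judge call
daily = online_eval_cost(50_000, 0.10, 0.002)
print(f"${daily:.2f}/day")  # $10.00/day at these assumed rates
```

Doubling traffic doubles cost, but so does doubling the sampling rate, which is why quality scoring is often sampled at 10% while cheap code evaluators run on everything.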
Both are essential
Offline eval catches known issues before deployment. Online eval catches unknown issues in production: novel query patterns, edge cases in real data, and gradual quality drift that curated datasets miss.