
Domain-Specific Evaluation

Generic evaluation metrics miss domain-specific quality. Code generation needs execution tests. Summarization needs faithfulness checks. Classification needs confusion matrices. This article builds custom evaluators for the most common AI application domains, with production-ready code for each.

Quick Reference

  • Code generation: pass@k (probability of at least one correct solution in k samples) is the standard metric
  • Summarization: evaluate faithfulness (no hallucination), coverage (key points included), and conciseness separately
  • Classification: precision/recall/F1 per class, macro vs micro averaging, confusion matrix analysis
  • Information extraction: exact match, fuzzy match, and field-level metrics for structured output
  • Build domain evaluators as composable functions that return standardized score dictionaries
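The last point above can be sketched concretely. This is a minimal illustration, not a library API — the names `Evaluator`, `exact_match`, `length_ratio`, and `combine_evaluators` are hypothetical; the idea is that every evaluator has the same signature and returns a flat score dictionary, so evaluators compose by merging their outputs:

```python
from typing import Callable

# Hypothetical type alias: any evaluator maps (prediction, reference)
# to a flat dictionary of named scores in a standardized shape.
Evaluator = Callable[[str, str], dict[str, float]]

def exact_match(prediction: str, reference: str) -> dict[str, float]:
    """1.0 if the prediction matches the reference exactly, else 0.0."""
    return {"exact_match": float(prediction.strip() == reference.strip())}

def length_ratio(prediction: str, reference: str) -> dict[str, float]:
    """Conciseness proxy: prediction length relative to reference length."""
    return {"length_ratio": len(prediction) / max(len(reference), 1)}

def combine_evaluators(*evaluators: Evaluator) -> Evaluator:
    """Merge several evaluators into one returning a single score dict."""
    def combined(prediction: str, reference: str) -> dict[str, float]:
        scores: dict[str, float] = {}
        for evaluator in evaluators:
            scores.update(evaluator(prediction, reference))
        return scores
    return combined

evaluate = combine_evaluators(exact_match, length_ratio)
print(evaluate("Paris", "Paris"))  # {'exact_match': 1.0, 'length_ratio': 1.0}
```

Because every evaluator returns the same shape, adding a domain-specific check (faithfulness, field-level match, pass@k) is just one more function in the composition, and downstream aggregation code never changes.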

Code Generation Evaluation

Code is unique among AI outputs: it can be objectively verified by execution. Does it compile? Does it pass tests? Does it produce the correct output? This makes code generation evaluation more rigorous than most domains — but execution-based evaluation has its own challenges, including security (running untrusted code), environment setup, and flaky tests.

Never execute untrusted code without sandboxing

LLM-generated code can contain anything — file system operations, network calls, infinite loops. Always execute in a sandboxed environment (Docker container, E2B, Modal sandbox) with resource limits (CPU time, memory, network access).
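The shape of such a harness can be sketched with the standard library. To be clear, a subprocess timeout alone is *not* a sandbox — in production this call would run inside a Docker, E2B, or Modal sandbox with CPU, memory, and network limits. The function name `run_untrusted` is illustrative:

```python
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 5.0) -> dict:
    """Run code in a separate interpreter with a wall-clock timeout.

    NOTE: this only demonstrates the harness shape. A timeout stops
    infinite loops, but does nothing about file system or network
    access -- wrap the call in a real sandbox for those.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"passed": proc.returncode == 0,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"passed": False, "stdout": "", "stderr": "timeout"}

print(run_untrusted("print(2 + 2)"))           # passed, stdout contains "4"
print(run_untrusted("while True: pass", 1.0))  # killed after 1s, passed=False
```

Returning a dictionary rather than raising keeps the harness composable: a flaky or hostile generation becomes a failed test case, not a crashed evaluation run.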

Code generation evaluation with execution-based testing and pass@k
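The pass@k metric itself has a closed-form unbiased estimator, introduced with the HumanEval benchmark (Chen et al., 2021): generate n samples per problem, count the c that pass the tests, and compute the probability that a random draw of k samples contains at least one correct solution. A direct implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated for the problem
    c: samples that passed all tests
    k: budget of attempts

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that all k drawn samples are incorrect.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 3 of which passed the tests:
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```

Computing the estimator this way, rather than naively sampling k of the n generations, avoids the high variance of small draws; averaging pass@k over all problems in the benchmark gives the headline number.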