Advanced · 14 min

Evaluator-Optimizer: Self-Improving Loops

Two LLM roles: one generates, one judges. The Evaluator-Optimizer pattern runs a structured feedback loop until the output clears a quality threshold or a cost budget runs out. Before using it, answer two questions: does your task have measurable quality criteria, and does it actually benefit from more than one attempt?

Quick Reference

  • Generator produces output → Evaluator scores it → loop until quality threshold is met or max iterations hit
  • Use two separate LLM roles (different prompts, optionally different models) — not self-critique
  • Use structured output for evaluation: score (1-10), feedback string, passes boolean
  • Default max iterations: 3 — most improvement happens in iterations 1-2
  • A 3-iteration loop costs ~4-6× a single pass due to context accumulation
  • Prefer evaluator-optimizer over reflection when the evaluator needs independent judgment
  • Best for: code generation, long-form content, data extraction with validation schemas
  • Plateau detection: stop when score doesn't improve across 2 rounds — you've hit a ceiling
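
The bullets above can be sketched as one control loop. This is a minimal illustration, not a reference implementation: `Evaluation`, `run_loop`, `plateau_rounds`, and the returned reason strings are hypothetical names, and `generate` and `evaluate` stand in for two separately prompted LLM calls. Plateau is interpreted here as "the score failed to beat the best so far for 2 consecutive rounds", which is one possible reading of the rule.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Evaluation:
    score: int      # 1-10, as in the quick reference
    feedback: str   # what the generator should fix next
    passes: bool    # the evaluator's own pass/fail verdict


# Hypothetical signatures: each would wrap an LLM call with its own prompt
# (and optionally its own model).
Generator = Callable[[str, Optional[str]], str]   # (task, feedback) -> draft
Evaluator = Callable[[str, str], Evaluation]      # (task, draft) -> evaluation


def run_loop(
    task: str,
    generate: Generator,
    evaluate: Evaluator,
    max_iterations: int = 3,
    plateau_rounds: int = 2,
) -> Tuple[str, Evaluation, str]:
    """Return (draft, evaluation, stop_reason)."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")

    best_draft: Optional[str] = None
    best_eval: Optional[Evaluation] = None
    feedback: Optional[str] = None
    stale = 0  # consecutive rounds without a new best score

    for _ in range(max_iterations):
        draft = generate(task, feedback)
        evaluation = evaluate(task, draft)

        if best_eval is None or evaluation.score > best_eval.score:
            best_draft, best_eval, stale = draft, evaluation, 0
        else:
            stale += 1

        if evaluation.passes:
            return draft, evaluation, "passed"
        if stale >= plateau_rounds:
            return best_draft, best_eval, "plateau"

        # Only the latest feedback is passed on in this sketch.
        feedback = evaluation.feedback

    return best_draft, best_eval, "max_iterations"
```

Returning the best-scoring draft instead of the last one is a design choice in this sketch: a later iteration can score lower than an earlier one, and the caller usually wants the strongest attempt when the loop stops without a pass.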

Should I Use This? (vs Reflection, vs Retry)

[Figure: two loop diagrams. Evaluator-Optimizer: a Generator LLM produces output, an Evaluator LLM returns a score plus structured feedback, and the feedback flows back to the generator. Two roles, independent judgment; can use different models per role. Reflection: a combined Generator + Critic (same model, different prompt) generates and then self-critiques in a loop. One role: simpler and cheaper. Risk: the model rationalizes its own errors.]

Evaluator-Optimizer separates generation from judgment. Reflection combines them.

Three patterns create feedback loops in agents: Evaluator-Optimizer, Reflection, and retry-with-validation. They look similar in architecture but differ in when to apply them. Choosing wrong adds latency and cost without improving output.

| Pattern | Structure | Use when | Avoid when |
|---|---|---|---|
| Evaluator-Optimizer | Two roles: generator + separate evaluator | Evaluation requires judgment separate from generation (code security review, editorial quality, schema compliance) | The same model could self-critique accurately; you're paying 2× for no benefit |
| Reflection | One role: generates then self-critiques | The generator has enough domain knowledge to evaluate its own output; lower cost is a priority | Evaluation requires independence; the model will rationalize its own errors |
| Retry-with-validation | One role: retry on parse/schema failure | Output must match a strict schema (JSON, function signature); failures are deterministic | Quality improvement is needed; validation only catches structural errors, not semantic ones |

The independence test

Ask: 'Could the generator plausibly convince itself its flawed output is fine?' If yes, use Evaluator-Optimizer with separate evaluation logic. Reflection works best when the flaw is obvious — a missing bracket, a wrong format — not when it requires domain judgment the generator might rationalize away.
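
For contrast, the retry-with-validation row of the table can be sketched with a purely deterministic check. This is an illustrative sketch under stated assumptions: `REQUIRED_KEYS`, `validate`, `retry_with_validation`, and `call_model` are hypothetical names, the two-field schema is invented for the example, and `call_model` stands in for whatever function returns the model's raw text.

```python
import json
from typing import Callable, Optional

# Hypothetical schema for a data-extraction task.
REQUIRED_KEYS = {"name", "email"}


def validate(raw: str) -> dict:
    """Structural check only: valid JSON object with the required keys."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def retry_with_validation(
    call_model: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Re-ask the model until its output parses and matches the schema."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return validate(raw)
        except ValueError as err:
            last_error = err
            # Feed the deterministic error back so the next attempt can fix it.
            prompt = f"{prompt}\n\nPrevious output was invalid: {err}. Return valid JSON only."
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

Note what the check accepts: `{"name": "Ada", "email": "not-an-email"}` passes, because both keys are present. Catching that kind of semantic problem is where a separate evaluator role earns its cost.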