Evaluator-Optimizer: Self-Improving Loops
Two LLM roles: one generates, one judges. The Evaluator-Optimizer pattern runs a structured feedback loop until the output clears a quality threshold or the cost budget runs out. Before using it, answer two questions: does your task have measurable quality criteria, and does it actually benefit from more than one attempt?
Quick Reference
- Generator produces output → Evaluator scores it → loop until the quality threshold is met or max iterations is hit
- Use two separate LLM roles (different prompts, optionally different models), not self-critique
- Use structured output for evaluation: score (1-10), feedback string, passes boolean
- Default max iterations: 3. Most improvement happens in iterations 1-2
- A 3-iteration loop costs ~4-6× a single pass due to context accumulation
- Prefer Evaluator-Optimizer over Reflection when the evaluator needs independent judgment
- Best for: code generation, long-form content, data extraction with validation schemas
- Plateau detection: stop when the score doesn't improve across 2 rounds, because you've hit a ceiling
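The quick-reference points above can be sketched as a small loop. This is a minimal illustration, not a definitive implementation: `run_loop`, `fake_generate`, and `fake_evaluate` are hypothetical names, the two callables stand in for real LLM calls, and "doesn't improve across 2 rounds" is read here as two consecutive rounds that fail to beat the best score so far.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Evaluation:
    score: int      # 1-10
    feedback: str   # what the generator should fix next
    passes: bool    # True once the quality threshold is met


def run_loop(
    generate: Callable[[str, Optional[str]], str],  # hypothetical generator role
    evaluate: Callable[[str, str], Evaluation],     # hypothetical evaluator role
    task: str,
    max_iterations: int = 3,
    plateau_rounds: int = 2,  # assumption: consecutive rounds without improvement
) -> Tuple[str, List[Evaluation]]:
    history: List[Evaluation] = []
    best_output, best_score, stale = "", 0, 0
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        output = generate(task, feedback)
        result = evaluate(task, output)
        history.append(result)
        if result.passes:
            return output, history          # threshold met: stop early
        if result.score > best_score:
            best_output, best_score, stale = output, result.score, 0
        else:
            stale += 1                      # no improvement this round
        if stale >= plateau_rounds:
            break                           # plateau: further rounds unlikely to help
        feedback = result.feedback
    return best_output, history             # budget or plateau hit: return best attempt


# Stub roles so the sketch runs without any model call.
def fake_generate(task: str, feedback: Optional[str]) -> str:
    return f"{task} v2" if feedback else f"{task} v1"


def fake_evaluate(task: str, output: str) -> Evaluation:
    score = 9 if output.endswith("v2") else 5
    return Evaluation(score, "tighten the intro", score >= 8)


final, history = run_loop(fake_generate, fake_evaluate, "draft")
```

With the stubs, the first draft scores 5, the evaluator's feedback is passed back, and the second draft passes. When the budget runs out without a pass, the sketch returns the best-scoring attempt rather than the last one, which is a design choice and not something the pattern requires.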
Should I Use This? (vs Reflection, vs Retry)
Evaluator-Optimizer separates generation from judgment. Reflection combines them.
Three patterns create feedback loops in agents: Evaluator-Optimizer, Reflection, and retry-with-validation. They look similar architecturally but suit different situations. Choosing the wrong one adds latency and cost without improving output.
| Pattern | Structure | Use when | Avoid when |
|---|---|---|---|
| Evaluator-Optimizer | Two roles: generator + separate evaluator | Evaluation requires judgment separate from generation (code security review, editorial quality, schema compliance) | The same model could self-critique accurately: you're paying 2× for no benefit |
| Reflection | One role: generates, then self-critiques | The generator has enough domain knowledge to evaluate its own output, and keeping cost low is a priority | Evaluation requires independence: the model will rationalize its own errors |
| Retry-with-validation | One role: retry on parse/schema failure | Output must match a strict schema (JSON, function signature); failures can be checked deterministically | Quality improvement is needed: validation only catches structural errors, not semantic ones |
Ask: 'Could the generator plausibly convince itself its flawed output is fine?' If yes, use Evaluator-Optimizer with separate evaluation logic. Reflection works best when the flaw is obvious — a missing bracket, a wrong format — not when it requires domain judgment the generator might rationalize away.
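To make the retry-with-validation row concrete, here is a minimal sketch of a structural-only check. The names `retry_with_validation`, `required_keys`, and the canned replies are hypothetical, and `generate` stands in for a real model call. It shows why this pattern cannot replace an evaluator: the accepted output is structurally valid but semantically wrong.

```python
import json
from typing import Callable, Optional, Set


def retry_with_validation(
    generate: Callable[[], str],   # hypothetical: returns raw model text
    required_keys: Set[str],
    max_attempts: int = 3,
) -> Optional[dict]:
    for _ in range(max_attempts):
        raw = generate()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue                               # parse failure: retry
        if isinstance(data, dict) and required_keys.issubset(data):
            return data                            # structure is valid: accept
    return None                                    # attempts exhausted


# Canned replies standing in for three model attempts.
replies = iter([
    "not json at all",
    '{"name": "Ada"}',                 # parses, but "age" is missing
    '{"name": "Ada", "age": -4}',      # passes the check, yet the age is nonsense
])
result = retry_with_validation(lambda: next(replies), {"name", "age"})
```

The third reply is accepted because the check is deterministic and purely structural. Catching the negative age takes either an extra validation rule or a judging role, which is where Reflection or Evaluator-Optimizer comes in.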