Evaluator-Optimizer: Self-Improving Loops

Two LLMs, two roles: one generates, one judges. The Evaluator-Optimizer pattern runs a structured feedback loop until output clears a quality threshold — or a cost budget runs out. Before using it, you need to answer two questions: does your task have measurable quality criteria, and does it actually benefit from more than one attempt?

Quick Reference

→ Generator produces output, evaluator scores it, and the loop repeats until the quality threshold is met or max iterations are hit
→ Use two separate LLM roles (different prompts, optionally different models), not self-critique
→ Use structured output for evaluation: score (1-10), feedback string, passes boolean
→ Default max iterations: 3 (most improvement happens in iterations 1-2)
→ A 3-iteration loop costs ~4-6× a single pass due to context accumulation
→ Prefer Evaluator-Optimizer over Reflection when the evaluator needs independent judgment
→ Best for: code generation, long-form content, data extraction with validation schemas
→ Plateau detection: stop when the score doesn't improve across 2 rounds, because you've hit a ceiling
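
The points above can be sketched as a small control loop. This is a minimal illustration, not a reference implementation: `Evaluation`, `run_loop`, and the `generate` / `evaluate` callables are hypothetical names, and the stub functions stand in for real LLM calls. It assumes the evaluator returns the structured result described above (score, feedback, passes) and that "plateau" means two consecutive rounds without a new best score.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evaluation:
    """Structured evaluator output: score (1-10), feedback string, passes boolean."""
    score: int
    feedback: str
    passes: bool


def run_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],   # hypothetical generator role
    evaluate: Callable[[str, str], Evaluation],      # hypothetical evaluator role
    max_iterations: int = 3,
    patience: int = 2,
) -> tuple[Optional[str], list[int], str]:
    """Run generate -> evaluate until pass, plateau, or iteration budget."""
    best_output: Optional[str] = None
    best_score = 0
    stale_rounds = 0                 # rounds since the best score last improved
    feedback: Optional[str] = None
    scores: list[int] = []

    for _ in range(max_iterations):
        output = generate(task, feedback)
        result = evaluate(task, output)
        scores.append(result.score)

        if result.score > best_score:
            best_output, best_score = output, result.score
            stale_rounds = 0
        else:
            stale_rounds += 1

        if result.passes:
            return output, scores, "passed"
        if stale_rounds >= patience:
            return best_output, scores, "plateau"

        feedback = result.feedback   # feed the critique into the next attempt

    return best_output, scores, "max_iterations"


# Stubs standing in for LLM calls (assumptions for demonstration only).
def stub_generate(task: str, feedback: Optional[str]) -> str:
    return task if feedback is None else f"{task} [revised: {feedback}]"


def stub_evaluate(task: str, output: str) -> Evaluation:
    score = 8 if "revised" in output else 5
    return Evaluation(score=score, feedback="add error handling", passes=score >= 8)


final_output, score_history, stop_reason = run_loop(
    "write a parser", stub_generate, stub_evaluate
)
```

Returning the stop reason alongside the output makes it easy to log how often a loop ends by passing versus exhausting its budget, which is one way to check whether the extra iterations are paying for themselves.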

Should I Use This? (vs Reflection, vs Retry)

Evaluator-Optimizer separates generation from judgment. Reflection combines them.

Three patterns create feedback loops in agents: Evaluator-Optimizer, Reflection, and retry-with-validation. They look similar in architecture but differ in when to apply them. Choosing wrong adds latency and cost without improving output.

Pattern	Structure	Use when	Avoid when
Evaluator-Optimizer	Two roles: generator + separate evaluator	Evaluation requires judgment separate from generation (code security review, editorial quality, schema compliance)	The same model could self-critique accurately — you're paying 2× for no benefit
Reflection	One role: generates then self-critiques	The generator has enough domain knowledge to evaluate its own output; keeping cost low is a priority	Evaluation requires independence, because the model will rationalize its own errors
Retry-with-validation	One role: retry on parse/schema failure	Output must match a strict schema (JSON, function signature); failures are deterministic	Quality improvement is needed — validation only catches structural errors, not semantic ones
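
For contrast with the two-role loop, the retry-with-validation row can be sketched as follows. This is an illustrative sketch under stated assumptions: `retry_with_validation` and the `generate` callable are hypothetical, the stub replaces a real model call, and the check is a simple required-keys test rather than a full schema validator. Note that it only verifies structure; it never judges whether the content is good.

```python
import json
from typing import Callable, Optional


def retry_with_validation(
    generate: Callable[[Optional[str]], str],   # hypothetical single-role generator
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Retry until the output parses as a JSON object with the required keys."""
    last_error: Optional[str] = None

    for _ in range(max_attempts):
        raw = generate(last_error)   # pass the previous error back as a hint

        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
            continue

        if not isinstance(data, dict):
            last_error = "expected a JSON object"
            continue

        missing = sorted(set(required_keys) - set(data))
        if missing:
            last_error = f"missing keys: {missing}"
            continue

        return data   # structurally valid; semantic quality is not checked

    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")


# Stub standing in for an LLM call: first reply is truncated, second is valid.
_replies = iter([
    '{"score": 7',
    '{"score": 7, "feedback": "ok", "passes": false}',
])


def stub_generate(last_error: Optional[str]) -> str:
    return next(_replies)


parsed = retry_with_validation(stub_generate, {"score", "feedback", "passes"})
```

Because the failure here is deterministic (the string either parses or it does not), no second model is needed. A separate evaluator only earns its cost once the check requires judgment.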

The independence test

Ask: 'Could the generator plausibly convince itself its flawed output is fine?' If yes, use Evaluator-Optimizer with separate evaluation logic. Reflection works best when the flaw is obvious — a missing bracket, a wrong format — not when it requires domain judgment the generator might rationalize away.
