Reflection & Self-Critique
When to add a self-critique loop, what it costs, where it fails, and how to measure whether it's earning its latency. Includes conditional-edge and Command-based LangGraph implementations, production-shaped code with regression guards, and a clear comparison with Evaluator-Optimizer.
Quick Reference
- Reflection = a second LLM call that scores the first output and decides whether to retry
- Use a cheaper model as critic (Claude Haiku 4.5, GPT-5.4 mini, Gemini 3.1 Flash); the critic role doesn't need frontier capability
- Cap iteration depth at 2-3: beyond that, cost multiplies without proportional quality gain
- Add a regression guard: if attempt N scores lower than attempt N-1, exit with the previous best draft
- Reflection vs Evaluator-Optimizer: reflection is a same-agent self-check; Eval-Opt is a two-role specialized loop. If you need fine-grained optimization feedback, use Eval-Opt
- The Command pattern (LangGraph v1.0+) is cleaner than add_conditional_edges for reflection routing because the routing logic lives inside the reflect node
- Gate reflection behind a latency budget check: never add a blocking LLM call to a user-facing synchronous path without measuring the cost
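The iteration cap and regression guard from the list above can be sketched without any framework. This is a minimal illustration, not a reference implementation: `reflect_loop`, `Critique`, the 0.0 to 1.0 score scale, and the `pass_score` default are all assumptions made for the example, and `generate` and `critique` stand in for whatever LLM calls you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    score: float   # assumed scale: 0.0 to 1.0, higher is better
    feedback: str  # handed back to the generator on the next attempt


def reflect_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],  # (task, feedback) -> draft
    critique: Callable[[str, str], Critique],       # (task, draft) -> Critique
    pass_score: float = 0.8,   # hypothetical acceptance threshold
    max_attempts: int = 3,     # iteration cap (2-3 per the list above)
) -> tuple[str, float, int]:
    """Return (best_draft, best_score, attempts_used)."""
    best_draft, best_score = "", float("-inf")
    feedback: Optional[str] = None

    for attempt in range(1, max_attempts + 1):
        draft = generate(task, feedback)
        result = critique(task, draft)

        # Regression guard: the revision scored lower than the previous
        # attempt, so stop and return the earlier, better draft.
        if result.score < best_score:
            return best_draft, best_score, attempt

        best_draft, best_score = draft, result.score

        # Good enough: exit without spending another round-trip.
        if best_score >= pass_score:
            return best_draft, best_score, attempt

        feedback = result.feedback

    # Cap reached: return the best draft seen so far.
    return best_draft, best_score, max_attempts
```

Because the loop exits on the first regression, the stored best draft is always the immediately preceding attempt, which matches the "attempt N vs attempt N-1" rule. The returned attempt count is worth logging, since it shows how often the cap or the guard is what actually ends the loop.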
Should I Use Reflection?
Reflection adds latency and cost to every invocation. Before wiring it in, verify that the quality gains justify both. The honest answer is: reflection helps less often than people expect, and fails in ways that are hard to detect.
| Scenario | Reflection value | Why |
|---|---|---|
| Code generation | High | Critic can check syntax, missing edge cases, and security issues without needing ground truth |
| Research summaries | High | Critic can verify claims against the source material provided in context |
| Email / document drafts | Medium — gate it | Useful for tone and structure; diminishing returns after one revision cycle |
| Creative writing | Low-medium | Style critiques loop indefinitely; define a termination rubric before enabling |
| Simple Q&A | Low | Factual errors require retrieval, not critique — reflection doesn't fix what it can't source |
| Tool-based lookups | None | The tool result is the answer; critiquing it adds latency without changing the facts |

| When NOT to reflect | Why reflection fails here | What to do instead |
|---|---|---|
| User-facing synchronous chat | Each retry adds a full LLM round-trip; users notice latency above ~2s | Run reflection offline on a sample; use findings to improve the generator prompt |
| Generator already passes >95% of first-pass evals | Reflection fires on every call but improves almost none, so ROI is net negative | Monitor first-pass quality; re-enable reflection only if the first-pass pass rate drops below your threshold |
| Task where tool results are the answer | The critic evaluates correct data as if it were a prose draft and 'improves' it into something wrong | Validate tool outputs with deterministic checks, not LLM critique |
| Open-ended creative tasks with no rubric | Without explicit criteria, the critic invents improvement axes — quality becomes random | Define your rubric before enabling reflection, or use human review instead |
Reflection earns its keep when all three are true: (1) quality criteria are expressible as a rubric the LLM can follow, (2) first-pass quality is measurably below your production threshold on a meaningful fraction of inputs, AND (3) the latency budget accommodates at least 1 additional LLM call on every invocation.
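The three conditions can be written down as an explicit pre-flight check. The sketch below is one possible encoding under stated assumptions: the `ReflectionReadiness` fields and `should_enable_reflection` are hypothetical names, "quality below threshold" is reduced to a single first-pass pass rate, and the latency check assumes exactly one extra critic call.

```python
from dataclasses import dataclass


@dataclass
class ReflectionReadiness:
    has_rubric: bool            # (1) criteria are written down as a rubric
    first_pass_rate: float      # fraction of outputs passing evals unaided
    quality_threshold: float    # pass rate required in production
    generator_latency_s: float  # measured latency of the generator call
    critic_latency_s: float     # measured latency of one critic call
    latency_budget_s: float     # total time the caller can tolerate


def should_enable_reflection(r: ReflectionReadiness) -> bool:
    # (2) First-pass quality is below the production threshold.
    below_threshold = r.first_pass_rate < r.quality_threshold
    # (3) The budget covers the generator plus one additional critic call.
    fits_budget = r.generator_latency_s + r.critic_latency_s <= r.latency_budget_s
    # All three must hold; any single failure means reflection stays off.
    return r.has_rubric and below_threshold and fits_budget
```

All inputs here are meant to come from measurement rather than intuition, so the check is only as good as the eval set and latency data behind it.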
A customer-facing email agent added reflection to catch tone issues. It improved quality on roughly 15% of drafts — flagging overly technical language, missing empathy phrases. But it added 1.2s median latency to every response, even the 85% that needed no revision. After 3 weeks of LangSmith data, the team moved reflection behind a confidence gate: only trigger if the generator's self-reported tone_confidence < 0.7. Reflection now fires on ~20% of emails, quality stayed the same, and median latency dropped by 0.9s.
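A confidence gate like the one described in this example might look as follows. This is a sketch only: `maybe_reflect` and the `reflect` callable are hypothetical, and it assumes the generator returns a `tone_confidence` value alongside its draft. The 0.7 threshold is the one quoted in the example.

```python
from typing import Callable

TONE_CONFIDENCE_THRESHOLD = 0.7  # threshold quoted in the example above


def maybe_reflect(
    draft: str,
    tone_confidence: float,          # self-reported by the generator (assumed)
    reflect: Callable[[str], str],   # runs critique + revision, returns new draft
    threshold: float = TONE_CONFIDENCE_THRESHOLD,
) -> tuple[str, bool]:
    """Return (final_draft, reflection_ran)."""
    # Low confidence: pay for the extra LLM round-trip.
    if tone_confidence < threshold:
        return reflect(draft), True
    # High confidence: skip reflection and keep first-pass latency.
    return draft, False
```

Returning the `reflection_ran` flag makes it straightforward to track the trigger rate over time and to compare quality between gated and ungated drafts.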