Intermediate · 16 min

Reflection & Self-Critique

When to add a self-critique loop, what it costs, where it fails, and how to measure whether it's earning its latency. Includes conditional-edge and Command-based LangGraph implementations, production-shaped code with regression guards, and a clear comparison with Evaluator-Optimizer.

Quick Reference

  • Reflection = a second LLM call that scores the first output and decides whether to retry
  • Use a cheaper model as critic (Claude Haiku 4.5, GPT-5.4 mini, Gemini 3.1 Flash) — the critic role doesn't need frontier capability
  • Cap iteration depth at 2-3: beyond that, cost multiplies without proportional quality gain
  • Add a regression guard: if attempt N scores lower than attempt N-1, exit with the previous best draft
  • Reflection vs Evaluator-Optimizer: reflection is same-agent self-check; Eval-Opt is a two-role specialized loop — if you need fine-grained optimization feedback, use Eval-Opt
  • The Command pattern (LangGraph v1.0+) is cleaner than add_conditional_edges for reflection routing — routing logic lives inside the reflect node
  • Gate reflection behind a latency budget check: never add a blocking LLM call to a user-facing synchronous path without measuring the cost
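
The iteration cap and regression guard from the list above can be sketched in plain Python. This is a minimal illustration rather than a LangGraph implementation: `reflect_loop`, `Critique`, the 0.0-1.0 score scale, and the `pass_score` default are all assumptions made for the example, and `generate` and `critique` stand in for whatever model calls you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    score: float   # assumed scale: 0.0-1.0, higher is better
    feedback: str  # handed back to the generator on the next attempt


def reflect_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],  # (task, feedback) -> draft
    critique: Callable[[str, str], Critique],       # (task, draft) -> Critique
    pass_score: float = 0.8,   # hypothetical acceptance threshold
    max_attempts: int = 3,     # iteration cap
) -> tuple[str, float, int]:
    """Return (best_draft, best_score, attempts_used)."""
    best_draft, best_score = "", float("-inf")
    feedback: Optional[str] = None

    for attempt in range(1, max_attempts + 1):
        draft = generate(task, feedback)
        result = critique(task, draft)

        # Regression guard: the revision scored lower than the previous
        # attempt, so stop and keep the previous draft.
        if result.score < best_score:
            return best_draft, best_score, attempt

        best_draft, best_score = draft, result.score

        # Good enough: stop paying for further critic calls.
        if result.score >= pass_score:
            return best_draft, best_score, attempt

        feedback = result.feedback

    # Cap reached: return the best draft seen so far.
    return best_draft, best_score, max_attempts
```

Because the loop only continues while scores are non-decreasing, the best draft is always the one from the previous attempt, which matches the "exit with the previous best draft" rule. Passing the generator and critic in as callables also makes it easy to point the critic at a cheaper model.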

Should I Use Reflection?

Reflection adds latency and cost to every invocation. Before wiring it in, verify that the quality gains justify both. The honest answer is: reflection helps less often than people expect, and fails in ways that are hard to detect.

Scenario | Reflection value | Why
--- | --- | ---
Code generation | High | Critic can check syntax, missing edge cases, and security issues without needing ground truth
Research summaries | High | Critic can verify claims against the source material provided in context
Email / document drafts | Medium (gate it) | Useful for tone and structure; diminishing returns after one revision cycle
Creative writing | Low-medium | Style critiques loop indefinitely; define a termination rubric before enabling
Simple Q&A | Low | Factual errors require retrieval, not critique; reflection doesn't fix what it can't source
Tool-based lookups | None | The tool result is the answer; critiquing it adds latency without changing the facts

When NOT to reflect | Why reflection fails here | What to do instead
--- | --- | ---
User-facing synchronous chat | Each retry adds a full LLM round-trip; users notice latency above ~2s | Run reflection offline on a sample; use findings to improve the generator prompt
Generator already passes >95% of first-pass evals | Reflection fires on every call but improves almost none, so ROI is net negative | Monitor first-pass quality; re-enable only if it drops below your threshold
Task where tool results are the answer | The critic evaluates correct data as if it were a prose draft and 'improves' it into something wrong | Validate tool outputs with deterministic checks, not LLM critique
Open-ended creative tasks with no rubric | Without explicit criteria, the critic invents improvement axes, so quality becomes random | Define your rubric before enabling reflection, or use human review instead

The reflection gate

Reflection earns its keep when all three are true: (1) quality criteria are expressible as a rubric the LLM can follow, (2) first-pass quality is measurably below your production threshold on a meaningful fraction of inputs, AND (3) the latency budget accommodates at least 1 additional LLM call on every invocation.
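
As a rough sketch, the three conditions can be written as a single predicate. The names here (`ReflectionGateInputs`, `reflection_is_justified`) are hypothetical, and the sketch simplifies condition (2) by comparing one measured first-pass rate against a threshold; in practice you would decide what counts as a "meaningful fraction" from your own eval data.

```python
from dataclasses import dataclass


@dataclass
class ReflectionGateInputs:
    has_rubric: bool            # (1) criteria are written down as a rubric
    first_pass_rate: float      # fraction of outputs passing evals on the first attempt
    quality_threshold: float    # pass rate you require in production
    latency_budget_s: float     # end-to-end budget per invocation, in seconds
    generator_latency_s: float  # measured latency of one generator call
    critic_latency_s: float     # measured latency of one critic call


def reflection_is_justified(g: ReflectionGateInputs) -> bool:
    """True only when all three gate conditions hold."""
    # (2) First-pass quality falls short of the production threshold.
    quality_gap = g.first_pass_rate < g.quality_threshold
    # (3) Every invocation pays for at least one critic call on top of generation.
    fits_budget = g.generator_latency_s + g.critic_latency_s <= g.latency_budget_s
    return g.has_rubric and quality_gap and fits_budget
```

For instance, with a rubric in place, an 80% first-pass rate against a 95% target, and a 3 s budget covering a 1.5 s generator plus a 0.6 s critic, the predicate returns `True`; shrinking the budget to 2 s flips it to `False`.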

Real project

A customer-facing email agent added reflection to catch tone issues. It improved quality on roughly 15% of drafts — flagging overly technical language, missing empathy phrases. But it added 1.2s median latency to every response, even the 85% that needed no revision. After 3 weeks of LangSmith data, the team moved reflection behind a confidence gate: only trigger if the generator's self-reported tone_confidence < 0.7. Reflection now fires on ~20% of emails, quality stayed the same, and median latency dropped by 0.9s.
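
A confidence gate like the one this team used might look roughly like the following. The 0.7 threshold and the `tone_confidence` signal come from the example above; the function name `maybe_reflect` and the `reflect` callable are assumptions for illustration, and how the generator reports its confidence is left out.

```python
from typing import Callable

TONE_CONFIDENCE_GATE = 0.7  # threshold taken from the example above


def maybe_reflect(
    draft: str,
    tone_confidence: float,
    reflect: Callable[[str], str],  # hypothetical: runs the critique-and-revise step
    gate: float = TONE_CONFIDENCE_GATE,
) -> tuple[str, bool]:
    """Return (final_draft, reflected), reflecting only on low-confidence drafts."""
    if tone_confidence < gate:
        return reflect(draft), True
    # Confident drafts skip the extra LLM round-trip entirely.
    return draft, False
```

Returning the `reflected` flag alongside the draft makes it straightforward to log how often the gate fires and compare quality between gated and ungated responses.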
