Reflection & Self-Critique

When to add a self-critique loop, what it costs, where it fails, and how to measure whether it's earning its latency. Includes conditional-edge and Command-based LangGraph implementations, production-shaped code with regression guards, and a clear comparison with Evaluator-Optimizer.

Quick Reference

→Reflection = a second LLM call that scores the first output and decides whether to retry
→Use a cheaper model as critic (Claude Haiku 4.5, GPT-5.4 mini, Gemini 3.1 Flash) — the critic role doesn't need frontier capability
→Cap iteration depth at 2-3: beyond that, cost multiplies without proportional quality gain
→Add a regression guard: if attempt N scores lower than attempt N-1, exit with the previous best draft
→Reflection vs Evaluator-Optimizer: reflection is same-agent self-check; Eval-Opt is a two-role specialized loop — if you need fine-grained optimization feedback, use Eval-Opt
→The Command pattern (LangGraph v1.0+) is cleaner than add_conditional_edges for reflection routing — routing logic lives inside the reflect node
→Gate reflection behind a latency budget check: never add a blocking LLM call to a user-facing synchronous path without measuring the cost
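
The iteration cap and regression guard from the list above can be sketched in plain Python, independent of any framework. This is a minimal illustration, not the article's LangGraph implementation: `reflect_loop`, `Critique`, `pass_score`, and `max_attempts` are hypothetical names, and `generate` and `critique` stand in for whatever LLM calls you actually make.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    score: float   # assumed scale: 0.0-1.0, higher is better
    feedback: str  # handed back to the generator on the next attempt


def reflect_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Critique],
    pass_score: float = 0.8,   # hypothetical acceptance threshold
    max_attempts: int = 3,     # total drafts, including the first
) -> tuple[str, float, int]:
    """Return (best_draft, best_score, attempts_used)."""
    best_draft, best_score = "", float("-inf")
    feedback: Optional[str] = None

    for attempt in range(1, max_attempts + 1):
        draft = generate(task, feedback)
        result = critique(task, draft)

        # Regression guard: this attempt scored lower than the previous
        # one, so stop and keep the previous best draft.
        if result.score < best_score:
            return best_draft, best_score, attempt

        best_draft, best_score = draft, result.score

        # Good enough: stop before spending another round-trip.
        if result.score >= pass_score:
            return best_draft, best_score, attempt

        feedback = result.feedback

    # Iteration cap reached without passing; return the best draft seen.
    return best_draft, best_score, max_attempts
```

Because the loop exits on the first regression, scores are non-decreasing up to that point, so the "best" draft is always the one from the previous attempt.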

Should I Use Reflection?

Reflection adds latency and cost to every invocation. Before wiring it in, verify that the quality gains justify both. The honest answer is: reflection helps less often than people expect, and fails in ways that are hard to detect.

Scenario	Reflection value	Why
Code generation	High	Critic can check syntax, missing edge cases, and security issues without needing ground truth
Research summaries	High	Critic can verify claims against the source material provided in context
Email / document drafts	Medium — gate it	Useful for tone and structure; diminishing returns after one revision cycle
Creative writing	Low-medium	Style critiques loop indefinitely; define a termination rubric before enabling
Simple Q&A	Low	Factual errors require retrieval, not critique — reflection doesn't fix what it can't source
Tool-based lookups	None	The tool result is the answer; critiquing it adds latency without changing the facts

When NOT to reflect	Why reflection fails here	What to do instead
User-facing synchronous chat	Each retry adds a full LLM round-trip; users notice latency above ~2s	Run reflection offline on a sample; use findings to improve the generator prompt
Generator already passes >95% of first-pass evals	Reflection fires on every call but improves almost none, so ROI is net negative	Monitor first-pass quality; re-enable reflection only if the pass rate drops below your threshold
Tasks where tool results are the answer	The critic evaluates correct data as if it were a prose draft and 'improves' it into something wrong	Validate tool outputs with deterministic checks, not LLM critique
Open-ended creative tasks with no rubric	Without explicit criteria, the critic invents improvement axes — quality becomes random	Define your rubric before enabling reflection, or use human review instead

The reflection gate

Reflection earns its keep when all three are true: (1) quality criteria are expressible as a rubric the LLM can follow, (2) first-pass quality is measurably below your production threshold on a meaningful fraction of inputs, AND (3) the latency budget accommodates at least 1 additional LLM call on every invocation.
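
The three conditions can be written down as an explicit check before wiring reflection in. The sketch below is one possible encoding, not a prescribed one: every field name is hypothetical, and the 5% default for a "meaningful fraction" is an assumption chosen to match the >95% first-pass row in the table above.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    has_rubric: bool                 # (1) criteria are written as a rubric
    first_pass_failure_rate: float   # (2) fraction of outputs below threshold
    base_latency_s: float            # latency without reflection
    critic_latency_s: float          # latency of one additional LLM call
    latency_budget_s: float          # (3) what the product can tolerate
    min_failure_rate: float = 0.05   # assumed cutoff for "meaningful fraction"


def should_enable_reflection(g: GateInputs) -> bool:
    """True only when all three gate conditions hold."""
    quality_gap = g.first_pass_failure_rate >= g.min_failure_rate
    fits_budget = g.base_latency_s + g.critic_latency_s <= g.latency_budget_s
    return g.has_rubric and quality_gap and fits_budget
```

Feeding it measured numbers rather than estimates is the point: the failure rate should come from your eval set and the latencies from production traces.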

Real project

A customer-facing email agent added reflection to catch tone issues. It improved quality on roughly 15% of drafts — flagging overly technical language, missing empathy phrases. But it added 1.2s median latency to every response, even the 85% that needed no revision. After 3 weeks of LangSmith data, the team moved reflection behind a confidence gate: only trigger if the generator's self-reported tone_confidence < 0.7. Reflection now fires on ~20% of emails, quality stayed the same, and median latency dropped by 0.9s.
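
A confidence gate of the kind described here might look like the sketch below. The `tone_confidence` signal and the 0.7 threshold come from the example above; the function name `gated_reflection` and the `reflect` callable are hypothetical stand-ins for the team's actual code.

```python
from typing import Callable


def gated_reflection(
    draft: str,
    tone_confidence: float,
    reflect: Callable[[str], str],
    threshold: float = 0.7,
) -> tuple[str, bool]:
    """Run the reflection step only for low-confidence drafts.

    Returns (final_draft, reflection_ran).
    """
    if tone_confidence < threshold:
        # Low confidence: pay for the extra critic round-trip.
        return reflect(draft), True
    # High confidence: ship the first draft with no added latency.
    return draft, False
```

Returning the `reflection_ran` flag alongside the draft makes it straightforward to log how often the gate fires and to compare quality between gated and ungated responses.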


Reflection vs. Evaluator-Optimizer

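The next article in this chapter is Evaluator-Optimizer. The two look similar from a distance because both loop on a draft, but the difference matters for architecture decisions: reflection is a same-agent self-check, while Evaluator-Optimizer is a two-role specialized loop suited to fine-grained optimization feedback.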

What Will Reflection Cost?

Every reflection iteration is an extra LLM call. The cost compounds because the critic receives the full draft as input — so critic input tokens grow with draft length. Use this formula to budget before enabling:
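
As a rough budgeting sketch, assume each reflection iteration costs one critic call (task plus draft as input, critique as output) and one regeneration call (task, draft, and critique as input, a new draft as output). That per-iteration model, the function names, and the per-million-token pricing fields are all assumptions made for illustration, not figures or a formula taken from the article; substitute your own token counts and provider prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # price per 1M input tokens (placeholder)
    output_per_mtok: float  # price per 1M output tokens (placeholder)


def call_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call in the pricing's currency."""
    return (
        input_tokens * p.input_per_mtok + output_tokens * p.output_per_mtok
    ) / 1_000_000


def reflection_overhead(
    task_tokens: int,
    draft_tokens: int,
    critique_tokens: int,
    iterations: int,
    generator: Pricing,
    critic: Pricing,
) -> float:
    """Worst-case extra cost per invocation, assuming every iteration retries."""
    # The critic reads the task and the full draft, so its input grows
    # with draft length.
    critic_call = call_cost(critic, task_tokens + draft_tokens, critique_tokens)
    # The retry reads task, previous draft, and critique, then writes a
    # new draft of roughly the same length.
    retry_call = call_cost(
        generator, task_tokens + draft_tokens + critique_tokens, draft_tokens
    )
    return iterations * (critic_call + retry_call)
```

This is an upper bound: if the critic accepts a draft, the final regeneration call never happens, so real spend is usually lower.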
