Intermediate14 min

Hallucinations: The Engineering Response

LLMs hallucinate because they generate statistically plausible tokens, not verified facts. This article gives you real failure rates by domain, the full defense-in-depth stack with cost math, working code for Anthropic's Citations API, and an eval harness to measure hallucination in your own system.

Quick Reference

→Hallucination is inherent to next-token prediction — not a bug, not fixable by prompting alone
→Real rates (early 2026): legal research 58–88%, medical without grounding >60%, summarization with grounding <2%
→Five types: factual, faithful, instruction, attribution, reasoning — faithful hallucination defeats RAG if unchecked
→Anthropic Citations API: the single most impactful tool for RAG grounding — pins responses to exact source text
→Structured outputs eliminate free-form hallucination — use them whenever the output space is bounded
→Measure with LLM-as-judge: score faithfulness, groundedness, and answer relevance against a reference set
→Design principle: LLM proposes, deterministic system verifies — never let an LLM execute irreversible actions

In this article

1.How Bad Is the Problem? Real Numbers
2.The Five Types (and Which Ones Kill You)
3.When Models Hallucinate Most
4.Defense in Depth: The Mitigation Stack
5.The Citations API: Grounding Done Right
6.How to Measure Hallucination in Your System
7.Designing Systems That Handle Hallucination
8.First-30-Days Checklist
★Best Practices
✓Key Takeaways

How Bad Is the Problem? Real Numbers

Before building defenses, you need to know what you're defending against. Hallucination rates vary enormously by domain and task type — and the numbers are worse than most engineers expect.

Hallucination rates by domain — as of early 2026. Grounding collapses the gap.

Domain / Task	Hallucination rate (early 2026)	Source
Legal research citations	58–88%	Domain leaderboard analysis
Medical Q&A without grounding	>60%	Medical AI benchmarks
General knowledge (SimpleQA)	~47% (GPT-5)	OpenAI SimpleQA benchmark
API knowledge in code generation	~20%	ACM code generation study
Open-ended generation	40–80%	Academic NLP benchmarks
Summarization with grounding	<2%	Multiple 2025 summarization studies

The confidence trap

Hallucination rate has no correlation with how assertively the model writes. A model that says 'According to the report, revenue was $8M' may be wrong just as often as one that hedges — the phrasing reflects writing style, not certainty. Design for this: never trust tone as a proxy for accuracy.

Real project

A legal tech startup built a contract analysis tool using GPT-4 without grounding. In internal testing, the tool looked excellent — responses were well-structured and confident. In production, attorneys found that roughly 1 in 4 citations to specific contract clauses was fabricated: correct clause number, wrong content, or the clause simply didn't exist. The fix required a full redesign: retrieve the actual clause text first, then ask the model to analyze only what was retrieved. Hallucination on cited clauses dropped to near zero. The lesson: confident output is not grounded output.

Learn this in → This is why the Citations API exists — it forces grounding at the API level, not just in your prompts.

The Five Types (and Which Ones Kill You)

Not all hallucinations are equal. The type determines both how hard it is to detect and what defense is most effective.

Type	What happens	Example	Detection difficulty	Primary defense
Factual	States something factually wrong	"Python was created in 1995" (was 1991)	Hard — requires external truth source	External verification or grounded retrieval
Faithful	Contradicts provided context	Document says $5M, model says $8M	Moderate — compare output to source	Citations API or explicit quote requirement
Attribution	Fabricates sources, citations, URLs	"Smith et al. (2023) in Nature..." (paper doesn't exist)	Easy if you verify; invisible if you don't	Never trust LLM-generated citations; always verify
Instruction	Ignores or misreads explicit instructions	Asked for JSON, returns markdown	Easy — validate output schema	Structured outputs / constrained decoding
Reasoning	Wrong conclusion from correct premises	Correct arithmetic steps, wrong final answer	Hard — requires step-by-step verification	Chain-of-thought with explicit step checks

Faithful hallucination defeats RAG if unchecked

The whole point of RAG is to ground responses in retrieved documents. Faithful hallucination — where the model contradicts the context you provided — is the failure mode that makes RAG worthless. A model can retrieve the right document and still misrepresent what it says. This is why the Citations API exists: it forces the model to pin each claim to exact source text rather than paraphrase freely.

When Models Hallucinate Most

Hallucinations follow predictable patterns. Knowing the triggers helps you place guardrails where they matter.

▸Rare or niche topics: less training data → less reliable pattern completion. A model knows common Python APIs cold; it may fabricate obscure library internals.
▸Specific numbers, dates, and proper nouns: precise facts are stored poorly in neural network weights. Ask for "the exact revenue" and the model will invent a plausible-sounding figure.
▸Recent events past training cutoff: the model has no information but will still generate confident-sounding answers about events it cannot know.
▸Long reasoning chains: errors compound. A three-step derivation with 95% accuracy per step is only ~86% accurate at the end.
▸Forced answering: if there's no 'I don't know' option in your prompt, the model will fabricate rather than abstain.
▸Low-resource languages: models have less training signal in non-English languages and hallucinate more frequently.
▸Domain-specific jargon with multiple meanings: legal, medical, and financial terms that mean different things in different contexts cause cross-domain confusion.

The prompt structure that invites vs prevents hallucination

Defense in Depth: The Mitigation Stack

No single technique eliminates hallucination. The right approach is layered defenses proportional to the stakes. Each layer catches what the previous one misses. The cost of each layer is real — match depth to what a hallucination actually costs you.

Each layer catches what the previous misses. Match depth to cost of failure.

Layer	What it catches	Cost	When to use
Prompt design (abstain permission)	Forced-answering hallucination	Zero	Always — no cost not to
RAG / document grounding	Factual hallucination on in-scope topics	Retrieval pipeline + ~10% more input tokens	Any factual Q&A system
Citations API (Anthropic)	Faithful hallucination — catches model contradicting source	Slight input token increase; cited_text not billed as output	All RAG systems — replaces prompt-based citation hacks
Structured output / constrained decoding	Instruction hallucination, schema violations	Near-zero overhead (FSM-based token masking)	Any bounded output space
External verification	Factual + attribution hallucination	API call or DB query per claim	High-stakes claims: numbers, citations, URLs
Human review gate	All types	Human time — highest cost	Critical paths: medical, legal, financial

Match defense depth to cost of failure

A hallucination in a creative brainstorming tool costs nothing. The same hallucination in a medical diagnosis tool could kill someone. Map your use case to a defense tier: Low stakes → prompt design + structured output only. Medium stakes → add RAG + Citations API. High stakes → add external verification. Critical → add human review gate.

The Citations API: Grounding Done Right

Anthropic's Citations API (GA since January 2025) is the most production-ready solution for faithful hallucination in RAG systems. Instead of asking the model to quote from documents in its text output, the API pins each claim to exact source passages at the API level — guaranteeing citations are valid pointers, not fabrications.

What makes Citations different from prompt-based quoting

Prompt-based: 'Please cite your sources' → model generates citation text as output tokens (billed, potentially fabricated). Citations API: model outputs citation metadata pointing to character ranges in source documents. The cited_text field is returned for convenience but not billed as output tokens. Citations are structurally guaranteed to point to text that exists in the document you provided.

Citations API — production RAG with guaranteed grounding

Citations API and Structured Outputs are incompatible

You cannot enable both citations and structured output (output_config.format) in the same request. The API returns a 400 error. This is by design: citations require interleaving citation metadata with text output, which is incompatible with strict JSON schema constraints. Choose one based on your use case: citations for RAG grounding, structured output for bounded output spaces.

Combine Citations with prompt caching for large document sets

Apply cache_control: {type: 'ephemeral'} to your document content blocks. The document content is cached for up to 5 minutes; subsequent requests using the same document benefit from cache hits. cited_text is not counted toward output tokens in either cached or uncached requests.

How to Measure Hallucination in Your System

The article has told you to 'monitor hallucination rates in production.' Here's what that actually means in code.

Citation coverage + LLM-as-judge score → threshold gate → route to pass, flag, or reject

▸Reference-based: compare model output against a ground-truth answer set. High precision, requires curated test set, expensive to build.
▸LLM-as-judge: use a separate model call to score faithfulness and groundedness. Low setup cost, scales to production, requires calibration against human judgments.
▸Citation coverage: for grounded systems, track what fraction of claims have a citation. Low coverage = model is generating outside document scope.
▸User correction rate: log when users edit or reject outputs. Noisy but reflects real-world failure. Good for monitoring drift over time.

LLM-as-judge hallucination scorer — production-ready pattern

Set a hallucination rate threshold as a CI gate

Run your LLM-as-judge scorer on a fixed 50–100 question eval set in CI. Fail the build if hallucination_detected rate exceeds your threshold (e.g., >5% for medium-stakes, >1% for high-stakes). This catches regressions when you change prompts, switch models, or update retrieval. Without this gate, you won't know you've broken grounding until users tell you.

Designing Systems That Handle Hallucination

The most important insight about hallucination: you cannot eliminate it, so design for it. The architecture depends entirely on the stakes.

Stakes level	Example use cases	Appropriate architecture
Low	Creative brainstorming, draft generation, code autocomplete	Accept hallucination — it's a feature (creativity). No special defense needed.
Medium	Customer support, internal Q&A, code suggestions	Citations API + structured output + LLM-as-judge scorer in CI. Flag low-confidence responses for human review.
High	Medical information, legal research, financial analysis	Full defense stack: grounded retrieval + Citations API + external claim verification + human review gate before delivery.
Critical	Clinical decisions, legal filings, financial transactions	LLM proposes → deterministic system verifies → human executes. LLM never directly acts.

The classification trick: bound the output space

LLMs hallucinate least when choosing from a fixed set of options. They hallucinate most when generating free-form text. Whenever possible, reframe the task: instead of 'explain the customer's issue,' ask 'classify this issue as: billing | technical | account | other.' Classification plus structured output gives you mathematical guarantees via constrained decoding — the model cannot output something outside the schema.

▸Separate LLM reasoning from action execution — the model decides what to do, a deterministic system does it
▸Build abstain paths — 'I don't have this information' is a correct answer; design your UX to show it
▸Log every hallucination your eval catches — they are a training signal and a product insight
▸When users correct outputs, that's a hallucination signal — route corrections back into your eval set
▸Monitor hallucination rates over time — prompt drift, model updates, and retrieval quality changes all affect rates

First-30-Days Checklist

These steps are ordered. Each one unlocks the next. Don't skip to external verification before you've built grounding — you won't know what you're verifying.

▸Day 1–3: Add abstain permission to all prompts. Give every LLM call an explicit 'say I don't know if the answer isn't available' instruction. Zero cost, immediate improvement.
▸Day 4–7: Add structured output to every endpoint with a bounded output space. If the response is JSON, schema, or a classification — use constrained decoding. This eliminates instruction hallucination.
▸Day 8–14: Build a 50-question eval set for your primary use case. Include cases where the answer is NOT in the document. Run LLM-as-judge scorer. Establish your baseline hallucination rate.
▸Day 15–21: Integrate Citations API for any RAG-based flow. Compare your before/after grounding score on the eval set. Expect a significant drop in faithful hallucination.
▸Day 22–28: Add the eval set to CI. Gate on hallucination rate. You now have a regression detector — you'll know immediately when a prompt change or model update degrades grounding.
▸Day 29–30: Add monitoring. Log hallucination_detected=true events from your scorer in production. Set an alert if rate exceeds threshold. You now have visibility into production drift.

Best Practices

✓Add explicit abstain permission to every prompt: 'If the answer isn't in the provided context, say so'
✓Use the Citations API for all RAG systems — it pins responses to exact source text at the API level
✓Use structured output (constrained decoding) for any bounded output space — eliminates instruction hallucination
✓Build an eval set of 50+ questions including unanswerable cases, run LLM-as-judge weekly
✓Gate CI on hallucination rate — fail the build if grounding score drops below threshold
✓Monitor citation coverage in production — low coverage signals the model is generating outside document scope
✓Match defense depth to failure cost: low stakes needs only abstain + structured output; high stakes needs full stack
✓Log user corrections as a hallucination signal and route them into your eval set
✓Use a fast model (claude-haiku-4-5-20251001) for LLM-as-judge scoring to keep eval costs down

Don’t

✗Don't trust LLM-generated citations, URLs, or paper references without external verification — attribution hallucination is very common
✗Don't use tone or confidence as a proxy for accuracy — assertive phrasing does not correlate with correctness
✗Don't force the model to always answer — a well-designed abstain is more trustworthy than a fabricated answer
✗Don't run self-consistency checks with string matching — semantic equivalence requires semantic comparison, not Counter()
✗Don't assume asking 'Are you sure?' catches hallucination — the model will say 'Yes, I'm sure' and still be wrong
✗Don't use LLMs for precise numerical calculations, date lookups, or tasks requiring exact recall — use tools or databases
✗Don't combine Citations API with Structured Outputs in the same request — the API returns a 400 error
✗Don't rely on a single defense layer for high-stakes use cases — hallucination defense must be layered
✗Don't skip the eval set — monitoring without a baseline is just watching numbers you can't interpret

Key Takeaways

✓Hallucination rates range from <2% (summarization with grounding) to >60% (medical without grounding) — know your domain's baseline before building defenses.
✓Faithful hallucination defeats RAG: a model can retrieve the right document and still contradict it. The Citations API pins claims to exact source text at the API level.
✓Five hallucination types require different defenses: faithful → Citations API; instruction → structured output; attribution → external verification; factual → RAG grounding; reasoning → step-by-step verification.
✓Structured outputs eliminate instruction hallucination via constrained decoding — the model cannot generate tokens outside your schema.
✓Measure with LLM-as-judge: score faithfulness and groundedness on a fixed eval set in CI. Gate on hallucination rate to catch regressions before production.
✓Design principle: LLM proposes, deterministic system verifies, human executes — never let an LLM directly take irreversible actions in high-stakes systems.

Video on this topic

Why ChatGPT makes things up (and always will)

tiktok

←

Context Windows & Context Management

Model Families Compared

→