Hallucinations: The Engineering Response
LLMs hallucinate because they generate statistically plausible tokens, not verified facts. This article gives you real failure rates by domain, the full defense-in-depth stack with cost math, working code for Anthropic's Citations API, and an eval harness to measure hallucination in your own system.
Quick Reference
- →Hallucination is inherent to next-token prediction — not a bug, not fixable by prompting alone
- →Real rates (early 2026): legal research 58–88%, medical without grounding >60%, summarization with grounding <2%
- →Five types: factual, faithful, instruction, attribution, reasoning — faithful hallucination defeats RAG if unchecked
- →Anthropic Citations API: the single most impactful tool for RAG grounding — pins responses to exact source text
- →Structured outputs eliminate free-form hallucination — use them whenever the output space is bounded
- →Measure with LLM-as-judge: score faithfulness, groundedness, and answer relevance against a reference set
- →Design principle: LLM proposes, deterministic system verifies — never let an LLM execute irreversible actions
In this article
- 1.How Bad Is the Problem? Real Numbers
- 2.The Five Types (and Which Ones Kill You)
- 3.When Models Hallucinate Most
- 4.Defense in Depth: The Mitigation Stack
- 5.The Citations API: Grounding Done Right
- 6.How to Measure Hallucination in Your System
- 7.Designing Systems That Handle Hallucination
- 8.First-30-Days Checklist
- ★Best Practices
- ✓Key Takeaways
How Bad Is the Problem? Real Numbers
Before building defenses, you need to know what you're defending against. Hallucination rates vary enormously by domain and task type — and the numbers are worse than most engineers expect.
Hallucination rates by domain — as of early 2026. Grounding collapses the gap.
| Domain / Task | Hallucination rate (early 2026) | Source |
|---|---|---|
| Legal research citations | 58–88% | Domain leaderboard analysis |
| Medical Q&A without grounding | >60% | Medical AI benchmarks |
| General knowledge (SimpleQA) | ~47% (GPT-5) | OpenAI SimpleQA benchmark |
| API knowledge in code generation | ~20% | ACM code generation study |
| Open-ended generation | 40–80% | Academic NLP benchmarks |
| Summarization with grounding | <2% | Multiple 2025 summarization studies |
Hallucination rate has no correlation with how assertively the model writes. A model that says 'According to the report, revenue was $8M' may be wrong just as often as one that hedges — the phrasing reflects writing style, not certainty. Design for this: never trust tone as a proxy for accuracy.
A legal tech startup built a contract analysis tool using GPT-4 without grounding. In internal testing, the tool looked excellent — responses were well-structured and confident. In production, attorneys found that roughly 1 in 4 citations to specific contract clauses was fabricated: correct clause number, wrong content, or the clause simply didn't exist. The fix required a full redesign: retrieve the actual clause text first, then ask the model to analyze only what was retrieved. Hallucination on cited clauses dropped to near zero. The lesson: confident output is not grounded output.
Learn this in → This is why the Citations API exists — it forces grounding at the API level, not just in your prompts.
The Five Types (and Which Ones Kill You)
Not all hallucinations are equal. The type determines both how hard it is to detect and what defense is most effective.
| Type | What happens | Example | Detection difficulty | Primary defense |
|---|---|---|---|---|
| Factual | States something factually wrong | "Python was created in 1995" (was 1991) | Hard — requires external truth source | External verification or grounded retrieval |
| Faithful | Contradicts provided context | Document says $5M, model says $8M | Moderate — compare output to source | Citations API or explicit quote requirement |
| Attribution | Fabricates sources, citations, URLs | "Smith et al. (2023) in Nature..." (paper doesn't exist) | Easy if you verify; invisible if you don't | Never trust LLM-generated citations; always verify |
| Instruction | Ignores or misreads explicit instructions | Asked for JSON, returns markdown | Easy — validate output schema | Structured outputs / constrained decoding |
| Reasoning | Wrong conclusion from correct premises | Correct arithmetic steps, wrong final answer | Hard — requires step-by-step verification | Chain-of-thought with explicit step checks |
The whole point of RAG is to ground responses in retrieved documents. Faithful hallucination — where the model contradicts the context you provided — is the failure mode that makes RAG worthless. A model can retrieve the right document and still misrepresent what it says. This is why the Citations API exists: it forces the model to pin each claim to exact source text rather than paraphrase freely.
When Models Hallucinate Most
Hallucinations follow predictable patterns. Knowing the triggers helps you place guardrails where they matter.
- ▸Rare or niche topics: less training data → less reliable pattern completion. A model knows common Python APIs cold; it may fabricate obscure library internals.
- ▸Specific numbers, dates, and proper nouns: precise facts are stored poorly in neural network weights. Ask for "the exact revenue" and the model will invent a plausible-sounding figure.
- ▸Recent events past training cutoff: the model has no information but will still generate confident-sounding answers about events it cannot know.
- ▸Long reasoning chains: errors compound. A three-step derivation with 95% accuracy per step is only ~86% accurate at the end.
- ▸Forced answering: if there's no 'I don't know' option in your prompt, the model will fabricate rather than abstain.
- ▸Low-resource languages: models have less training signal in non-English languages and hallucinate more frequently.
- ▸Domain-specific jargon with multiple meanings: legal, medical, and financial terms that mean different things in different contexts cause cross-domain confusion.
Defense in Depth: The Mitigation Stack
No single technique eliminates hallucination. The right approach is layered defenses proportional to the stakes. Each layer catches what the previous one misses. The cost of each layer is real — match depth to what a hallucination actually costs you.
Each layer catches what the previous misses. Match depth to cost of failure.
| Layer | What it catches | Cost | When to use |
|---|---|---|---|
| Prompt design (abstain permission) | Forced-answering hallucination | Zero | Always — no cost not to |
| RAG / document grounding | Factual hallucination on in-scope topics | Retrieval pipeline + ~10% more input tokens | Any factual Q&A system |
| Citations API (Anthropic) | Faithful hallucination — catches model contradicting source | Slight input token increase; cited_text not billed as output | All RAG systems — replaces prompt-based citation hacks |
| Structured output / constrained decoding | Instruction hallucination, schema violations | Near-zero overhead (FSM-based token masking) | Any bounded output space |
| External verification | Factual + attribution hallucination | API call or DB query per claim | High-stakes claims: numbers, citations, URLs |
| Human review gate | All types | Human time — highest cost | Critical paths: medical, legal, financial |
A hallucination in a creative brainstorming tool costs nothing. The same hallucination in a medical diagnosis tool could kill someone. Map your use case to a defense tier: Low stakes → prompt design + structured output only. Medium stakes → add RAG + Citations API. High stakes → add external verification. Critical → add human review gate.
The Citations API: Grounding Done Right
Anthropic's Citations API (GA since January 2025) is the most production-ready solution for faithful hallucination in RAG systems. Instead of asking the model to quote from documents in its text output, the API pins each claim to exact source passages at the API level — guaranteeing citations are valid pointers, not fabrications.
Prompt-based: 'Please cite your sources' → model generates citation text as output tokens (billed, potentially fabricated). Citations API: model outputs citation metadata pointing to character ranges in source documents. The cited_text field is returned for convenience but not billed as output tokens. Citations are structurally guaranteed to point to text that exists in the document you provided.
You cannot enable both citations and structured output (output_config.format) in the same request. The API returns a 400 error. This is by design: citations require interleaving citation metadata with text output, which is incompatible with strict JSON schema constraints. Choose one based on your use case: citations for RAG grounding, structured output for bounded output spaces.
Apply cache_control: {type: 'ephemeral'} to your document content blocks. The document content is cached for up to 5 minutes; subsequent requests using the same document benefit from cache hits. cited_text is not counted toward output tokens in either cached or uncached requests.
How to Measure Hallucination in Your System
The article has told you to 'monitor hallucination rates in production.' Here's what that actually means in code.
Citation coverage + LLM-as-judge score → threshold gate → route to pass, flag, or reject
- ▸Reference-based: compare model output against a ground-truth answer set. High precision, requires curated test set, expensive to build.
- ▸LLM-as-judge: use a separate model call to score faithfulness and groundedness. Low setup cost, scales to production, requires calibration against human judgments.
- ▸Citation coverage: for grounded systems, track what fraction of claims have a citation. Low coverage = model is generating outside document scope.
- ▸User correction rate: log when users edit or reject outputs. Noisy but reflects real-world failure. Good for monitoring drift over time.
Run your LLM-as-judge scorer on a fixed 50–100 question eval set in CI. Fail the build if hallucination_detected rate exceeds your threshold (e.g., >5% for medium-stakes, >1% for high-stakes). This catches regressions when you change prompts, switch models, or update retrieval. Without this gate, you won't know you've broken grounding until users tell you.
Designing Systems That Handle Hallucination
The most important insight about hallucination: you cannot eliminate it, so design for it. The architecture depends entirely on the stakes.
| Stakes level | Example use cases | Appropriate architecture |
|---|---|---|
| Low | Creative brainstorming, draft generation, code autocomplete | Accept hallucination — it's a feature (creativity). No special defense needed. |
| Medium | Customer support, internal Q&A, code suggestions | Citations API + structured output + LLM-as-judge scorer in CI. Flag low-confidence responses for human review. |
| High | Medical information, legal research, financial analysis | Full defense stack: grounded retrieval + Citations API + external claim verification + human review gate before delivery. |
| Critical | Clinical decisions, legal filings, financial transactions | LLM proposes → deterministic system verifies → human executes. LLM never directly acts. |
LLMs hallucinate least when choosing from a fixed set of options. They hallucinate most when generating free-form text. Whenever possible, reframe the task: instead of 'explain the customer's issue,' ask 'classify this issue as: billing | technical | account | other.' Classification plus structured output gives you mathematical guarantees via constrained decoding — the model cannot output something outside the schema.
- ▸Separate LLM reasoning from action execution — the model decides what to do, a deterministic system does it
- ▸Build abstain paths — 'I don't have this information' is a correct answer; design your UX to show it
- ▸Log every hallucination your eval catches — they are a training signal and a product insight
- ▸When users correct outputs, that's a hallucination signal — route corrections back into your eval set
- ▸Monitor hallucination rates over time — prompt drift, model updates, and retrieval quality changes all affect rates
First-30-Days Checklist
These steps are ordered. Each one unlocks the next. Don't skip to external verification before you've built grounding — you won't know what you're verifying.
- ▸Day 1–3: Add abstain permission to all prompts. Give every LLM call an explicit 'say I don't know if the answer isn't available' instruction. Zero cost, immediate improvement.
- ▸Day 4–7: Add structured output to every endpoint with a bounded output space. If the response is JSON, schema, or a classification — use constrained decoding. This eliminates instruction hallucination.
- ▸Day 8–14: Build a 50-question eval set for your primary use case. Include cases where the answer is NOT in the document. Run LLM-as-judge scorer. Establish your baseline hallucination rate.
- ▸Day 15–21: Integrate Citations API for any RAG-based flow. Compare your before/after grounding score on the eval set. Expect a significant drop in faithful hallucination.
- ▸Day 22–28: Add the eval set to CI. Gate on hallucination rate. You now have a regression detector — you'll know immediately when a prompt change or model update degrades grounding.
- ▸Day 29–30: Add monitoring. Log hallucination_detected=true events from your scorer in production. Set an alert if rate exceeds threshold. You now have visibility into production drift.
Best Practices
Do
- ✓Add explicit abstain permission to every prompt: 'If the answer isn't in the provided context, say so'
- ✓Use the Citations API for all RAG systems — it pins responses to exact source text at the API level
- ✓Use structured output (constrained decoding) for any bounded output space — eliminates instruction hallucination
- ✓Build an eval set of 50+ questions including unanswerable cases, run LLM-as-judge weekly
- ✓Gate CI on hallucination rate — fail the build if grounding score drops below threshold
- ✓Monitor citation coverage in production — low coverage signals the model is generating outside document scope
- ✓Match defense depth to failure cost: low stakes needs only abstain + structured output; high stakes needs full stack
- ✓Log user corrections as a hallucination signal and route them into your eval set
- ✓Use a fast model (claude-haiku-4-5-20251001) for LLM-as-judge scoring to keep eval costs down
Don’t
- ✗Don't trust LLM-generated citations, URLs, or paper references without external verification — attribution hallucination is very common
- ✗Don't use tone or confidence as a proxy for accuracy — assertive phrasing does not correlate with correctness
- ✗Don't force the model to always answer — a well-designed abstain is more trustworthy than a fabricated answer
- ✗Don't run self-consistency checks with string matching — semantic equivalence requires semantic comparison, not Counter()
- ✗Don't assume asking 'Are you sure?' catches hallucination — the model will say 'Yes, I'm sure' and still be wrong
- ✗Don't use LLMs for precise numerical calculations, date lookups, or tasks requiring exact recall — use tools or databases
- ✗Don't combine Citations API with Structured Outputs in the same request — the API returns a 400 error
- ✗Don't rely on a single defense layer for high-stakes use cases — hallucination defense must be layered
- ✗Don't skip the eval set — monitoring without a baseline is just watching numbers you can't interpret
Key Takeaways
- ✓Hallucination rates range from <2% (summarization with grounding) to >60% (medical without grounding) — know your domain's baseline before building defenses.
- ✓Faithful hallucination defeats RAG: a model can retrieve the right document and still contradict it. The Citations API pins claims to exact source text at the API level.
- ✓Five hallucination types require different defenses: faithful → Citations API; instruction → structured output; attribution → external verification; factual → RAG grounding; reasoning → step-by-step verification.
- ✓Structured outputs eliminate instruction hallucination via constrained decoding — the model cannot generate tokens outside your schema.
- ✓Measure with LLM-as-judge: score faithfulness and groundedness on a fixed eval set in CI. Gate on hallucination rate to catch regressions before production.
- ✓Design principle: LLM proposes, deterministic system verifies, human executes — never let an LLM directly take irreversible actions in high-stakes systems.
Video on this topic
Why ChatGPT makes things up (and always will)
tiktok