LLM Foundations/Prompt Engineering as a Discipline

Intermediate16 min

Techniques That Work

Six prompt engineering techniques — few-shot, chain-of-thought, self-consistency, role prompting, decomposition, and cost math — with the decision framework for picking the right one and the cost arithmetic every production engineer needs before committing.

Quick Reference

→Few-shot: 3–8 input→output examples in the prompt — highest ROI for classification and extraction
→Chain-of-thought (CoT): 'Think step by step' — helps reasoning tasks on smaller/non-reasoning models; weaker benefit on frontier models that reason natively
→Self-consistency: sample N responses, take majority — 2025 RASC method cuts compute 70% vs naive voting
→Role prompting: specific expert persona in system prompt — shifts vocabulary and decision lens at zero token cost
→Decomposition: split complex tasks into focused subtasks — most reliable for multi-step problems with intermediate validation
→CoT adds 3–8× output tokens — at $15/1M output tokens (GPT-5.4), that's real budget impact
→Self-consistency N=5 costs ~15× baseline — reserve for high-stakes decisions where accuracy justifies the spend
→Measure on your data before committing — benchmark gains don't transfer to your task without verification

In this article

1.When to Use Which Technique
2.Few-Shot Prompting
3.Chain-of-Thought Reasoning
4.Self-Consistency and Voting Strategies
5.Role Prompting
6.Task Decomposition
7.Cost Math: What Techniques Actually Cost
8.Measuring Technique Effectiveness
9.How Techniques Fail in Production
★Best Practices
✓Key Takeaways

When to Use Which Technique

The most common mistake in prompt engineering is applying techniques by habit rather than by task type. Chain-of-thought on a spam classifier burns tokens with no accuracy gain. Zero-shot on a complex reasoning task leaves performance on the table. The decision is: what does the model actually need to succeed at this task?

Task type → technique — not every task needs CoT or self-consistency

Task type	Best starting technique	When to add more
Simple classification (binary/enum)	Zero-shot or few-shot (3 examples)	Add few-shot if accuracy < target
Extraction (JSON, entities, slots)	Few-shot with structured output format	Add role prompt for domain terminology
Math / logic / reasoning	CoT (non-reasoning models) or zero-shot (reasoning models)	Add self-consistency for high-stakes decisions
Complex multi-step pipeline	Decomposition into focused subtasks	Different models per step based on cost
Domain-specific review / analysis	Role prompting + CoT or few-shot	Self-consistency if decisions are high-value
Creative / open-ended generation	Zero-shot or role prompting	Avoid CoT — it reduces variance and creativity

Reasoning models change the CoT calculus

Frontier models like GPT-5.4 and Claude Opus 4.7 already perform internal chain-of-thought. Adding 'Think step by step' to a prompt sent to these models typically yields no accuracy improvement while paying 3–8× the output token cost. Save CoT for non-reasoning models (gpt-5.4-mini, gpt-5.4-nano, claude-haiku-4-5) or tasks where you want to see the reasoning trace in the output.

Few-Shot Prompting

Few-shot prompting is the single most deployed prompt engineering technique in production. You provide 3–8 input→output examples that demonstrate the exact format, tone, and reasoning pattern you want. The model treats them as implicit instructions — without writing a single rule. It works because the model's pattern-matching capabilities are stronger than its ability to interpret abstract instructions for novel output formats.

Few-shot for structured extraction — production pattern

Few-shot quality beats quantity

3 well-chosen examples consistently outperform 8 mediocre ones. Select examples that cover edge cases and failure modes — the cases the model is most likely to get wrong. For classification tasks, include at least one example per class. More than 8 examples adds cost and can cause the model to overfit to the example format rather than generalize.

▸Place examples in the system prompt (not the user turn) — keeps the conversation turn clean for the actual input
▸Format examples exactly as you want output formatted — the model copies format precisely
▸For JSON extraction, include one example per likely category your system will see
▸Dynamic few-shot (retrieving relevant examples per query) can improve accuracy further but adds latency — worth it for high-value decisions
▸Few-shot is not few-shot fine-tuning: examples exist only within the context window, not baked into weights

Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting asks the model to produce its reasoning before giving the final answer. Wei et al. (2022) showed significant improvements on reasoning benchmarks, particularly for smaller models that do not reason natively. The key update for 2026: the benefit is highly model-dependent. For non-reasoning models, CoT still reliably improves multi-step reasoning. For frontier reasoning models (GPT-5.4, Claude Opus 4.7), the model already reasons internally — adding CoT to the prompt mostly just moves that reasoning into visible output tokens you pay for.

CoT with explicit answer extraction — use on non-reasoning models

Task	Model type	Use CoT?	Reason
Multi-step math / logic	Non-reasoning (mini, haiku)	Yes	Significant accuracy gain; model needs explicit steps
Multi-step math / logic	Reasoning (GPT-5.4, Opus 4.7)	No	Model reasons internally; CoT just adds output cost
Code debugging (complex)	Any	Yes (structured)	Trace through execution explicitly to catch edge cases
Spam / sentiment classification	Any	No	Zero accuracy gain; 3–8× token waste
Structured extraction (JSON)	Any	No	Few-shot is better; CoT does not improve format adherence
Auditable decisions	Any	Yes	You need the reasoning trace in the output for logging

Always extract the final answer explicitly

Prompt the model to end with a structured marker (e.g., 'ANSWER: <value>' or 'DECISION: <yes/no>'). Without this, parsing the final answer out of a reasoning trace requires fragile text processing. Use regex on the marker — never try to parse the last sentence of a reasoning chain.

Self-Consistency and Voting Strategies

Self-consistency generates multiple responses at higher temperature and takes the majority answer. The original Wang et al. (2022) method simply samples N times and counts. This is still correct, but 2025 research produced two significantly more efficient approaches: RASC (Reasoning-Aware Self-Consistency, which cuts sample count by ~70% with adaptive stopping) and CISC (Confidence-Improved Self-Consistency, which weights votes by confidence score and matches naive accuracy with 46% fewer samples). In production you should use one of these — not the naive baseline.

Confidence-weighted self-consistency (CISC) — production baseline

Self-consistency perpetuates systematic errors

A 2025 Stanford HAI study found that self-consistency perpetuated errors in 22% of legal reasoning cases where the majority of paths contained the same logical flaw. If all N samples start from the same flawed premise, majority vote locks in the wrong answer with high confidence. Self-consistency reduces random variance — it cannot fix systematic reasoning errors. If you see high confidence + wrong answer, the prompt has a structural flaw.

When NOT to use self-consistency

Skip self-consistency for open-ended generation (writing, summarization, brainstorming) — there is no majority answer to vote on. Also skip for tasks where latency matters and the baseline accuracy is already acceptable. Reserve self-consistency for high-stakes decisions where accuracy is worth N× inference cost and there is a single correct answer to converge on.

Role Prompting

Role prompting assigns the model a specific expert persona in the system prompt. It's the only technique here with near-zero cost — it shifts the model's vocabulary, decision framing, and depth of domain knowledge without adding significant tokens. The catch: generic roles ('You are a helpful assistant') have no measurable effect. The specificity of the role determines how much behavior shifts.

Specific vs generic role — measurable difference in code review quality

Real project

A fintech team was using role prompting for contract analysis with 'You are a legal expert.' The model's outputs were generic and missed jurisdiction-specific clauses. Switching to 'You are a senior commercial contracts attorney specializing in SaaS agreements under US law, with particular attention to liability caps, IP ownership, and data processing clauses' cut false-negative rate on critical clause detection from 31% to 11% on their eval set — with zero additional tokens in the average response.

Learn this in → Role specificity is a free accuracy improvement — the model already has the knowledge, the role is the retrieval key.

▸Include the domain, seniority level, and the specific lens (what to prioritize) — all three matter
▸Combine role prompting with another technique: role + CoT for analysis tasks, role + few-shot for extraction
▸Do not use role prompting to bypass safety behaviors — it does not and should not work
▸Measure: run your eval set with and without the specific role before committing to it in production

Task Decomposition

Task decomposition breaks a complex prompt into a sequence of focused LLM calls, each with a single clear objective. It is the most reliable technique for complex problems because it reduces context requirements per call, makes intermediate outputs inspectable, and allows retrying failed steps without rerunning the full pipeline. It also enables model tiering: cheap models for extraction steps, expensive models only for judgment steps.

Document classification pipeline — realistic production example

Decomposition heuristic

If your prompt contains 'also' or 'additionally' more than twice, it's doing too much in one call. If your prompt is longer than 500 tokens of instructions (not counting input data), decompose. Each subtask should have exactly one success criterion you can verify in code before proceeding.

Cost Math: What Techniques Actually Cost

Every technique has a token cost. Here is the actual arithmetic using GPT-5.4 pricing ($2.50/1M input tokens, $15/1M output tokens as of April 2026). Adjust the baseline if you use a different model, but the multipliers hold across providers. Assumptions: a typical classification/reasoning task with 200 input tokens (system + prompt) and 100 output tokens at baseline.

Cost-per-call arithmetic for each technique

Cost multiplier (log scale) vs accuracy gain — dashed line = Pareto frontier

Self-consistency at scale is a budget line item

At 100K calls/day, self-consistency N=5 with CoT on GPT-5.4 costs $4,000/day ($120K/month). That is a team member's salary. Run the math for your volume before choosing self-consistency for a high-traffic path. CISC (confidence-weighted voting) typically achieves the same accuracy with N=3–4, cutting the cost to $80K–90K/month — still significant, but more defensible.

Measuring Technique Effectiveness

Every technique adds cost. Whether that cost is worth it depends entirely on your task and data. There is no shortcut here: you need an eval set with ground truth, a judge function, and statistical significance before committing a technique to production.

Eval harness with statistical significance — production baseline

50 test cases is not enough

With N=50, a 5% accuracy gap has a high chance of being noise. Use at least 100 cases for any comparison you plan to act on. Use McNemar's test (shown above) rather than proportion z-tests — it accounts for the paired nature of the comparison and is substantially more sensitive to real improvements.

How Techniques Fail in Production

Each technique has a characteristic failure mode. Knowing these lets you build defenses into your pipeline rather than discovering them in production.

Technique	Characteristic failure	Defense
Chain-of-thought	Model reasons confidently to the wrong answer — hallucination amplification. A flawed premise in step 1 propagates through all steps.	Add an explicit self-check step: 'Review your reasoning for any unverified assumptions.' Parse reasoning trace in logging so you can audit failure cases.
Self-consistency	Majority-wrong: all N paths share the same systematic error, producing a high-confidence wrong answer.	Monitor agreement_rate — a 100% agreement score on a difficult question is a red flag, not a green one. Add a second-opinion call with a different system prompt when agreement < 0.5.
Few-shot	Distribution shift: your examples cover Monday morning traffic, but production sees Friday edge cases the examples never demonstrated.	Refresh your example set when accuracy drops on a weekly eval run. Log cases where the model deviated significantly from example format.
Role prompting	Role drift in long conversations: the model gradually abandons the persona as the conversation grows and context pushes the system prompt toward the edges.	Re-assert the role in a system injection every N turns for long conversations. For single-turn tasks, role drift is not an issue.
Decomposition	Silent propagation: step 1 extracts a field incorrectly, step 2 classifies based on that wrong value, step 3 generates a response based on a wrong classification — each step confident.	Validate intermediate outputs in code before passing to the next step. Add a schema check (Pydantic) on each JSON output. Log all intermediate results to a trace store.

Log intermediate outputs in production

The most valuable production data is not the final output — it's the reasoning trace and intermediate steps. Log CoT reasoning, self-consistency paths, and decomposition intermediates to a structured store. When accuracy drops on your weekly eval, you need to see whether the failure happened at extraction, classification, or generation. Without traces, debugging prompt regressions is guesswork.

Best Practices

✓Start with zero-shot; add techniques only when you have measured accuracy below your target on a real eval set
✓Use few-shot (3–5 examples) for classification and extraction — highest ROI of any technique
✓Reserve CoT for non-reasoning models and tasks where you need the reasoning trace in the output
✓Reserve self-consistency for high-stakes decisions where accuracy justifies N× inference cost
✓Log CoT traces and decomposition intermediates to a structured store — you need them when debugging regressions
✓Use Pydantic models to validate JSON output at every decomposition step boundary before proceeding
✓Run at least 100 test cases before concluding one technique is better than another
✓Apply McNemar's test for paired comparisons — proportion z-tests overstate significance at typical eval set sizes
✓Combine techniques deliberately: role + few-shot for domain extraction; CoT + self-consistency only for high-stakes reasoning
✓Recalculate cost math when you change models or providers — output token prices vary 10–20× across model tiers

Don’t

✗Don't apply CoT to frontier reasoning models (GPT-5.4, Claude Opus 4.7) without measuring — you're likely paying 4× for no gain
✗Don't use self-consistency for open-ended generation — there is no majority answer to converge on
✗Don't treat high self-consistency agreement as a confidence signal — it means low variance, not correct answer
✗Don't use fewer than 100 test cases to compare techniques — the noise floor is too high at N=50
✗Don't skip the cost calculation before committing a technique to a high-traffic endpoint
✗Don't use generic roles ('You are a helpful assistant') — they have no measurable effect; specificity is what shifts behavior
✗Don't stack all techniques simultaneously — returns diminish quickly and debugging complexity grows
✗Don't assume benchmark improvements transfer to your task — always verify on your data before deploying
✗Don't use a single prompt for complex multi-step tasks when intermediate validation could catch errors early
✗Don't omit the ANSWER: extraction marker from CoT prompts — parsing the last sentence of a reasoning trace is fragile

Key Takeaways

✓Few-shot prompting is the highest-ROI technique: 3–5 examples unlock reliable classification and extraction at ~1.1× baseline cost.
✓CoT helps non-reasoning models substantially — but frontier models (GPT-5.4, Opus 4.7) reason internally, so adding CoT mostly adds output token cost.
✓Self-consistency reduces random variance but amplifies systematic errors — a 100% agreement rate on a hard question is a red flag, not a green one.
✓Decomposition enables model tiering: cheap models for extraction, expensive models only for judgment — often cheaper than single-call baselines.
✓Self-consistency N=5 with CoT on GPT-5.4 costs 20× the zero-shot baseline — at 100K calls/day that is $4,000/day; run the math before committing.
✓Always measure on your data: techniques that improve benchmarks by 10% regularly show no improvement on production tasks outside the benchmark distribution.

Video on this topic

Prompt engineering techniques that actually work

instagram

←

Prompt Anatomy

Structured Output Techniques

→