LLM Foundations / Prompt Engineering as a Discipline
Intermediate · 11 min

Techniques That Work

Evidence-based prompt engineering techniques: chain-of-thought reasoning, self-consistency, role prompting, and step-by-step decomposition. When each technique helps, when it hurts, and how to measure the improvement.

Quick Reference

  • Chain-of-thought (CoT): 'Think step by step' -- improves reasoning tasks by 10-40%, hurts simple tasks
  • Self-consistency: generate N answers, take majority vote -- reduces random errors by ~15-25%
  • Role prompting: assign expertise persona -- improves domain tasks, marginal for general tasks
  • Decomposition: break complex tasks into subtasks -- most reliable technique for complex problems
  • Not all techniques work for all tasks -- always measure improvement on your specific use case
  • Combining techniques (CoT + self-consistency) often gives better results than either alone

Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting asks the model to show its reasoning before giving the final answer. The seminal paper by Wei et al. (2022) showed this improves performance on reasoning tasks by 10-40%. The simplest form is adding 'Think step by step' to the prompt. But CoT is not a universal improvement -- it helps reasoning tasks and can actually hurt simple tasks.
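In practice the technique is just a prompt-construction choice. A minimal sketch, assuming a hypothetical `build_prompt` helper (the question text and wording are illustrative, not from any specific library):

```python
def build_prompt(question: str, use_cot: bool) -> str:
    """Return either a direct prompt or a zero-shot chain-of-thought prompt."""
    if use_cot:
        # The CoT trigger: ask the model to reason aloud before answering.
        return f"{question}\n\nThink step by step, then state the final answer."
    # Direct form: for simple tasks, skip the reasoning preamble entirely.
    return f"{question}\n\nAnswer with only the final result."

question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"
direct = build_prompt(question, use_cot=False)
cot = build_prompt(question, use_cot=True)
```

The only difference is the trailing instruction, which is why it is cheap to A/B test the two forms on the same question set.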

Chain-of-thought: when it helps and when it doesn't
Task type               CoT improvement    Recommendation
Math word problems      +20-40%            Always use CoT
Multi-step reasoning    +15-30%            Always use CoT
Code debugging          +10-20%            Use CoT for complex bugs
Simple classification   0% or negative     Skip CoT
Factual Q&A             0-5%               Usually skip CoT
Creative writing        Negative           Skip -- reduces creativity
CoT increases cost and latency

Chain-of-thought generates 3-10x more output tokens. At $8/1M output tokens for GPT-5.4, this adds up quickly. Only use CoT when it demonstrably improves accuracy on your task. For high-volume classification, the extra tokens from CoT can double or triple your costs without improving results.
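A back-of-envelope check on those numbers (the $8/1M rate and the 10x token multiplier come from the text above; the one-million-requests-per-month workload is an assumed example):

```python
PRICE_PER_OUTPUT_TOKEN = 8.00 / 1_000_000  # $8 per 1M output tokens

def monthly_output_cost(requests: int, tokens_per_response: int) -> float:
    """Output-token spend for a month of traffic."""
    return requests * tokens_per_response * PRICE_PER_OUTPUT_TOKEN

# Hypothetical classification workload: 1M requests/month.
baseline = monthly_output_cost(1_000_000, 20)       # terse label only
with_cot = monthly_output_cost(1_000_000, 20 * 10)  # 10x tokens at the high end
print(baseline, with_cot)  # 160.0 vs 1600.0
```

A $160/month job becomes $1,600/month, which is exactly the "without improving results" scenario to avoid on simple classification.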

Self-Consistency (Majority Vote)

Self-consistency generates multiple responses at higher temperature, then takes the majority answer. The intuition is that correct reasoning paths are more likely to agree than incorrect ones. This technique is especially effective when combined with chain-of-thought.

Self-consistency implementation with quality measurement
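A sketch of the core loop, with agreement rate as the quality signal. The `sample_answer` stub stands in for a real model call at elevated temperature; everything about it is illustrative:

```python
import random
from collections import Counter

def sample_answer(prompt: str) -> str:
    """Stub for one model call at elevated temperature.
    A real implementation would call your LLM API here."""
    # Simulated behavior: mostly correct, with occasional random slips.
    return random.choice(["42", "42", "42", "41", "42"])

def self_consistent_answer(prompt: str, n: int = 9) -> tuple[str, float]:
    """Sample n answers and return the majority vote plus its agreement rate.
    Low agreement is itself useful: it flags ambiguous or too-hard inputs."""
    votes = Counter(sample_answer(prompt) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n

answer, agreement = self_consistent_answer("What is 6 x 7?")
```

Using an odd `n` avoids ties on binary tasks, and logging `agreement` over time gives a free confidence metric for routing low-agreement cases to human review.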
When to use self-consistency

Self-consistency works best for tasks with a single correct answer (math, classification, extraction). It is less useful for open-ended generation (writing, summarization) where there is no 'majority' answer. The cost is N times inference, so reserve it for high-stakes decisions where accuracy is worth the extra spend.

Role Prompting and Persona Setting

Role prompting assigns the model a specific persona or expertise. It works because the model adjusts its vocabulary, depth of explanation, and decision-making patterns based on the role. The effect is strongest when the role is specific and relevant to the task.

Role prompting impact on code review quality
Role specificity matters

"You are a Python expert" is okay. "You are a senior Python engineer specializing in async programming and performance optimization with 10 years of experience" is much better. The more specific the role, the more the model adjusts its behavior. Include specific expertise areas, experience level, and the lens through which it should evaluate things.

Step-by-Step Decomposition

Task decomposition breaks a complex problem into smaller, manageable subtasks. This is the most reliable technique for complex problems because it reduces the cognitive load on each LLM call and makes the output more predictable and debuggable.

Decomposing a complex analysis into steps
  • Each subtask gets a focused prompt, reducing the chance of the model losing track of requirements
  • Intermediate outputs can be validated before proceeding to the next step
  • Failed steps can be retried without rerunning the entire pipeline
  • Different steps can use different models (cheap for extraction, expensive for analysis)
  • The approach is more debuggable -- you can inspect each intermediate result
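The bullets above describe a pipeline shape that can be sketched directly. `call_llm` is a stand-in for your real model client, and the model names and validation/retry policy are assumed for illustration:

```python
from typing import Callable

def call_llm(prompt: str, model: str = "cheap-model") -> str:
    """Stub for a model call; swap in your real client here."""
    return f"[{model} output for: {prompt[:30]}...]"

def run_step(name: str, prompt: str, model: str,
             validate: Callable[[str], bool], max_retries: int = 2) -> str:
    """Run one focused subtask; validate its output before proceeding,
    and retry the single step (not the whole pipeline) on failure."""
    for _attempt in range(max_retries + 1):
        output = call_llm(prompt, model=model)
        if validate(output):
            return output
    raise RuntimeError(f"Step '{name}' failed validation after retries")

# Extraction is cheap; analysis routes to a stronger model.
facts = run_step("extract", "Extract key figures from this report: ...",
                 model="cheap-model", validate=lambda o: len(o) > 0)
analysis = run_step("analyze", f"Analyze these figures: {facts}",
                    model="strong-model", validate=lambda o: len(o) > 0)
```

Each intermediate value (`facts`, `analysis`) can be logged and inspected, which is where the debuggability claim comes from.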
Decomposition heuristic

If your prompt is longer than 500 tokens of instructions (not counting data), it is probably trying to do too much in one call. Break it into steps. If you find yourself writing 'also' or 'additionally' more than twice in a prompt, that is a sign you should decompose.
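That heuristic is mechanical enough to run as a lint over prompt templates. The ~4-characters-per-token estimate is a rough assumption, not an exact tokenizer:

```python
import re

def should_decompose(instructions: str) -> bool:
    """Flag prompts that are probably doing too much in one call."""
    approx_tokens = len(instructions) / 4  # rough heuristic: ~4 chars per token
    conjunctions = len(re.findall(r"\b(also|additionally)\b", instructions, re.I))
    return approx_tokens > 500 or conjunctions > 2

short = "Classify the sentiment of the following review as positive or negative."
print(should_decompose(short))  # False
```

For exact counts, swap the character estimate for your model's real tokenizer.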

Measuring Technique Effectiveness

Every technique has a cost (tokens, latency, complexity) and a benefit (accuracy improvement). You should measure both before committing to a technique in production.

A/B testing prompt techniques
Statistical significance matters

With only 50 test cases, a 5% accuracy difference might not be statistically significant. Use at least 100 test cases for reliable comparisons, and apply a statistical test (McNemar's test for paired comparisons) before concluding that one technique is better than another.
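A minimal McNemar's test for paired prompt comparisons, using the continuity-corrected chi-square statistic against the conventional 3.841 threshold (p < 0.05, 1 degree of freedom). The example result lists are fabricated for illustration:

```python
def mcnemar_significant(results_a: list[bool], results_b: list[bool],
                        threshold: float = 3.841) -> bool:
    """Paired comparison of two techniques on the same test cases.
    Only discordant pairs (one right, the other wrong) carry information."""
    b = sum(1 for a_ok, b_ok in zip(results_a, results_b) if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in zip(results_a, results_b) if b_ok and not a_ok)
    if b + c == 0:
        return False  # no disagreements, nothing to test
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected statistic
    return chi2 > threshold

# Example: technique B fixes 20 cases A got wrong but breaks 4 that A got right.
a_results = [True] * 60 + [False] * 40
b_results = [True] * 56 + [False] * 4 + [True] * 20 + [False] * 20
print(mcnemar_significant(a_results, b_results))  # True
```

Note that the 80 concordant cases contribute nothing: the verdict rests entirely on the 24 discordant pairs, which is why small test sets are so easy to over-interpret.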

Best Practices

Do

  • Use chain-of-thought for reasoning and math tasks -- it reliably improves accuracy 10-40%
  • Decompose complex tasks into focused subtasks rather than one massive prompt
  • Measure the cost-benefit trade-off of each technique on your specific task
  • Combine techniques when stakes are high: CoT + self-consistency + domain-specific role
  • Use specific, detailed role prompts -- the more specific the expertise, the better the output

Don't

  • Don't use chain-of-thought for simple tasks -- it wastes tokens without improving accuracy
  • Don't apply techniques blindly -- always measure improvement on your data
  • Don't assume techniques that work on benchmarks will work on your production tasks
  • Don't use self-consistency for open-ended generation (writing, brainstorming)
  • Don't stack every technique simultaneously -- each adds cost, and returns diminish quickly

Key Takeaways

  • Chain-of-thought improves reasoning tasks 10-40% but hurts simple tasks and costs 3-10x more tokens.
  • Self-consistency (majority vote across N responses) catches random errors but costs N times inference.
  • Specific role prompting with domain expertise produces measurably better results than generic roles.
  • Task decomposition is the most reliable technique for complex problems -- break big prompts into focused steps.
  • Always measure technique effectiveness on your data with your metrics before committing to production use.

Video on this topic

Prompt engineering techniques that actually work