Techniques That Work
Evidence-based prompt engineering techniques: chain-of-thought reasoning, self-consistency, role prompting, and step-by-step decomposition. When each technique helps, when it hurts, and how to measure the improvement.
Quick Reference
- Chain-of-thought (CoT): 'Think step by step' -- improves reasoning tasks by 10-40%, hurts simple tasks
- Self-consistency: generate N answers, take the majority vote -- reduces random errors by ~15-25%
- Role prompting: assign an expertise persona -- improves domain tasks, marginal for general tasks
- Decomposition: break complex tasks into subtasks -- the most reliable technique for complex problems
- Not all techniques work for all tasks -- always measure improvement on your specific use case
- Combining techniques (CoT + self-consistency) often gives better results than either alone
Chain-of-Thought Reasoning
Chain-of-thought (CoT) prompting asks the model to show its reasoning before giving the final answer. The seminal paper by Wei et al. (2022) showed this improves performance on reasoning tasks by 10-40%. The simplest form is adding 'Think step by step' to the prompt. But CoT is not a universal improvement -- it helps reasoning tasks and can actually hurt simple tasks.
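In practice, the zero-shot CoT trigger is just an extra instruction appended to the prompt. A minimal sketch (the helper name and exact wording are illustrative, not from any specific library):

```python
def build_prompt(question: str, use_cot: bool = True) -> str:
    """Build a prompt, optionally adding a zero-shot chain-of-thought trigger."""
    prompt = f"Question: {question}\n"
    if use_cot:
        # The CoT trigger: ask for reasoning first, final answer last.
        prompt += "Think step by step, then give the final answer on the last line.\n"
    prompt += "Answer:"
    return prompt
```

Keeping `use_cot` as a flag makes it trivial to A/B the same task with and without the trigger and measure whether CoT earns its extra tokens.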
| Task type | CoT improvement | Recommendation |
|---|---|---|
| Math word problems | +20-40% | Always use CoT |
| Multi-step reasoning | +15-30% | Always use CoT |
| Code debugging | +10-20% | Use CoT for complex bugs |
| Simple classification | 0% or negative | Skip CoT |
| Factual Q&A | 0-5% | Usually skip CoT |
| Creative writing | Negative | Skip -- reduces creativity |
Chain-of-thought generates 3-10x more output tokens. At $8/1M output tokens for GPT-5.4, this adds up quickly. Only use CoT when it demonstrably improves accuracy on your task. For high-volume classification, the extra tokens from CoT can double or triple your costs without improving results.
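To see how the overhead compounds, here is a back-of-the-envelope cost sketch. The $8/1M rate comes from the paragraph above; the per-call token counts are illustrative assumptions:

```python
PRICE_PER_M_OUTPUT = 8.00  # $ per 1M output tokens, as quoted above

def output_cost(calls: int, tokens_per_call: int) -> float:
    """Total output-token spend in dollars for a batch of calls."""
    return calls * tokens_per_call * PRICE_PER_M_OUTPUT / 1_000_000

# 100k classification calls: ~20 output tokens terse vs. ~150 with CoT
direct = output_cost(100_000, 20)      # $16.00
with_cot = output_cost(100_000, 150)   # $120.00
```

A 7.5x cost increase is only justified if accuracy measurably improves on your task.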
Self-Consistency (Majority Vote)
Self-consistency generates multiple responses at higher temperature, then takes the majority answer. The intuition is that correct reasoning paths are more likely to agree than incorrect ones. This technique is especially effective when combined with chain-of-thought.
Self-consistency works best for tasks with a single correct answer (math, classification, extraction). It is less useful for open-ended generation (writing, summarization), where there is no 'majority' answer. The cost is N inference calls instead of one, so reserve it for high-stakes decisions where the accuracy gain is worth the extra spend.
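The mechanics are simple: sample N answers and count. A sketch, where `ask` stands in for any model call made at temperature > 0:

```python
from collections import Counter

def self_consistent_answer(ask, question: str, n: int = 5):
    """Sample n answers and return the majority answer with its agreement rate."""
    answers = [ask(question) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n

# Demo with a stubbed model that answers correctly 4 times out of 5:
responses = iter(["42", "42", "41", "42", "42"])
answer, agreement = self_consistent_answer(lambda q: next(responses), "What is 6*7?")
# answer == "42", agreement == 0.8
```

The agreement rate is a useful byproduct: a low value signals an ambiguous case worth escalating to a stronger model or a human.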
Role Prompting and Persona Setting
Role prompting assigns the model a specific persona or expertise. It works because the model adjusts its vocabulary, depth of explanation, and decision-making patterns based on the role. The effect is strongest when the role is specific and relevant to the task.
"You are a Python expert" is okay. "You are a senior Python engineer specializing in async programming and performance optimization with 10 years of experience" is much better. The more specific the role, the more the model adjusts its behavior. Include specific expertise areas, experience level, and the lens through which it should evaluate things.
Step-by-Step Decomposition
Task decomposition breaks a complex problem into smaller, manageable subtasks. This is the most reliable technique for complex problems because it reduces the cognitive load on each LLM call and makes the output more predictable and debuggable.
- Each subtask gets a focused prompt, reducing the chance of the model losing track of requirements
- Intermediate outputs can be validated before proceeding to the next step
- Failed steps can be retried without rerunning the entire pipeline
- Different steps can use different models (cheap for extraction, expensive for analysis)
- The approach is more debuggable -- you can inspect each intermediate result
If your prompt is longer than 500 tokens of instructions (not counting data), it is probably trying to do too much in one call. Break it into steps. If you find yourself writing 'also' or 'additionally' more than twice in a prompt, that is a sign you should decompose.
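The properties listed above can be sketched as a three-step pipeline. `call_model` is a hypothetical wrapper around your LLM client; the prompts and the validation check are illustrative:

```python
def run_pipeline(document: str, call_model) -> str:
    """Decompose a review task into extract -> analyze -> summarize steps."""
    # Step 1: extraction -- a cheap model is usually enough here
    facts = call_model("Extract the key facts as a bullet list:\n" + document)
    if not facts.strip():
        # Validate the intermediate output; on failure, retry this step alone
        raise ValueError("extraction returned nothing; retry step 1")
    # Step 2: analysis builds only on the validated intermediate output
    analysis = call_model("Analyze these facts for inconsistencies:\n" + facts)
    # Step 3: a short, focused summarization prompt
    return call_model("Summarize the analysis in two sentences:\n" + analysis)
```

Because every intermediate result is a plain string, each step can be logged, inspected, and retried independently.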
Measuring Technique Effectiveness
Every technique has a cost (tokens, latency, complexity) and a benefit (accuracy improvement). You should measure both before committing to a technique in production.
With only 50 test cases, a 5% accuracy difference might not be statistically significant. Use at least 100 test cases for reliable comparisons, and apply a statistical test (McNemar's test for paired comparisons) before concluding that one technique is better than another.
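An exact McNemar's test needs only the discordant pairs -- cases where exactly one prompt variant was correct. A self-contained sketch using the exact binomial form, assuming each list holds per-example True/False correctness flags:

```python
from math import comb

def mcnemar_exact(correct_a: list, correct_b: list) -> float:
    """Two-sided exact McNemar p-value from paired correctness flags."""
    b = sum(a and not o for a, o in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(o and not a for a, o in zip(correct_a, correct_b))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    # Under the null of equal accuracy, discordant pairs are Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A p-value above 0.05 means the observed accuracy gap is consistent with chance, so the cheaper technique should win by default.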
Best Practices
Do
- Use chain-of-thought for reasoning and math tasks -- it reliably improves accuracy 10-40%
- Decompose complex tasks into focused subtasks rather than one massive prompt
- Measure the cost-benefit trade-off of each technique on your specific task
- Combine techniques when stakes are high: CoT + self-consistency + a domain-specific role
- Use specific, detailed role prompts -- the more specific the expertise, the better the output
Don’t
- Don't use chain-of-thought for simple tasks -- it wastes tokens without improving accuracy
- Don't apply techniques blindly -- always measure improvement on your data
- Don't assume techniques that work on benchmarks will work on your production tasks
- Don't use self-consistency for open-ended generation (writing, brainstorming)
- Don't stack every technique simultaneously -- each adds cost, and returns diminish quickly
Key Takeaways
- Chain-of-thought improves reasoning tasks 10-40% but hurts simple tasks and costs 3-10x more tokens.
- Self-consistency (majority vote across N responses) catches random errors but costs N times the inference.
- Specific role prompting with domain expertise produces measurably better results than generic roles.
- Task decomposition is the most reliable technique for complex problems -- break big prompts into focused steps.
- Always measure technique effectiveness on your data with your metrics before committing to production use.