Six months into using LLMs in production, I had a classification pipeline that was wrong about 30% of the time. I’d spent weeks tweaking temperature, swapping models, writing longer system prompts. Nothing stuck. Then I rewrote one prompt using chain-of-thought reasoning and the error rate dropped to around 8% overnight — same model, same temperature, same data.
That experience broke something in my head about what prompt engineering actually is. It’s not about writing clearer instructions. It’s about changing how the model thinks through the problem, not just what it’s supposed to do.
Here’s what I’ve learned since then, including the techniques that genuinely moved the needle and a few that looked promising but wasted a lot of my time.
Chain-of-Thought Isn’t Just “Show Your Work”
The basic version of CoT is well-known at this point: add “think step by step” to your prompt and watch accuracy improve on reasoning tasks. But most developers stop there and leave a lot on the table.
What actually matters is where you surface the reasoning and how structured you make it. I spent about two weeks running variants on a document extraction task (pulling structured fields from messy legal contracts — not glamorous, but real). A bare “think step by step” helped modestly. What really helped was telling the model to reason through each field independently before committing to an answer, with explicit uncertainty markers.
Here’s roughly what the prompt looked like after iteration:
```python
SYSTEM_PROMPT = """
You are extracting structured data from legal contract text.

For each field below, reason through it before writing your answer:
1. Identify what evidence in the document supports this value
2. Note any ambiguity or conflicting signals
3. Then output your best answer with a confidence level (high/medium/low)

If you have low confidence, explain why rather than guessing silently.
"""

USER_PROMPT = """
Contract text:
{contract_text}

Extract the following fields:
- effective_date
- termination_clause_type
- governing_law_jurisdiction
- auto_renewal (yes/no/unclear)
"""
```
The key change was requiring the model to flag its own uncertainty rather than projecting false confidence. Previously I’d get a clean JSON blob that looked great but had quietly hallucinated a governing jurisdiction. Now I get hedged output I can actually route differently — send high-confidence extractions straight through, flag medium/low for human review.
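That routing step is simple to wire up. Here’s a minimal sketch of the idea, assuming extractions arrive as dicts with `value` and `confidence` keys (the field names and shape here are illustrative, not my exact pipeline):

```python
# Confidence-based routing sketch: auto-accept high-confidence fields,
# queue everything else for human review.

def route_extraction(fields: dict) -> dict:
    """Split extracted fields into auto-accept and human-review buckets."""
    accepted, needs_review = {}, {}
    for name, field in fields.items():
        if field.get("confidence") == "high":
            accepted[name] = field["value"]
        else:  # medium/low confidence goes to a reviewer
            needs_review[name] = field
    return {"accepted": accepted, "needs_review": needs_review}
```
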
One thing I noticed: the reasoning trace itself becomes a debugging tool. When a field comes back wrong, I can read the model’s chain of thought and usually see exactly where it went sideways. That’s worth something even beyond the accuracy gain.
Gotcha I hit hard: CoT inflates token usage significantly. On a high-volume pipeline — we were doing around 4,000 documents a day at one point — this is not a rounding error. I ended up stripping the reasoning section from the output using a simple post-processing step and only keeping the structured fields. You get the accuracy benefits without paying for the reasoning tokens in downstream processing. Your situation will vary depending on cost constraints, but don’t assume you have to ship the chain of thought to your users.
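The post-processing step itself is small. A sketch, assuming the model emits its reasoning first and a single JSON object last (that output shape is an assumption about your prompt, not a given):

```python
import json
import re

# Keep only the structured JSON and drop the reasoning text so downstream
# steps don't pay for reasoning tokens. Assumes reasoning precedes one
# JSON object in the raw output (an assumption about your format).

def strip_reasoning(raw_output: str) -> dict:
    """Extract the JSON object from model output, discarding the reasoning."""
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```
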
Few-Shot Examples: The Technique That Works Until It Doesn’t
Few-shot prompting is probably the most misunderstood technique in the toolbox. The common advice is “include 3-5 examples.” That’s fine as a starting point, but it misses the important variables: the diversity of your examples, their proximity to your edge cases, and whether you’re accidentally teaching the model the wrong generalization.
That last one bit me on a sentiment classification task. I’d included five examples in my prompt, all of which happened to be single-sentence reviews. Production data had multi-paragraph reviews with mixed sentiment — positive overall but mentioning specific negatives in the body. My few-shot examples had inadvertently taught the model to anchor on the first sentence. Took me two days to figure out why accuracy cratered on longer inputs.
What fixed it wasn’t adding more examples — it was adding strategically selected examples that represented the failure modes. I specifically included:
- One example where the opening sentence is negative but the overall sentiment is positive
- One that’s sarcastic
- One that’s genuinely mixed and should return “neutral” rather than forcing a classification
This is less about volume and more about coverage. Three great examples beat eight mediocre ones.
Here’s what a well-structured few-shot block looks like for something like intent classification:
```python
FEW_SHOT_EXAMPLES = [
    {
        "input": "Can you help me reset my password? I've tried three times.",
        "reasoning": "User is making a direct request for a specific account action. Frustration implied but the intent is clearly a password reset, not a complaint.",
        "intent": "account_action",
        "confidence": "high",
    },
    {
        "input": "I guess the product works okay but it's not really what I expected from the description.",
        "reasoning": "This is passive dissatisfaction, not an explicit request. User isn't asking for anything specific — more likely venting or leaving feedback.",
        "intent": "feedback",
        "confidence": "medium",  # mixed signals here
    },
    {
        "input": "When will my order arrive? The website says it shipped but the tracking hasn't updated in 5 days.",
        "reasoning": "On the surface this looks like an order status query, but the 5-day stale tracking detail implies a potential lost shipment. Route to shipping support, not generic order status.",
        "intent": "shipping_issue",
        "confidence": "high",
    },
]
```
Notice I’m including reasoning in the examples themselves, not just input/output pairs. This is CoT applied to few-shot — you’re showing the model how to think about classification decisions, not just what the answer is. I started doing this about eight months ago and it’s now standard in everything I build.
One practical note: if you’re using an API with a messages array (OpenAI, Anthropic, etc.), you can format few-shot examples as alternating user/assistant turns rather than stuffing them all into the system prompt. In my experience this produces slightly cleaner behavior, probably because it’s closer to how the model was trained on conversation data. Not a huge difference, but worth knowing.
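Here’s a sketch of that formatting, assuming examples shaped like the `FEW_SHOT_EXAMPLES` dicts above; the role names follow the common messages-array convention, so adapt to your client library:

```python
import json

def build_messages(system_prompt: str, examples: list[dict], user_input: str) -> list[dict]:
    """Format few-shot examples as alternating user/assistant turns."""
    messages = [{"role": "system", "content": system_prompt}]
    for ex in examples:
        messages.append({"role": "user", "content": ex["input"]})
        # The assistant turn demonstrates both the reasoning and the label
        messages.append({
            "role": "assistant",
            "content": json.dumps({
                "reasoning": ex["reasoning"],
                "intent": ex["intent"],
                "confidence": ex["confidence"],
            }),
        })
    messages.append({"role": "user", "content": user_input})
    return messages
```
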
The Techniques I Wish I’d Found Earlier
Self-consistency sampling is criminally underused. The idea: run the same prompt several times (typically 3-5), then take the most common answer. It’s embarrassingly simple and it works — especially for tasks with a single correct answer buried in ambiguous context.
I used this on a legal clause extraction job where the model would occasionally hallucinate a clause that didn’t exist (very bad in a legal context). Running the extraction five times and only surfacing clauses that appeared in at least three responses cut hallucination incidents by roughly 60% in our testing. It’s not cheap — you’re literally paying for 5x the tokens — but for high-stakes, low-volume tasks it’s an easy call.
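The voting step is only a few lines. A sketch, where `call_model` is a placeholder for your actual API call and the thresholds mirror the 3-of-5 rule above:

```python
from collections import Counter

def self_consistent_answer(prompt: str, call_model, n: int = 5, min_votes: int = 3):
    """Run the same prompt n times and return the majority answer.

    call_model is a placeholder for your API call; it returns one answer
    per invocation. Returns None when no answer reaches min_votes, which
    you can treat as "flag for review" rather than surfacing a guess.
    """
    answers = [call_model(prompt) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes >= min_votes else None
```
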
Persona + constraint stacking is another one. This is different from basic role prompting (“you are a senior developer”). The useful version layers constraints that bound the model’s behavior. Example:
You are a senior backend engineer reviewing a junior’s PR. You have strong opinions about code quality but your goal is to be educational, not discouraging. You must raise every issue you see, but frame at least one piece of feedback per issue as a question rather than a directive. Do not approve the PR if there are any security concerns.
Each constraint there does real work. Stacking them creates a behavioral space that’s hard to specify any other way. I’m not 100% sure this scales beyond a certain number of constraints — somewhere around 7-8 I’ve seen the model start dropping some of them — but for 3-5 constraints it’s reliable.
Output format as a constraint, not an afterthought. Most developers (including past me) put format instructions at the end of a prompt as a cleanup step: “…and return your answer as JSON.” That’s late. Specifying format early and being precise about why you need it in that format changes output quality meaningfully.
Compare:
- ❌ “Return your answer as JSON.”
- ✅ “Your output will be parsed programmatically by a Go struct with strict type requirements. Return only valid JSON with no markdown code fences, no trailing commas, and no comments. Use ISO 8601 dates. Null is acceptable for missing values; do not omit keys.”
The specificity signals to the model what kind of precision is required. Vague format instructions get vague compliance.
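Even with strict instructions, defensive parsing on your side is cheap insurance, since models still occasionally wrap output in fences. A sketch, with illustrative key names:

```python
import json

# Defensive parsing sketch: strip stray markdown fences before json.loads
# and check for required keys. Key names here are illustrative.

REQUIRED_KEYS = {"effective_date", "governing_law_jurisdiction"}

def parse_strict_json(raw: str) -> dict:
    """Parse model output as JSON, tolerating accidental code fences."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closer
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(cleaned)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```
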
Where Prompt Engineering Actually Has Limits
Here’s something I don’t see enough developers admit: there are tasks where prompt engineering isn’t the bottleneck and you’re wasting your time optimizing prompts.
If your base model genuinely doesn’t have the domain knowledge you need, no amount of CoT or few-shot examples will save you. I spent three weeks trying to prompt-engineer a model into performing well on very domain-specific pharmaceutical regulatory text. I got incremental improvements but kept hitting a ceiling. The actual fix was RAG — pulling in the relevant regulatory documents as context — not a cleverer prompt.
Similarly, if your task requires consistent multi-step behavior across a long conversation, you’re fighting against context degradation and instruction drift. Prompt engineering techniques that work great at the start of a conversation tend to degrade by turn 15 or 20. I don’t have a great solution for this one beyond shorter context windows and more explicit state management. Your mileage may vary.
The other limit is evaluation. Prompt engineering without a proper eval setup is basically guessing. I see a lot of developers (again, including early me) iterating prompts based on vibes and a handful of anecdotal examples. You need a held-out test set and you need to measure before and after each change. Even a scrappy pytest file with 50 representative examples beats no eval at all.
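A scrappy eval really can be that small. A sketch, where `classify` stands in for your real pipeline call and the labeled examples are illustrative stand-ins for your held-out set:

```python
# Minimal eval harness in the spirit of "a pytest file with 50 examples".
# classify() is a placeholder for your actual pipeline call.

LABELED_EXAMPLES = [
    ("Can you reset my password?", "account_action"),
    ("Where is my order? Tracking is stale.", "shipping_issue"),
    ("The product is okay I guess.", "feedback"),
]

def accuracy(classify, examples) -> float:
    """Fraction of examples the pipeline labels correctly."""
    correct = sum(1 for text, label in examples if classify(text) == label)
    return correct / len(examples)

def test_accuracy_floor():
    # A stub classifier for illustration; swap in your real call and
    # raise the floor as your prompt improves.
    stub = lambda text: "account_action" if "password" in text else "feedback"
    assert accuracy(stub, LABELED_EXAMPLES) >= 0.6
```

Run it before and after every prompt change; a failing floor tells you a “clever” edit was actually a regression.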
What I’d Actually Tell You to Start With
If I’m being direct: start with chain-of-thought with explicit uncertainty markers. It’s the highest ROI technique I’ve found, it generalizes across tasks, and it produces debuggable outputs that make everything downstream easier.
Once that’s working, add few-shot examples — but be surgical about it. Don’t grab five random examples. Specifically target your known failure modes and make your examples diverse enough to not accidentally teach wrong generalizations.
Only reach for self-consistency when you have a high-stakes task and hallucination is genuinely costly. The token overhead makes it a bad default.
Skip persona prompting unless you have a clear behavioral reason for it (like the PR review example above). Cargo-culted role prompting — “you are a helpful AI assistant who is an expert in…” — adds noise without adding value in my experience.
And honestly? The most consistent gains I’ve seen come not from any single technique but from iteration with actual evals. Pick a technique, measure it against real test cases, keep what helps, drop what doesn’t. That’s less exciting than a list of tricks, but it’s what actually works in production.
The prompt that fixed my classification pipeline, by the way, is still running. I haven’t touched it in four months. That’s probably the best endorsement I can give for getting the fundamentals right.