Getting LLMs to Actually Listen: My Prompt Engineering Journey
I spent the first few weeks on the wrong problem. I was tweaking model parameters, swapping providers, reading benchmarks, convinced that the right model would fix everything.
It didn't. The model was fine. The prompt was a mess. And everything downstream paid for it. This is what I wish someone had told me at the start.
How it actually works
User Input (raw query) → Query Rewrite (normalize) → Vector Search (top-k chunks) → Metadata (user + session) → Context Assembly (rank + trim) → LLM (temp + prompt) → Validate (cite or abstain)
The input you get isn't the input you want
People don't write precise queries. They write how they think: loosely, with gaps, assuming the system knows what they mean. "That pricing thing from a few months ago" is a real thing a real user typed. Sent straight to the model, it retrieves noise. Rewritten first, it retrieves exactly what they wanted.
I added a lightweight rewrite step before retrieval. A small model call that turns vague input into a clean, self-contained query. Felt like overhead at first. Turned out to be one of the highest-leverage things I did.
SYSTEM: You are a query rewriter. Convert the user's casual question into a precise,
self-contained retrieval statement. Remove vague references. Be specific.
USER: "that pricing thing from a few months ago"
REWRITTEN: "Retrieve documents related to pricing models or pricing strategy
updated within the last 6 months."
Cleaner query → better retrieval → less noise in context → fewer wrong answers. Every step downstream improves.
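Here is a minimal sketch of that rewrite step in Python. It assumes you already have some function that sends chat messages to a small model and returns text; I call it `complete` here, and that name, along with `build_rewrite_messages` and `rewrite_query`, is hypothetical rather than any provider's API. The system prompt is the one shown above.

```python
from typing import Callable

REWRITE_SYSTEM_PROMPT = (
    "You are a query rewriter. Convert the user's casual question into a precise, "
    "self-contained retrieval statement. Remove vague references. Be specific."
)


def build_rewrite_messages(raw_query: str) -> list[dict[str, str]]:
    """Build the chat messages for the rewrite call."""
    return [
        {"role": "system", "content": REWRITE_SYSTEM_PROMPT},
        {"role": "user", "content": raw_query.strip()},
    ]


def rewrite_query(
    raw_query: str,
    complete: Callable[[list[dict[str, str]]], str],
) -> str:
    """Return the rewritten query, or the trimmed raw query if the model returns nothing.

    `complete` is a placeholder for whatever client call you use (an assumption).
    """
    rewritten = complete(build_rewrite_messages(raw_query)).strip()
    return rewritten or raw_query.strip()
```

Passing the model call in as a function keeps the rewrite step easy to test with a stub and easy to swap between providers. The fallback to the raw query is a design choice of this sketch: a failed rewrite should degrade retrieval, not break it.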
Don't ask the model to invent what already exists
The query rewrite feeds directly into retrieval. Before the model generates anything, I check what's already in the vector database. Embed the query, run similarity search, pull the top chunks. If the answer lives somewhere in docs, notes, or past outputs, give it to the model directly. Let it reason over real text instead of making something up.
The similarity threshold is something you tune, not set once and forget. Too low and you get unrelated chunks clogging context. Too high and you miss things that are relevant but phrased differently. I tend to land around 0.75–0.82 cosine similarity, always retrieve a few extra, then re-rank before passing anything to the model.
Query: "What's the refund policy for enterprise plans?"
Threshold 0.55 (too low):
[0.57] "We offer refunds in certain situations..." ← wrong plan type
[0.61] "Enterprise billing is processed annually..." ← off-topic
[0.78] "Enterprise refunds are processed within 30 days..." ← correct
[0.81] "Cancellation terms for enterprise accounts..." ← also relevant
Threshold 0.78 (tuned):
[0.78] "Enterprise refunds are processed within 30 days..."
[0.81] "Cancellation terms for enterprise accounts..."
At the lower threshold, the right chunk is in there, buried under noise. The model sees all four and sometimes blends them. Tune the threshold and you give it less to get confused by.
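The threshold, over-retrieve, and re-rank steps can be sketched roughly like this. It is plain Python with no vector database involved; the function names and the `score_fn` re-ranker hook are my own placeholders, and in practice the similarity scores would come from your vector store rather than being computed by hand.

```python
import math
from typing import Callable, Sequence

Scored = tuple[float, str]  # (cosine similarity, chunk text)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_candidates(
    scored: Sequence[Scored], threshold: float, top_k: int, extra: int = 2
) -> list[Scored]:
    """Keep chunks at or above the threshold, best first, plus a few extras for re-ranking."""
    kept = sorted((s for s in scored if s[0] >= threshold), key=lambda s: s[0], reverse=True)
    return kept[: top_k + extra]


def rerank(
    query: str,
    candidates: Sequence[Scored],
    score_fn: Callable[[str, str], float],
    top_k: int,
) -> list[Scored]:
    """Re-order candidates with a second scorer and keep the best top_k.

    `score_fn(query, text)` is a stand-in for whatever re-ranker you use (an assumption).
    """
    return sorted(candidates, key=lambda c: score_fn(query, c[1]), reverse=True)[:top_k]
```

With the scores from the refund example, `select_candidates(scored, threshold=0.78, top_k=2)` keeps only the 0.81 and 0.78 chunks, while a threshold of 0.55 lets the two noisy ones through as well.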
More context made my outputs worse
Good chunks don't automatically mean a good prompt. How you arrange them matters just as much as which ones you pick. I had access to a 128k-token model. I used almost all of it. Quality dropped. Turns out there's a well-documented "lost in the middle" problem. Models pay close attention to the beginning and end of a prompt and tend to skim everything in between. I was burying the most relevant information in the middle.
Now I treat context as a curation problem, not a capacity problem. Most relevant chunk first. Weak chunks trimmed out. System prompt broken into three clear sections: who the model is, what it must not do, and what format to respond in. That order matters.
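As a rough sketch of that curation step: sort by relevance, drop weak chunks, stop at a token budget, and lay the prompt out in a fixed order. The `min_score` and `max_tokens` defaults are illustrative, and the whitespace word count is a crude stand-in for a real tokenizer; all of these are assumptions of the sketch, not measured values.

```python
SYSTEM_PROMPT = """[WHO YOU ARE]
You are a precise assistant that answers only from the provided context.
[WHAT YOU MUST NOT DO]
Do not infer, speculate, or answer from prior knowledge.
If the answer isn't in the context, say so.
[OUTPUT FORMAT]
Respond in plain sentences. Cite the source chunk by index."""


def estimate_tokens(text: str) -> int:
    """Assumption: whitespace word count as a cheap proxy for a tokenizer."""
    return len(text.split())


def assemble_context(chunks, min_score=0.75, max_tokens=2000) -> str:
    """Order (score, text) chunks best first, drop weak ones, stop at the token budget."""
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    lines, used = [], 0
    for score, text in ranked:
        if score < min_score:
            break  # list is sorted, so everything after this is weaker
        cost = estimate_tokens(text)
        if used + cost > max_tokens:
            break  # budget reached; keep the strongest chunks only
        lines.append(f"[Chunk {len(lines) + 1}] {text}")
        used += cost
    return "\n".join(lines)


def build_prompt(query: str, chunks, min_score=0.75, max_tokens=2000) -> str:
    """Lay out system prompt, curated context, then the cleaned query, in that order."""
    context = assemble_context(chunks, min_score, max_tokens)
    return f"SYSTEM:\n{SYSTEM_PROMPT}\nCONTEXT:\n{context}\nUSER: {query}"
```

The chunk indices are assigned after sorting and trimming, so "Chunk 1" is always the most relevant passage that survived.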
SYSTEM:
[WHO YOU ARE]
You are a precise assistant that answers only from the provided context.
[WHAT YOU MUST NOT DO]
Do not infer, speculate, or answer from prior knowledge.
If the answer isn't in the context, say so.
[OUTPUT FORMAT]
Respond in plain sentences. Cite the source chunk by index.
CONTEXT:
[Chunk 1] ...most relevant passage...
[Chunk 2] ...second most relevant...
USER: <cleaned query here>
Hallucination is a prompt problem, not a model problem
Even with curated context in the right order, the model will still fabricate if you give it room to. Every time I've chased down a hallucination, there's been a prompt decision behind it. No grounding context provided so the model invented one. Conflicting chunks so it averaged them. An ambiguous instruction so fabrication was the easiest path forward.
The fix that helped most: tell the model explicitly what to do when it doesn't know. This "cite-or-abstain" instruction, once you name it and commit to it, cut my hallucination rate more than switching models ever did.
You must answer using only the context provided below.
If the context does not contain enough information to answer confidently,
respond with: "I don't have enough information to answer this accurately."
Do not guess. Do not infer from general knowledge. Do not fabricate.
Beyond that, a few things that consistently helped:
- Chain-of-thought: Ask the model to reason before it answers. Wrong answers almost always have a reasoning gap you can catch early.
- Output validation: For production, run a lightweight check: does the answer reference the retrieved context, or did it drift? Flag and retry if not.
- Eval over a golden dataset: A golden dataset is a fixed set of inputs with known correct outputs that you curate manually. Run every prompt change against it. You can't reduce hallucination rate if you're not measuring it, and model updates can silently regress behavior you thought was stable.
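The validation and golden-dataset ideas above can be sketched in a few lines. This assumes the model cites chunks in the form `[Chunk N]`, which is a formatting convention I picked for the sketch; the function names and the substring-match scoring are also illustrative. Note that this only checks that a citation exists and points at a real chunk, not that the cited chunk actually supports the claim.

```python
import re
from typing import Callable, Sequence

ABSTAIN = "I don't have enough information to answer this accurately."
CITATION = re.compile(r"\[Chunk (\d+)\]")  # assumed citation format


def is_grounded(answer: str, num_chunks: int) -> bool:
    """Accept an answer if it abstains, or cites only chunk indices that exist."""
    if answer.strip() == ABSTAIN:
        return True
    cited = [int(n) for n in CITATION.findall(answer)]
    return bool(cited) and all(1 <= n <= num_chunks for n in cited)


def answer_with_retry(generate: Callable[[], str], num_chunks: int, max_attempts: int = 2) -> str:
    """Call the model up to max_attempts times; abstain if no attempt passes the check."""
    for _ in range(max_attempts):
        answer = generate()
        if is_grounded(answer, num_chunks):
            return answer
    return ABSTAIN


def run_golden_eval(
    golden: Sequence[tuple[str, str]], answer_fn: Callable[[str], str]
) -> float:
    """Return the fraction of (query, expected substring) pairs the pipeline gets right."""
    if not golden:
        return 0.0
    passed = sum(1 for query, expected in golden if expected.lower() in answer_fn(query).lower())
    return passed / len(golden)
```

Running `run_golden_eval` before and after every prompt change gives you a single number to compare. Substring matching is the simplest possible scorer; a stricter comparison may be worth it once the dataset grows.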
Temperature isn't a preference, it's a decision
You've handled input, retrieval, context, and hallucination. The last prompt-level decision is how consistent you want the output to be. Most people pick a temperature and never change it. I did the same thing for too long. Temperature isn't about style. It directly controls whether your outputs are consistent or unpredictable, and the right value is task-specific.
- 0 – 0.2 (Factual / Structured): SQL, JSON, extraction. Same input should produce the same output, every time.
- 0.3 – 0.5 (Rewriting / Analysis): Summaries, rewrites. A little variation makes it feel more natural without losing accuracy.
- 0.7 – 1.0 (Creative / Brainstorm): Ideas, exploration. This is where you actually want the model to surprise you.
One more thing: at low temperature, the model mirrors format almost exactly. Show it a JSON example in the prompt and it returns JSON consistently. That predictability is a feature, so use it.
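One way to make temperature an explicit decision is to keep it in a small lookup keyed by task type. The task labels here are made up for the sketch, and the values are just points inside the ranges above; treat them as starting points to tune, not recommendations.

```python
# Illustrative task labels (assumptions); values sit inside the ranges listed above.
TEMPERATURE_BY_TASK = {
    "extraction": 0.1,  # factual / structured: SQL, JSON, extraction
    "summary": 0.4,     # rewriting / analysis
    "brainstorm": 0.8,  # creative / exploration
}


def temperature_for(task: str) -> float:
    """Look up the temperature for a task; unknown tasks fail loudly instead of guessing."""
    try:
        return TEMPERATURE_BY_TASK[task]
    except KeyError:
        raise ValueError(f"No temperature chosen for task {task!r}") from None
```

Raising on an unknown task is deliberate in this sketch: a silent default is exactly the "pick once and forget" habit this section argues against.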
Metadata does more work than it gets credit for
Temperature is about output variance. Metadata is about input precision. I think of it in two places: on the chunks in your vector store, and in the prompt itself.
On the retrieval side, every stored chunk should carry enough metadata to filter on and cite from: source document, section heading, date, document type. Without it, the model paraphrases without attribution, you can't verify what it's pulling from, and pre-filtering before retrieval becomes impossible. On the prompt side, pass context about the user and the session: who they are, what they're working on. It sounds like noise but it meaningfully shifts how the model responds.
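A minimal sketch of both sides, assuming a simple in-memory chunk record. The `Chunk` fields mirror the metadata listed above; the class, the function names, and the filter parameters are hypothetical, and a real vector store would apply the filter inside the query instead of in Python.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str      # source document, e.g. a file path
    section: str     # section heading
    updated: date    # document date
    doc_type: str    # document type, e.g. "guide" or "policy"


def prefilter(
    chunks: Sequence[Chunk],
    doc_type: Optional[str] = None,
    updated_since: Optional[date] = None,
) -> list[Chunk]:
    """Narrow the candidate set on metadata before any similarity search runs."""
    return [
        c for c in chunks
        if (doc_type is None or c.doc_type == doc_type)
        and (updated_since is None or c.updated >= updated_since)
    ]


def format_chunk(chunk: Chunk) -> str:
    """Prefix the chunk text with a header the model can cite from."""
    return f'[source: {chunk.source} | section: "{chunk.section}"]\n{chunk.text}'


def format_request_metadata(user_context: str, session_history: str) -> str:
    """Render the per-request metadata block that goes into the prompt."""
    return (
        "SYSTEM METADATA (pass with every request):\n"
        f"- User context: {user_context}\n"
        f"- Session history: {session_history}"
    )
```

Keeping the source and section in the chunk header is what makes attribution checkable: when the model cites a chunk, you can trace the claim back to a specific document and heading.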
SYSTEM METADATA (pass with every request):
- User context: developer, working on search feature
- Session history: asked about embedding models 2 turns ago
RETRIEVED CHUNKS:
[source: docs/embeddings.md | section: "Choosing a model"]
...chunk text...
What about fine-tuning, isn't that just better?
Fine-tuning is different in kind, not just degree. Prompt engineering shapes how a model behaves at runtime. Fine-tuning changes the model itself. You train on thousands of examples until the behavior is baked into the weights, not instructed at call time. Shorter prompts, less hand-holding, consistent tone and format without spelling it out every time.
But it's expensive, brittle when requirements change, and most of the time unnecessary. My rule: exhaust prompt engineering first. If the model still can't do what you need reliably after all of this, that's when fine-tuning earns its cost. More often than not, the problem isn't the weights. It's the instructions.
None of it works in isolation
The reason I kept hitting walls early on was that I was fixing one thing at a time. Better retrieval but a messy prompt. Good context but wrong temperature for the task. Clean query but no metadata to filter on. Each piece helps a little. All of them together is when things actually start working reliably.
Prompt engineering gets a bad reputation. People think of it as trial-and-error word tweaking. It's not. It's designing a system where the model has exactly what it needs, no more, no less, in the right order, with the right constraints. The model is capable. Your job is to set it up to succeed.
What actually stuck
- Rewrite before you retrieve. One small model call upstream cleans the query and improves every step that follows.
- Tune your similarity threshold. Too low and the right answer drowns in noise. Start around 0.75, retrieve a few extra, then re-rank.
- Context is curation, not capacity. Most relevant chunk first. Weak chunks out. More tokens doesn't mean better answers.
- Hallucination is upstream of the model. Give it an explicit fallback, a structured prompt, and something to cite. Most hallucinations trace back to a missing instruction.
- Match temperature to the task. Factual extraction wants 0.1. Summaries want 0.4. Creative wants 0.8. Picking once and leaving it is leaving quality on the table.
- Metadata enables everything else. Attribution, filtering, session context. Without it you're running blind and so is the model.