How to do prompt engineering
Write prompts that reliably produce accurate, well-structured outputs from a language model for a defined application task — and iterate systematically when they do not.
Before you start
- Access to a language model through a chat interface or API
- A clear description of the task you want the model to perform and what a good output looks like
- A way to test prompts and observe outputs — either manual review or a scripted evaluation
Steps
- 1
Start with a direct, complete task specification
The first thing a prompt needs is a clear statement of what the model should do. Vague instructions produce unpredictable outputs. Specify: what the task is, what the model's role is in relation to it, what the output should contain, and any hard constraints. Write the specification as if explaining the task to a competent colleague who has no background in your context. Avoid ambiguous verbs like 'analyze' or 'improve' without specifying what analysis or improvement means for this task. The test for completeness is whether someone reading only the prompt could produce a correct output without asking any clarifying questions.
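As a minimal sketch of the four parts listed above (task, role, output contents, hard constraints), the specification can be assembled from named fields so that none is left out by accident. The helper `build_task_prompt` and the email-triage task are hypothetical, chosen only for illustration.

```python
def build_task_prompt(role, task, output_contents, constraints):
    """Assemble a task specification from its four parts (hypothetical helper)."""
    lines = [
        f"Role: {role}",
        f"Task: {task}",
        "The output must contain:",
    ]
    lines += [f"- {item}" for item in output_contents]
    lines.append("Hard constraints:")
    lines += [f"- {rule}" for rule in constraints]
    return "\n".join(lines)


# Illustrative task; every string here is an assumption, not part of the guide.
prompt = build_task_prompt(
    role="You are a support agent triaging incoming customer emails.",
    task="Classify the email below as 'billing', 'technical', or 'other'.",
    output_contents=["the category label", "a one-sentence justification"],
    constraints=["Use only the three labels listed.", "Do not reply to the customer."],
)
print(prompt)
```

Requiring each part as an argument makes a missing constraint or an unspecified output visible before the prompt is ever sent.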
- 2
Provide necessary context in the prompt
A model knows nothing about your specific documents, policies, or definitions unless they appear in the prompt. If the task requires background knowledge (a document to process, a policy to apply, a definition to use), include it. Keep context relevant and concise: including everything you know about a topic often dilutes the signal of what actually matters for the task. When context is long, structure it with clear labels (Background:, Policy:, Example:) so each piece of information is easy to tell apart and easy to refer to from the instructions.
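One possible way to apply the labeling advice is sketched below. The helper `format_context` and the sample policy text are assumptions made for illustration; empty sections are dropped so they do not add noise.

```python
def format_context(sections):
    """Render labeled context sections, skipping empty ones (hypothetical helper)."""
    parts = []
    for label, text in sections.items():
        text = text.strip()
        if text:
            parts.append(f"{label}:\n{text}")
    return "\n\n".join(parts)


# Sample content is invented for the example.
context = format_context({
    "Background": "The customer is on the annual plan.",
    "Policy": "Refunds are allowed within 30 days of purchase.",
    "Example": "",  # nothing to show yet, so this label is omitted
})
print(context)
```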
- 3
Specify output format explicitly
If you need output in a specific format (JSON, a numbered list, a structured report with specific sections, a specific word count range), state it clearly in the prompt. Models will produce some format by default, but nothing guarantees it is the one your downstream system expects. For structured outputs, give an example of the format: models follow format examples more reliably than format descriptions alone. For prose outputs, specify tone, length, and intended audience. If the model will call a tool or the output will be parsed programmatically, treat format requirements as mandatory: a malformed output will cause a downstream failure.
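The sketch below pairs a format instruction that includes an example with a parser that rejects malformed output before it reaches downstream code. The field names `category` and `confidence` are assumptions; only the standard-library `json` module is used.

```python
import json

# Format instruction with a concrete example, meant to be appended to the prompt.
FORMAT_INSTRUCTIONS = (
    "Respond with a single JSON object and nothing else, in this form:\n"
    '{"category": "billing", "confidence": 0.9}'
)

REQUIRED_KEYS = {"category", "confidence"}  # hypothetical schema


def parse_model_output(raw):
    """Return the parsed object, or None if the output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


print(parse_model_output('{"category": "billing", "confidence": 0.9}'))
print(parse_model_output("Sure! Here is the JSON you asked for."))
```

Returning `None` instead of raising lets the caller decide whether to retry, fall back, or log the failure.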
- 4
Add worked examples for complex tasks
For tasks where the quality criteria are hard to specify in words (tone, judgment, reasoning style), worked examples are more effective than descriptions. Include one or more complete input-output pairs that demonstrate exactly what a correct output looks like. Examples are particularly useful for tasks involving classification, entity extraction, rewriting, or any case where the boundary between acceptable and unacceptable output is subtle. Keep examples realistic: use representative inputs rather than idealized ones, so the examples reflect the inputs the model will actually encounter.
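A minimal sketch of a few-shot prompt built from input-output pairs follows. The example emails are invented and deliberately informal, to mirror realistic rather than idealized inputs; `build_few_shot_prompt` is a hypothetical helper.

```python
# Invented examples, written the way real users might type them.
EXAMPLES = [
    ("i was charged twice this month??", "billing"),
    ("app crashes when I open settings", "technical"),
    ("do you have an office in Berlin", "other"),
]


def build_few_shot_prompt(instruction, examples, new_input):
    """Place complete input-output pairs before the input to be processed."""
    blocks = [instruction]
    for text, label in examples:
        blocks.append(f"Input: {text}\nOutput: {label}")
    # The final block is left open for the model to complete.
    blocks.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(blocks)


few_shot_prompt = build_few_shot_prompt(
    "Classify each email as billing, technical, or other.",
    EXAMPLES,
    "refund pls, cancelled last week",
)
print(few_shot_prompt)
```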
- 5
Test against edge cases, not just typical inputs
A prompt that works on the first test input you tried does not necessarily work on the full range of inputs the model will receive. Test explicitly against: inputs that are ambiguous, inputs that are out of scope, inputs that are very short or very long, and inputs that probe the boundaries of your constraints. Document the inputs that produce unexpected outputs — these reveal where your prompt specification is incomplete or ambiguous. Fix specification gaps in the prompt rather than treating edge case failures as acceptable model behavior.
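The edge-case categories above can be turned into a small scripted check, sketched here. `fake_model` is a stand-in for a real model call, and the cases and allowed labels are assumptions; in practice `call_model` would wrap whatever chat interface or API you use.

```python
def run_edge_cases(call_model, prompt_template, cases, allowed_labels):
    """Return (name, input, output) for every case with an unexpected output."""
    failures = []
    for name, text in cases.items():
        output = call_model(prompt_template.format(input=text)).strip()
        if output not in allowed_labels:
            failures.append((name, text, output))
    return failures


def fake_model(prompt):
    """Stand-in for a real model call; not an actual API."""
    return "billing" if "charged" in prompt else "I'm not sure."


CASES = {
    "typical": "I was charged twice.",
    "ambiguous": "It doesn't work and I want my money back.",
    "out_of_scope": "What's the weather today?",
    "very_short": "?",
    "very_long": "help " * 2000,
}

failures = run_edge_cases(
    fake_model,
    "Classify this email: {input}",
    CASES,
    allowed_labels={"billing", "technical", "other"},
)
for name, _, output in failures:
    print(f"{name}: unexpected output {output!r}")
```

The returned list doubles as the documentation of failing inputs that the step calls for.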
- 6
Iterate systematically and document what works
Prompt engineering is iterative: each revision should address a specific observed failure, not introduce changes at random. When a prompt is not working, identify whether the failure is a missing constraint, a missing example, an ambiguous instruction, or a task beyond the model's capabilities. Make one change at a time, test against the same input set, and compare results. Document the prompts that work and why. Without documentation, improvements are lost when the person who made them is not available, and the same failure modes get rediscovered repeatedly.
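A sketch of the one-change-at-a-time loop: two prompt versions that differ by a single constraint are scored against the same input set, and each result is logged alongside the change that produced it. `fake_model`, the test set, and both templates are assumptions made for illustration.

```python
def score_prompt(call_model, prompt_template, test_set):
    """Fraction of test inputs whose output matches the expected label."""
    correct = 0
    for text, expected in test_set:
        if call_model(prompt_template.format(input=text)).strip() == expected:
            correct += 1
    return correct / len(test_set)


def fake_model(prompt):
    # Stand-in: pretends the model adds capitalization and punctuation
    # unless the prompt tells it not to.
    return "billing" if "label only" in prompt else "Billing."


TEST_SET = [("I was charged twice.", "billing"), ("Refund my payment.", "billing")]

V1 = "Classify this email: {input}"
V2 = "Classify this email. Reply with the lowercase label only: {input}"

log = []
for version, template, change in [
    ("v1", V1, "baseline"),
    ("v2", V2, "added output constraint"),
]:
    accuracy = score_prompt(fake_model, template, TEST_SET)
    log.append({"version": version, "change": change, "accuracy": accuracy})

for entry in log:
    print(entry)
```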
Common pitfalls
- Changing multiple prompt elements at once: when something improves or breaks, you cannot determine which change caused it.
- Testing only on inputs that look like the expected case: prompt failures often surface on edge cases that typical inputs never exercise.
- Using vague positive framing without specifying what it means: instructions like 'be helpful' or 'write clearly' do not give the model actionable guidance.
- Treating the first working prompt as final: prompts that work on a small test set often degrade when the model is updated or when the input distribution shifts in production.
- Not version-controlling prompts: prompts are part of the application and should be managed with the same rigor as code.
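For the last pitfall, one lightweight complement to keeping prompts in a repository is to record a content fingerprint with every evaluation result, so a score can always be traced back to the exact prompt text that produced it. This is a sketch under that assumption; `prompt_fingerprint` is a hypothetical helper built on the standard-library `hashlib` module.

```python
import hashlib


def prompt_fingerprint(prompt_text):
    """Short, stable identifier derived from the exact prompt text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


v1 = "Classify this email: {input}"
v2 = "Classify this email. Reply with the lowercase label only: {input}"

# Any edit to the text, however small, yields a different fingerprint.
print(prompt_fingerprint(v1))
print(prompt_fingerprint(v2))
```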