Does the model temperature setting affect how prompt engineering works?

Yes. Higher temperature increases output variability, which makes prompt evaluation less deterministic — the same prompt can produce different outputs on each run. For tasks requiring consistent outputs, use lower temperature. For tasks where some variability is acceptable or useful, higher temperature may be fine. When evaluating prompts, use a fixed temperature so changes in output quality can be attributed to the prompt rather than random variation.

When should I use a system prompt versus including instructions in the user message?

System prompts are appropriate for persistent instructions that apply across all interactions: the model's role, tone, constraints, and format requirements. User message instructions are appropriate for task-specific context and input. In applications where the system prompt is stable and the user message varies, keeping instructions in the system prompt is cleaner and may be more efficient. In cases where the system prompt cannot be modified, all instructions go in the user message.

Prompt Engineering

How to do prompt engineering

Goal

Write prompts that reliably produce accurate, well-structured outputs from a language model for a defined application task — and iterate systematically when they do not.

Before you start

Access to a language model through a chat interface or API
A clear description of the task you want the model to perform and what a good output looks like
A way to test prompts and observe outputs — either manual review or a scripted evaluation

Steps

1

Start with a direct, complete task specification

The first thing a prompt needs is a clear statement of what the model should do. Vague instructions produce unpredictable outputs. Specify: what the task is, what the model's role is in relation to it, what the output should contain, and any hard constraints. Write the specification as if explaining the task to a competent colleague who has no background in your context. Avoid ambiguous verbs like 'analyze' or 'improve' without specifying what analysis or improvement means for this task. The test for completeness is whether someone reading only the prompt could produce a correct output without asking any clarifying questions.
2

Provide necessary context in the prompt

Models do not have access to information outside the prompt unless it is explicitly included. If the task requires background knowledge — a document to process, a policy to apply, a definition to use — include it. Context should be relevant and concise: including everything you know about a topic often dilutes the signal of what is actually important for the task. When context is long, consider structuring it with clear labels (Background:, Policy:, Example:) so the model can locate relevant information without searching the entire context window.
3

Specify output format explicitly

If you need output in a specific format — JSON, a numbered list, a structured report with specific sections, a specific word count range — state it clearly in the prompt. Models will produce some format by default, but it is rarely the one your downstream system expects. For structured outputs, give an example of the format: models follow format examples more reliably than format descriptions alone. For prose outputs, specify tone, length, and intended audience. If the model will call a tool or the output will be parsed programmatically, format requirements are not optional — a malformed output will cause a downstream failure.
4

Add worked examples for complex tasks

For tasks where the quality criteria are hard to specify in words — tone, judgment, reasoning style — worked examples are more effective than descriptions. Include one or more complete input-output pairs that demonstrate exactly what correct looks like. Examples are particularly useful for tasks involving classification, entity extraction, rewriting, or any case where the boundary between acceptable and unacceptable output is subtle. Keep examples realistic — use representative inputs rather than idealized ones, so the model learns from the distribution it will actually encounter.
5

Test against edge cases, not just typical inputs

A prompt that works on the first test input you tried does not necessarily work on the full range of inputs the model will receive. Test explicitly against: inputs that are ambiguous, inputs that are out of scope, inputs that are very short or very long, and inputs that probe the boundaries of your constraints. Document the inputs that produce unexpected outputs — these reveal where your prompt specification is incomplete or ambiguous. Fix specification gaps in the prompt rather than treating edge case failures as acceptable model behavior.
6

Iterate systematically and document what works

Prompt engineering is iterative: each revision should address a specific observed failure, not introduce changes at random. When a prompt is not working, identify whether the failure is a missing constraint, a missing example, an ambiguous instruction, or a task the model is not capable of. Make one change at a time, test against the same input set, and compare results. Document the prompts that work and why — without documentation, improvements are lost when the person who made them is not available, and the same failure modes get rediscovered repeatedly.

Common pitfalls

Changing multiple prompt elements at once: when something improves or breaks, you cannot determine which change caused it.
Testing only on inputs that look like the expected case: most prompt failures occur on edge cases, not typical inputs.
Using vague positive framing without specifying what it means: instructions like 'be helpful' or 'write clearly' do not give the model actionable guidance.
Treating the first working prompt as final: prompts that work on a small test set often degrade when the model is updated or when the input distribution shifts in production.
Not version-controlling prompts: prompts are part of the application and should be managed with the same rigor as code.

Before you start

Steps

Start with a direct, complete task specification

Provide necessary context in the prompt

Specify output format explicitly

Add worked examples for complex tasks

Test against edge cases, not just typical inputs

Iterate systematically and document what works

Common pitfalls

Frequently asked questions