Prompts Are Code: How to Engineer LLM Prompts for Production Systems

prompt engineering LLM production AI prompt testing prompt versioning context engineering

Stop Tweaking. Start Engineering.

You wouldn't ship a function you only tested by eyeballing the output in a REPL. But that's exactly how most engineers treat prompt development — iterate in ChatGPT, copy-paste when it feels right, and hope it holds up in production.

It won't. Not consistently. Not at scale.

Prompts are runtime instructions for a non-deterministic system that serves thousands of requests. They deserve the same rigor you give your API contracts, your database queries, and your configuration files. That means structure, versioning, and automated testing.

Here's how to get there.


The Anatomy of a Reliable Prompt

Ad-hoc prompts fail because they're ambiguous. The LLM fills in gaps with assumptions — and those assumptions won't match your users' expectations. A structured prompt eliminates the gaps.

Every production prompt should have five components:

Skip any of these and you're writing a wish, not a specification.

python SUPPORT_PROMPT_V2 = """ You are a support assistant for Acme, a B2B invoicing platform.

Context

User plan: {user_plan} Recent error logs: {error_context}

Task

Answer the user's question using only the provided context. If the answer isn't in the context, say: "I don't have enough information to answer that — please contact support@acme.com."

Constraints

Output Format

Plain text. No markdown. No bullet points.

User Question

{user_question} """

Notice what this does: it reads like a spec. Another engineer can look at it and immediately know what this prompt is supposed to do, what it's not supposed to do, and what a correct output looks like.


Version Your Prompts Like Code

A prompt is a production artifact. Treat it like one.

python PROMPT_REGISTRY = { "support_v1": SUPPORT_PROMPT_V1, "support_v2": SUPPORT_PROMPT_V2, # current production }

def get_prompt(name: str) -> str: if name not in PROMPT_REGISTRY: raise ValueError(f"Unknown prompt: {name}") return PROMPT_REGISTRY[name]

This makes rollback trivial and gives you a clear change history.


Test Prompts Systematically

Here's the pattern: build a small golden dataset of inputs and expected outputs, then run every prompt version against it before shipping.

python test_cases = [ { "input": {"user_question": "How do I export invoices?", "user_plan": "pro", "error_context": ""}, "must_contain": ["export"], "must_not_contain": ["I don't know", "competitor"], }, { "input": {"user_question": "What's the weather today?", "user_plan": "free", "error_context": ""}, "must_contain": ["support@acme.com"], # should deflect off-topic questions "must_not_contain": [], }, ]

def run_prompt_tests(prompt_template: str, cases: list) -> dict: results = {"passed": 0, "failed": 0, "failures": []} for case in cases: response = call_llm(prompt_template.format(**case["input"])) passed = all(phrase in response for phrase in case["must_contain"]) and
all(phrase not in response for phrase in case["must_not_contain"]) if passed: results["passed"] += 1 else: results["failed"] += 1 results["failures"].append({"input": case["input"], "response": response}) return results

This isn't a replacement for human review — but it catches regressions automatically, just like unit tests catch broken logic.


The Three Failure Modes to Watch For

Most production prompt failures fall into one of three categories:

  1. Ambiguous instructions"Be helpful" is not a task. "Answer only questions about invoice generation using the provided documentation" is.
  2. Missing constraints — Without explicit guardrails, LLMs will speculate, hallucinate, and go off-topic. Every prompt needs a "If you don't know, say X" clause.
  3. Hallucination triggers — Asking the model to answer questions it has no grounding data for. If your prompt doesn't include the relevant context, the model invents it. Constraints alone don't fix this — you need to provide the context.

The Takeaway

A prompt without structure, versioning, and test coverage isn't an AI feature — it's a liability. Treat your prompts like the production artifacts they are, and your LLM outputs will be as reliable as the rest of your system.