Testing Prompts 101: A Simple Evaluation Framework You Can Use Today
Most teams treat prompt evaluation like a visit to the dentist: they know they should do it, but they put it off until something hurts. The good news is that measuring prompt quality does not require a data-science degree, a swarm of labelers, or a six-week compliance sprint. A practical prompt evaluation framework has only four moving parts:
- Define what “good” means
- Collect a small, high-value test set
- Score new outputs against the definition
- Track the numbers so you can roll back when things drift
Below is a field-tested, copy-paste-ready version you can run in under an hour with nothing more than a spreadsheet and the same API you already use for completions.
Why Evaluation Matters Even at Small Scale
A prompt that “looks fine” in the playground can quietly fail when the model is updated, when user input shifts, or when a teammate tweaks a single adjective. Without measurements you discover problems only when customers complain or finance asks why the token bill doubled. A lightweight evaluation loop catches issues hours after they are introduced, not weeks.
If you have ever shipped a hot-fix because “the bot suddenly sounds rude,” you already understand the value of prompt testing. Formalizing the process simply makes the safety net predictable.
Step 1: Define Clear Criteria (5 Criteria or Fewer)
Start with business language, not machine-learning jargon. Good criteria are specific, observable, and pass/fail. Examples:
- Accuracy – JSON keys match the expected schema
- Conciseness – summary ≤ 75 words
- Safety – no PII or profanity
- Tone – polite and brand-aligned (rated 4 or 5 by internal reviewer)
- Cost – ≤ 1 000 output tokens per call
Pick the three criteria that matter most for the current prompt. Too many metrics create noise; too few hide problems. Write the definitions in a short Confluence page and link the doc from your prompt repo so every reviewer sees the same bar.
Step 2: Build a 20-Example Test Set
You do not need thousands of examples to surface 80 % of regressions. Twenty well-chosen inputs that cover common cases plus one or two known edge cases are enough for most B2B features.
Where to find them:
- Production logs from the last 7 days (anonymised)
- Support tickets that already have human-written answers
- Synthetic adversarial inputs (swear words, emojis, foreign snippets)
Store the inputs in a CSV: id, input, gold_note. The gold_note column can be the expected JSON, the correct summary, or simply “should refuse.” Commit the file next to the prompt so version control keeps the two in lock-step.
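Loading that file is trivial in any language; here is a minimal Python sketch using the column names suggested above (the file path is illustrative):

```python
import csv

def load_test_set(path):
    """Load the versioned test set: one dict per example row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Each row comes back as a dict like:
# {"id": "1", "input": "...", "gold_note": "should refuse"}
```

Because the loader uses only the standard library, it can run in the same CI job that scores the outputs.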
Step 3: Score with a Simple Rubric
Run the prompt against the 20 examples and record the result in four extra columns:
- criteria_1_pass – 1 or 0
- criteria_2_pass – 1 or 0
- criteria_3_pass – 1 or 0
- total_pass – sum of the above
A quick spreadsheet formula gives you a pass rate (total_pass / 60 for three criteria). Anything above 90 % is usually production-ready; below 85 % triggers a fix or an explicit acceptance of risk.
When human judgement is required (tone, politeness), ask two teammates to rate each output 1–5 and average the scores. If they disagree by more than two points, discuss and re-rate. The exercise takes ten minutes and eliminates most subjective drift.
Automate the mechanical checks (JSON validity, word count, token count) in a tiny script that runs in CI. Most teams use under 30 lines of Python; anything that can be unit-tested should be.
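As a sketch of what those 30 lines might look like, assuming the criteria from Step 1 and that each output is the model's raw string (function names are illustrative, not from any particular library):

```python
import json

MAX_WORDS = 75  # the conciseness bar from Step 1

def check_json_valid(output: str) -> bool:
    """Criterion: output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_concise(output: str) -> bool:
    """Criterion: output is 75 words or fewer."""
    return len(output.split()) <= MAX_WORDS

def pass_rate(outputs, checks):
    """Fraction of (output, criterion) cells that pass,
    i.e. total_pass / (examples x criteria)."""
    total = len(outputs) * len(checks)
    passed = sum(check(o) for o in outputs for check in checks)
    return passed / total
```

Each check returns a plain boolean, so the same functions fill the spreadsheet columns and gate the CI build.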
Step 4: Track Over Time (and Gate Releases)
Store the pass rate as a simple named metric, e.g. prompt_v1.3.2_accuracy = 0.93. Any subsequent change that drops the number below the previous baseline is automatically flagged in the pull request. If you already follow a repeatable prompt workflow, hook the evaluation into the same CI stage that runs unit tests (see our article on repeatable prompt workflows for a full example).
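The release gate itself can be a few lines. A minimal sketch, assuming the previous baseline is available as a float (how it is stored, a JSON file or a CI variable, is up to you):

```python
def gate(current: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Return True when the new pass rate meets or beats the recorded
    baseline (minus an optional explicitly-accepted tolerance)."""
    return current >= baseline - tolerance

# In CI: fail the build on regression.
# if not gate(new_rate, baseline_rate):
#     raise SystemExit("Pass rate regressed below baseline")
```

A non-zero tolerance is how you record an "explicit acceptance of risk" rather than silently lowering the bar.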
Keep a short evaluation report in the repo: date, model version, pass rate, notes. The paper trail is priceless when an auditor—or your future self—asks what changed last month.
A Real Mini-Example
Prompt task: extract company name and invoice total from an email body.
Criteria:
- JSON is syntactically correct
- company_name exists and is ≤ 50 characters
- total_amount is a positive number
Test set: 20 anonymised emails.
Baseline pass rate: 18/20 = 90 %.
A new engineer reformats the prompt to “make it friendlier.” Re-run shows 14/20 = 70 %. The regression is caught before the branch is merged, and the friendly adjectives are moved to a non-critical part of the prompt. Pass rate returns to 90 %, and the feature ships on time.
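The three criteria in this mini-example translate directly into one scoring function (field names match the example above; the function takes the model's raw output string):

```python
import json

def check_invoice_extraction(output: str) -> int:
    """Score one output against the three criteria; returns 0-3 passes."""
    passes = 0
    try:
        data = json.loads(output)  # criterion 1: syntactically valid JSON
        passes += 1
    except json.JSONDecodeError:
        return passes
    name = data.get("company_name")
    if isinstance(name, str) and 0 < len(name) <= 50:  # criterion 2
        passes += 1
    total = data.get("total_amount")
    # criterion 3: positive number (exclude booleans, a JSON quirk in Python)
    if isinstance(total, (int, float)) and not isinstance(total, bool) and total > 0:
        passes += 1
    return passes
```

Summing the return value over all 20 emails and dividing by 60 reproduces the 18/20-style pass rate from Step 3.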
Common Pitfalls and How to Avoid Them
Pitfall: Letting the test set grow forever
Fix: Cap at 50 examples. When you add five new ones, remove the five oldest to keep maintenance sane.
Pitfall: Only testing on synthetic data
Fix: Keep at least 50 % real user inputs so the distribution matches production.
Pitfall: Ignoring cost
Fix: Add a token-count assertion and fail the build if the median jumps > 15 %.
Pitfall: Re-using the same 20 examples for a year
Fix: Schedule a calendar reminder every quarter to swap in fresh logs.
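The cost assertion from the third pitfall fits in the same CI script. A sketch using the 15 % threshold suggested above (the function name is illustrative):

```python
from statistics import median

def cost_regressed(new_token_counts, baseline_median, threshold=0.15):
    """True when the median output-token count jumped more than
    `threshold` (15% by default) over the recorded baseline."""
    return median(new_token_counts) > baseline_median * (1 + threshold)
```

Median rather than mean keeps one pathological verbose output from failing the whole build.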
Tools That Speed Things Up
You can stay in Excel and Bash, but purpose-built tooling removes friction:
- Prompt Repo eval runner – bulk-executes prompts against the test set and returns the pass-rate KPIs
Choose the lightest stack that makes skipping evaluation harder than doing it.
Checklist: Is Your Evaluation Framework “Good Enough”?
- Criteria are written down and numbered ≤ 5
- Test set is version-controlled alongside the prompt
- At least one metric is automated (JSON, word count, cost)
- Pass rate is recorded for every prompt version
- Regression triggers an automatic warning in CI/CD
If you tick all five boxes, your prompt evaluation is already more rigorous than most teams'.
Takeaway
Prompt evaluation does not need to be fancy; it needs to be consistent. Define what “good” looks like, keep a small test set, score every change, and track the numbers. Do that and prompt quality stops being a guessing game. It becomes a team habit, and your AI features ship with the same confidence you expect from any other production code.