Testing Prompts 101: A Simple Evaluation Framework You Can Use Today
Most teams treat prompt evaluation like a visit to the dentist: they know they should do it, but they put it off until something hurts. The good news is that measuring prompt quality does not require a data-science degree, a swarm of labelers, or a six-week compliance sprint. A practical prompt evaluation framework has only four moving parts:
- Define what “good” means
- Collect a small, high-value test set
- Score new outputs against the definition
- Track the numbers so you can roll back when things drift
Below is a field-tested, copy-paste-ready version you can run in under an hour with nothing more than a spreadsheet and the same API you already use for completions.
Why Evaluation Matters Even at Small Scale
A prompt that “looks fine” in the playground can quietly fail when the model is updated, when user input shifts, or when a teammate tweaks a single adjective. Without measurements you discover problems only when customers complain or finance asks why the token bill doubled. A lightweight evaluation loop catches issues hours after they are introduced, not weeks.
If you have ever shipped a hot-fix because “the bot suddenly sounds rude,” you already understand the value of prompt testing. Formalizing the process simply makes the safety net predictable.
Step 1: Define Clear Criteria (5 Criteria or Fewer)
Start with business language, not machine-learning jargon. Good criteria are specific, observable, and pass/fail. Examples:
- Accuracy – JSON keys match the expected schema
- Conciseness – summary ≤ 75 words
- Safety – no PII or profanity
- Tone – polite and brand-aligned (rated 4 or 5 by internal reviewer)
- Cost – ≤ 1 000 output tokens per call
Pick the three criteria that matter most for the current prompt. Too many metrics create noise; too few hide problems. Write the definitions in a short Confluence page and link the doc from your prompt repo so every reviewer sees the same bar.
Step 2: Build a 20-Example Test Set
You do not need thousands of examples to surface 80 % of regressions. Twenty well-chosen inputs that cover common cases plus one or two known edge cases are enough for most B2B features.
Where to find them:
- Production logs from the last 7 days (anonymised)
- Support tickets that already have human-written answers
- Synthetic adversarial inputs (swear words, emojis, foreign snippets)
Store the inputs in a CSV: id, input, gold_note. The gold_note column can be the expected JSON, the correct summary, or simply “should refuse.” Commit the file next to the prompt so version control keeps the two in lock-step.
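Loading that file is trivial in any language; here is a minimal Python sketch using the column names suggested above (the file path is illustrative):

```python
import csv

def load_test_set(path):
    """Load the versioned test set: one dict per example row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Each row comes back as a dict like:
# {"id": "1", "input": "...", "gold_note": "should refuse"}
```

Because the loader uses only the standard library, it can run in the same CI job that scores the outputs.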
Step 3: Score with a Simple Rubric
Run the prompt against the 20 examples and record the result in four extra columns:
- criteria_1_pass – 1 or 0
- criteria_2_pass – 1 or 0
- criteria_3_pass – 1 or 0
- total_pass – sum of the above
A quick spreadsheet formula gives you a pass rate (total_pass / 60 for three criteria). Anything above 90 % is usually production-ready; below 85 % triggers a fix or an explicit acceptance of risk.
When human judgement is required (tone, politeness), ask two teammates to rate each output 1–5 and average the scores. If they disagree by more than two points, discuss and re-rate. The exercise takes ten minutes and eliminates most subjective drift.
Automate the mechanical checks (JSON validity, word count, token count) in a tiny script that runs in CI. Most teams use under 30 lines of Python; anything that can be unit-tested should be.
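As a sketch of what those 30 lines might look like, assuming the criteria from Step 1 and that each output is the model's raw string (function names are illustrative, not from any particular library):

```python
import json

MAX_WORDS = 75  # the conciseness bar from Step 1

def check_json_valid(output: str) -> bool:
    """Criterion: output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_concise(output: str) -> bool:
    """Criterion: output is 75 words or fewer."""
    return len(output.split()) <= MAX_WORDS

def pass_rate(outputs, checks):
    """Fraction of (output, criterion) cells that pass,
    i.e. total_pass / (examples x criteria)."""
    total = len(outputs) * len(checks)
    passed = sum(check(o) for o in outputs for check in checks)
    return passed / total
```

Each check returns a plain boolean, so the same functions fill the spreadsheet columns and gate the CI build.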
Step 4: Track Over Time (and Gate Releases)
Store the pass rate as a simple named metric, e.g. prompt_v1.3.2_accuracy = 0.93. Any subsequent change that drops the number below the previous baseline is automatically flagged in the pull request. If you already follow a repeatable prompt workflow, hook the evaluation into the same CI stage that runs unit tests (see our article on repeatable prompt workflows for a full example).
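The release gate itself can be a few lines. A minimal sketch, assuming the previous baseline is available as a float (how it is stored, a JSON file or a CI variable, is up to you):

```python
def gate(current: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Return True when the new pass rate meets or beats the recorded
    baseline (minus an optional explicitly-accepted tolerance)."""
    return current >= baseline - tolerance

# In CI: fail the build on regression.
# if not gate(new_rate, baseline_rate):
#     raise SystemExit("Pass rate regressed below baseline")
```

A non-zero tolerance is how you record an "explicit acceptance of risk" rather than silently lowering the bar.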
Keep a short evaluation report in the repo: date, model version, pass rate, notes. The paper trail is priceless when an auditor—or your future self—asks what changed last month.
A Real Mini-Example
Prompt task: extract company name and invoice total from an email body.
Criteria:
- JSON is syntactically correct
- company_name exists and is ≤ 50 characters
- total_amount is a positive number
Test set: 20 anonymised emails.
Baseline pass rate: 18/20 = 90 %.
A new engineer reformats the prompt to “make it friendlier.” Re-run shows 14/20 = 70 %. The regression is caught before the branch is merged, and the friendly adjectives are moved to a non-critical part of the prompt. Pass rate returns to 90 %, and the feature ships on time.
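The three criteria in this mini-example translate directly into one scoring function (field names match the example above; the function takes the model's raw output string):

```python
import json

def check_invoice_extraction(output: str) -> int:
    """Score one output against the three criteria; returns 0-3 passes."""
    passes = 0
    try:
        data = json.loads(output)  # criterion 1: syntactically valid JSON
        passes += 1
    except json.JSONDecodeError:
        return passes
    name = data.get("company_name")
    if isinstance(name, str) and 0 < len(name) <= 50:  # criterion 2
        passes += 1
    total = data.get("total_amount")
    # criterion 3: positive number (exclude booleans, a JSON quirk in Python)
    if isinstance(total, (int, float)) and not isinstance(total, bool) and total > 0:
        passes += 1
    return passes
```

Summing the return value over all 20 emails and dividing by 60 reproduces the 18/20-style pass rate from Step 3.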
Common Pitfalls and How to Avoid Them
Pitfall: Letting the test set grow forever
Fix: Cap at 50 examples. When you add five new ones, remove the five oldest to keep maintenance sane.
Pitfall: Only testing on synthetic data
Fix: Keep at least 50 % real user inputs so the distribution matches production.
Pitfall: Ignoring cost
Fix: Add a token-count assertion and fail the build if the median jumps > 15 %.
Pitfall: Re-using the same 20 examples for a year
Fix: Schedule a calendar reminder every quarter to swap in fresh logs.
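The cost assertion from the third pitfall fits in the same CI script. A sketch using the 15 % threshold suggested above (the function name is illustrative):

```python
from statistics import median

def cost_regressed(new_token_counts, baseline_median, threshold=0.15):
    """True when the median output-token count jumped more than
    `threshold` (15% by default) over the recorded baseline."""
    return median(new_token_counts) > baseline_median * (1 + threshold)
```

Median rather than mean keeps one pathological verbose output from failing the whole build.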
Tools That Speed Things Up
You can stay in Excel and Bash, but purpose-built tooling removes friction:
- Prompt Repo eval runner – bulk-executes prompts against the test set and returns the pass-rate KPIs
Choose the lightest stack that makes skipping evaluation harder than doing it.
Checklist: Is Your Evaluation Framework “Good Enough”?
- Criteria are written down and numbered ≤ 5
- Test set is version-controlled alongside the prompt
- At least one metric is automated (JSON, word count, cost)
- Pass rate is recorded for every prompt version
- Regression triggers an automatic warning in CI/CD
If you tick all five boxes, your prompt evaluation is already more rigorous than most teams'.
Takeaway
Prompt evaluation does not need to be fancy; it needs to be consistent. Define what “good” looks like, keep a small test set, score every change, and track the numbers. Do that and prompt quality stops being a guessing game. It becomes a team habit, and your AI features ship with the same confidence you expect from any other production code.