Layer 1: Prompts

How to Optimize Prompts

Most teams use AI to generate an initial system prompt but have no process for optimizing it. System prompts are the static instructions your AI runs on every call: not user messages, but the rules that shape every response. Here's a repeatable process for continuously improving them.

Step 1: Define your success signal

Before touching the prompt, decide what you're measuring. Most prompts have two layers to test:

Two layers of success
1. Trigger accuracy: did the right workflow fire?
   Pass the input to the model with the system prompt and ask:
   "Would you have handled this? Yes or no."
   Binary. Cheap. High volume.

2. Output quality: given it triggered, is the result good?
   You don't need to run the full workflow. Ask:
   "Given this input and system prompt, how would you classify this?"
   Compare the answer to what's expected.

Test triggers first. No point evaluating output quality on a workflow that shouldn't have fired. This applies differently depending on what you're testing: routing prompts are almost entirely trigger accuracy, scoped skills care more about output quality, system prompts need both plus constraint adherence.
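The layer-1 trigger check can be sketched as a thin wrapper around whatever model client you use. `call_model` below is a stand-in, not a real API; swap in your actual client.

```python
def call_model(system_prompt: str, user_message: str) -> str:
    """Stand-in for a real LLM call; replace with your client of choice."""
    return "yes"  # stub so the sketch runs end-to-end


def would_trigger(system_prompt: str, user_input: str) -> bool:
    """Layer 1: binary, cheap, high volume. Ask the model whether the
    workflow governed by this system prompt would have fired."""
    question = (
        f"Input: {user_input}\n\n"
        "Would you have handled this input under your instructions? "
        "Answer with exactly 'yes' or 'no'."
    )
    answer = call_model(system_prompt, question)
    return answer.strip().lower().startswith("yes")
```

With a real client behind `call_model`, run this over every test case and tally yes/no answers against expectations.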

Step 2: Generate test cases

Pull real inputs from production where possible. Include cases that should trigger the workflow, cases that shouldn't, and edge cases that could go either way.

Example: support ticket classifier
Should trigger:
  "I was charged twice for my subscription"
  "Can you help me downgrade my plan?"
  "My API key stopped working after the update"
  "I need a refund for last month"
  "How do I cancel?"

Should NOT trigger:
  "What's your pricing?"
  "Do you have a Go SDK?"
  "Can I talk to someone about a partnership?"
  "How does your product compare to [competitor]?"
  "I'd like to schedule a demo"

Edge cases:
  "I'm having trouble with billing AND want to see new features"
  "Cancel my subscription and recommend an alternative"
  "Is there a discount if I stay?"

Aim for 30 to 50 test cases. Weight by real-world frequency if you have usage data. A trigger that fires 500 times a day matters more than one that fires twice a month.
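One way to encode a test set like the one above. The category names and the `weight` field (real-world frequency from usage data) are illustrative, not from the original examples.

```python
# Each case records the input, whether it should trigger, the expected
# category when it does, and an optional frequency weight from usage data.
TEST_CASES = [
    {"input": "I was charged twice for my subscription",
     "should_trigger": True, "expected": "billing", "weight": 5},
    {"input": "How do I cancel?",
     "should_trigger": True, "expected": "cancellation", "weight": 3},
    {"input": "What's your pricing?",
     "should_trigger": False, "expected": None, "weight": 4},
    {"input": "Cancel my subscription and recommend an alternative",
     "should_trigger": True, "expected": "cancellation", "weight": 1},  # edge case
]
```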

Step 3: Benchmark the baseline

Run every test case against the current prompt. For trigger accuracy, pass each input and ask the model what it would have done. For output quality, compare the classification against the expected result. Record token count per call.

Baseline: ticket classifier
Test cases:      48

Trigger accuracy:
  Correct fires:   31/36 (86.1%)
  Correct rejects:  7/12 (58.3%)
  → 5 false positives (fired on sales/partner queries)

Output quality (on correct triggers):
  Correct classification:  27/31 (87.1%)
  Wrong category:           4/31

Combined quality score: 70.8% (34/48 fully correct end-to-end: 27 correct classifications plus 7 correct rejects)

Avg tokens/call: 2,840 (prompt) + 380 (completion)
Cost per call:   $0.0091

This is your floor. Every optimization is measured against these numbers. No baseline means no proof that changes helped.
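The baseline arithmetic can be reproduced with a small scorer. This sketch counts a correct reject as fully correct end-to-end; adjust if your definition differs.

```python
def score(results):
    """results: list of dicts with keys should_trigger, fired,
    expected, predicted (predicted is None when nothing fired)."""
    fires = [r for r in results if r["should_trigger"]]
    rejects = [r for r in results if not r["should_trigger"]]
    correct_fires = sum(r["fired"] for r in fires)
    correct_rejects = sum(not r["fired"] for r in rejects)
    # Output quality is only measured on cases that correctly fired.
    triggered = [r for r in fires if r["fired"]]
    correct_class = sum(r["predicted"] == r["expected"] for r in triggered)
    return {
        "fire_acc": correct_fires / len(fires),
        "reject_acc": correct_rejects / len(rejects),
        "class_acc": correct_class / len(triggered) if triggered else 0.0,
        "combined": (correct_class + correct_rejects) / len(results),
    }
```

Track token count and cost per call alongside these numbers; the scorer only covers quality.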

Step 4: Generate optimization candidates

This is where most teams guess. Instead, use agents to propose changes systematically. There are three approaches; use whichever fits:

Multi-agent consensus
Spawn 10 agents with different analytical framings
(risk-averse, contrarian, first-principles, etc.)
Each independently proposes optimizations.
Take the changes most agents agree on. Those are safe bets.
Flag the splits for human judgment.

Agent debate
Spawn 3 agents into a shared conversation:
  Architect: thinks in systems
  Pragmatist: optimizes for shipping
  Critic: finds edge cases
Three rounds of debate. They argue, concede, converge.
Synthesize the result.

Single model iteration
Pass the prompt + baseline results to a single model.
"Here's my system prompt. Here's where it's failing.
  5 false positives on sales queries.
  4 misclassifications on billing complaints.
Propose specific changes to fix these failures
without breaking what's already working."

Each approach generates candidate rewrites. The next step decides which ones ship.

Step 5: Test candidates against the same baseline

Run the exact same test cases against each candidate. Compare to the baseline. Keep the change if the quality score goes up. Revert if it drops. One change at a time so you know what worked.
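The keep-or-revert gate can be as simple as this sketch. `run_tests` is assumed to return the combined quality score for a prompt over the same fixed test set.

```python
def pick_best(baseline_prompt, baseline_score, candidates, run_tests):
    """Test each candidate against the same cases; keep only improvements."""
    best_prompt, best_score = baseline_prompt, baseline_score
    for prompt in candidates:
        s = run_tests(prompt)  # identical test set every run
        if s > best_score:     # keep the change only if quality goes up
            best_prompt, best_score = prompt, s
    return best_prompt, best_score
```

In practice, feed candidates in one at a time and re-baseline after each accepted change, so you know exactly which edit moved the number.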

After optimization: same 48 test cases
Test cases:      48

Trigger accuracy:
  Correct fires:   35/36 (97.2%)
  Correct rejects: 11/12 (91.7%)
  → 1 false positive (down from 5)

Output quality (on correct triggers):
  Correct classification:  34/35 (97.1%)
  Wrong category:           1/35

Combined quality score: 93.8% (45/48 fully correct end-to-end)

Avg tokens/call: 1,020 (prompt) + 290 (completion)
Cost per call:   $0.0031

Quality: +23pts  (70.8% → 93.8% end-to-end)
Tokens:  -59%    (3,220 → 1,310 per call)
Cost:    -66%    ($0.0091 → $0.0031 per call)

Step 6: Map to business outcomes

Optimizing prompts in isolation is half the picture. Map each workflow to the revenue it supports to know where optimization matters most.

Token-cost-to-outcome per workflow
Workflow              Cost/mo    Revenue    Cost:Revenue   Priority
────────────────────────────────────────────────────────────────────
Ticket classifier     $382       $18,000    2.1%           HIGH
Lead scoring          $254       $18,000    1.4%           HIGH
Contract summarizer   $89        $6,000     1.5%           MEDIUM
Onboarding assistant  $136                  cost center    REVIEW
Code review agent     $244                  cost center    REVIEW

Most teams optimize the prompt that annoys them most. This table tells you which prompt actually matters: the one connected to the highest-value outcome, running at the highest volume, with the most room to improve.
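The mapping itself is a small calculation, assuming you already know monthly token cost and attributed revenue per workflow. The priority thresholds here are illustrative, not from the original table.

```python
def cost_to_outcome(workflows):
    """workflows: list of (name, monthly_cost, monthly_revenue_or_None).
    Returns (name, cost:revenue ratio, priority) per workflow."""
    rows = []
    for name, cost, revenue in workflows:
        if revenue is None:
            rows.append((name, None, "REVIEW"))  # cost center, no revenue link
        else:
            ratio = cost / revenue
            # Illustrative cut: high-revenue workflows get priority.
            priority = "HIGH" if revenue >= 10_000 else "MEDIUM"
            rows.append((name, round(ratio, 3), priority))
    return rows
```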

Then do it again

This isn't a one-time cleanup. Prompts drift. New rules get added, models get updated, edge cases pile up. The loop runs continuously:

1. Define success signal (trigger accuracy + output quality)
2. Generate test cases from production inputs
3. Benchmark current quality score and token cost
4. Generate optimization candidates (consensus, debate, or single model)
5. Test candidates → keep if quality improves
6. Map to business outcomes → prioritize by value
7. Re-benchmark quarterly, or after any model/prompt change

The same discipline applies to every layer of your instruction stack. Binary routing checks for orchestrators, rubric-based scoring for task prompts, constraint adherence testing for system prompts. The method adapts. The loop doesn't change.

Quality: +23pts (measured, not guessed)
Cost:    -66% (token savings at higher quality)
ROI:     mapped (every workflow tied to an outcome)

Benchmark an AI workflow.

We'll benchmark one of your AI workflows and show where the biggest gains are in cost, quality, and speed.

Fixed-scope benchmark. You keep everything.