Layer 3: MCP Tools

How to Optimize MCP Tools

Every tool description is injected on every turn, whether the model needs it or not. Vague descriptions cause wrong tool selection. Too many tools create decision paralysis. Here's a repeatable process to trim, rewrite, and restructure your tool layer.

Step 1: Define your success signal

Tool layer optimization has two signals to measure:
1. Tool selection accuracy: did the model pick the right tool?
   For each query, check if the model called the correct tool
   on the first attempt. Wrong tool selection means wasted calls,
   retries, and broken workflows.

2. Call efficiency: how many tool calls to complete the task?
   A well-structured tool layer lets the model resolve a query
   in 1-3 calls. A bloated one forces 5-10 calls as the model
   guesses, retries, and chains unnecessary lookups.

Tool selection accuracy is the primary signal. If the model picks the wrong tool, call efficiency collapses downstream. Fix selection first, then optimize for fewer calls.

Step 2: Generate test cases

Pull real queries from production that trigger tool calls. Include queries that should map to a single tool, queries that need multiple tools in sequence, and queries that should not trigger any tool at all.

Example: support agent with 28 exposed tools
Single tool (clear mapping):
  "What's the status of order #4821?"          → search_orders
  "Update the customer's email address"        → update_user
  "Create a ticket for this billing issue"     → create_ticket

Multi-tool (sequence required):
  "Refund order #4821 and notify the customer" → get_order → refund → send_notification
  "Look up this user's last 3 orders"          → get_user → search_orders

No tool needed:
  "What's your return policy?"                 → answer from prompt, no tool call
  "Thanks, that's all I needed"                → close conversation
  "Can you explain what that error means?"     → explain, don't look up

Ambiguous (model has to decide):
  "Search for information about this customer"  → get_user? search_orders? both?
  "Send them something about the update"        → send_notification? which template?

Aim for 30 to 50 test queries. The ambiguous cases are the most important. Those are where vague tool descriptions cause the model to guess wrong.
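One way to make this test set repeatable is to encode it as data. A sketch using the example queries above (the expected-tool labels are the illustrative mappings from this section):

```python
# Each case: (query, expected tool calls in order).
# An empty list means the correct behavior is calling no tool at all.
TEST_CASES = [
    # Single tool (clear mapping)
    ("What's the status of order #4821?", ["search_orders"]),
    ("Update the customer's email address", ["update_user"]),
    ("Create a ticket for this billing issue", ["create_ticket"]),
    # Multi-tool (sequence required)
    ("Refund order #4821 and notify the customer",
     ["get_order", "refund", "send_notification"]),
    # No tool needed
    ("What's your return policy?", []),
    # Ambiguous: record the pick you decide is correct so the
    # benchmark can score the model against it
    ("Search for information about this customer", ["get_user"]),
]
```

Grow this toward 30 to 50 cases by sampling production logs, and keep the ambiguous ones even when labeling them forces a judgment call.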

Step 3: Benchmark the baseline

Run every test query against the current tool configuration. Record which tool the model selected, whether it was correct, how many total calls it made, and the token overhead from tool descriptions.

Baseline: 28 tools exposed
Test queries:      48

Tool selection accuracy:
  Correct first pick:   31/48 (64.6%)
  Wrong tool, retried:  11/48 (22.9%)
  Wrong tool, failed:    6/48 (12.5%)

Call efficiency:
  Avg calls per task:   6.2 (target: 2-3)
  Unnecessary calls:    3.1 per task (lookups, retries, wrong tools)

Tool utilization:
  Tools called at least once:  17/28 (60.7%)
  Tools never called:          11/28 (39.3%)

Token overhead:
  Tool descriptions:   6,120 tokens per turn
  Avg total per call:  6,120 (tools) + 1,800 (prompt) + 640 (completion) = 8,560 tokens
  Cost per call:       $0.0257

This is your floor. Notice that 11 tools were never called. That is 2,400 tokens of dead weight on every single turn. And the 64.6% first-pick accuracy means the model is guessing wrong on more than a third of queries.
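A harness for this step can be small. In the sketch below, `call_model` is a stand-in for however you run a query against the agent and capture its tool calls; it is an assumed interface, not a real API:

```python
def benchmark(test_cases, call_model):
    """Run each query once; tally first-pick accuracy and call counts.

    call_model(query) is assumed to return the ordered list of tool
    names the model actually called (empty list = no tool call).
    """
    correct_first = 0
    total_calls = 0
    for query, expected in test_cases:
        called = call_model(query)
        first_expected = expected[0] if expected else None
        first_called = called[0] if called else None
        if first_called == first_expected:
            correct_first += 1
        total_calls += len(called)
    n = len(test_cases)
    return {
        "selection_accuracy": correct_first / n,
        "avg_calls_per_task": total_calls / n,
    }
```

Run it once against the baseline configuration, save the report, and reuse the exact same test cases for every candidate in Step 5.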

Step 4: Generate optimization candidates

Tool layer optimization has three levers. Use agents to propose changes across all three:

Lever 1: Rewrite tool descriptions
The #1 cause of wrong-tool selection is vague descriptions.

Before: "Search the database for information. You can search
for users, orders, products, or anything else."

After: "Query orders by user_id or order_id. Returns:
order_status, total, created_at. Use ONLY for order lookup.
Not users (use get_user) or products (use search_catalog)."

Scoped descriptions with negative constraints tell the model
exactly when to use this tool and when NOT to.
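In an MCP tool listing, that rewritten text goes in the tool's `description` field. A sketch (the `name`/`description`/`inputSchema` fields follow the MCP tool schema; the input schema details here are illustrative):

```python
# Tool definition carrying the scoped, negatively-constrained description.
SEARCH_ORDERS = {
    "name": "search_orders",
    "description": (
        "Query orders by user_id or order_id. Returns: order_status, "
        "total, created_at. Use ONLY for order lookup. "
        "Not users (use get_user) or products (use search_catalog)."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "user_id": {"type": "string"},
            "order_id": {"type": "string"},
        },
    },
}
```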

Lever 2: Prune unused tools and add routing skills
Remove the 11 tools that were never called.
For complex multi-tool sequences, create a skill that
orchestrates the calls instead of letting the model figure
out the sequence on its own.

Example: instead of the model chaining
  get_order → check_refund_eligibility → process_refund → send_notification
Create a "process_refund" skill that handles the whole sequence.
One tool call instead of four. Less room for error.
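Sketched server-side, with every internal operation stubbed out as a hypothetical placeholder (the point is that the sequencing lives in code, not in the model):

```python
# Stand-ins for the four internal operations (hypothetical; replace
# with your real order, payment, and notification clients).
def get_order(order_id): return {"id": order_id, "customer_id": "c1", "total": 42.0}
def check_refund_eligibility(order): return order["total"] > 0
def issue_refund(order): return {"amount": order["total"]}
def send_notification(customer_id, template, amount): pass

def process_refund(order_id: str) -> dict:
    """One exposed tool that runs the whole refund sequence internally,
    so the model makes a single call instead of chaining four."""
    order = get_order(order_id)
    if not check_refund_eligibility(order):
        return {"status": "rejected", "reason": "not eligible"}
    refund = issue_refund(order)
    send_notification(order["customer_id"], "refund_confirmed", refund["amount"])
    return {"status": "refunded", "amount": refund["amount"]}
```

The model now sees one tool, `process_refund`, with one clear description, instead of four tools it has to sequence correctly on its own.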

Lever 3: Add routing hints to the system prompt
Add a tool routing section to the system prompt so the model
knows which tool to reach for first.

Tool routing
- Order questions → search_orders
- User account issues → get_user, update_user
- Billing disputes → get_invoice, create_ticket
- Refunds → process_refund (skill, handles full sequence)
- Everything else → ask for clarification first

Use the same agent approaches (consensus, debate, or single model) to propose which descriptions to rewrite, which tools to prune, and where to add skills or routing hints.

Step 5: Test candidates against the same baseline

Run the exact same 48 test queries against the optimized tool configuration. Compare tool selection accuracy, call efficiency, and token cost to the baseline.

After optimization: same 48 test queries
Test queries:      48

Tool selection accuracy:
  Correct first pick:   45/48 (93.8%)
  Wrong tool, retried:   2/48 (4.2%)
  Wrong tool, failed:    1/48 (2.1%)

Call efficiency:
  Avg calls per task:   2.4 (down from 6.2)
  Unnecessary calls:    0.3 per task (down from 3.1)

Tool configuration:
  Tools exposed:        17 (down from 28, 11 pruned)
  Skills added:          3 (refund, onboarding, escalation)
  Routing hints:         8

Token overhead:
  Tool descriptions:   1,840 tokens per turn (down from 6,120)
  Avg total per call:  1,840 (tools) + 1,800 (prompt) + 380 (completion) = 4,020 tokens
  Cost per call:       $0.0121

Selection: +29 pts (64.6% → 93.8% first-pick accuracy)
Calls: 2.6x fewer (6.2 → 2.4 avg calls per task)
Cost: -53% ($0.0257 → $0.0121 per call)
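The keep/reject rule for this step reduces to a two-condition comparison. A sketch (the metric names match the benchmark report; how strict to make each threshold is a judgment call):

```python
def keep_candidate(baseline: dict, candidate: dict) -> bool:
    """Accept a tool-layer change only if it improves first-pick
    selection accuracy without increasing average call count."""
    return (candidate["selection_accuracy"] > baseline["selection_accuracy"]
            and candidate["avg_calls_per_task"] <= baseline["avg_calls_per_task"])

baseline = {"selection_accuracy": 0.646, "avg_calls_per_task": 6.2}
candidate = {"selection_accuracy": 0.938, "avg_calls_per_task": 2.4}
```

Selection first, efficiency second: a candidate that picks tools better but makes more calls gets rejected, because extra calls usually signal the model is still compensating for unclear boundaries.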

Step 6: Map to business outcomes

Tool layer waste compounds differently than prompt or context waste. Every wrong tool call is a wasted API call, added latency, and potential for cascading errors. The cost is not just tokens. It is reliability.

Token-cost-to-outcome per workflow
Workflow              Calls/mo   Before      After       Savings/mo
────────────────────────────────────────────────────────────────────
Support agent         42,000     $1,079      $508        $571
Order management      28,000     $719        $339        $380
Billing automation    15,000     $386        $182        $204
Internal tools         8,500     $218        $103        $115

Total monthly savings: $1,270
Annual savings:        $15,240

Additional: 3.8 fewer calls per task on average = faster responses,
fewer errors, fewer escalations to human agents.
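The table rows are just per-call cost times monthly volume. A sketch of the arithmetic (workflow volumes are the figures from the table above):

```python
COST_BEFORE, COST_AFTER = 0.0257, 0.0121  # per call, from Steps 3 and 5

MONTHLY_CALLS = {
    "Support agent": 42_000,
    "Order management": 28_000,
    "Billing automation": 15_000,
    "Internal tools": 8_500,
}

monthly_savings = {
    name: calls * (COST_BEFORE - COST_AFTER)
    for name, calls in MONTHLY_CALLS.items()
}
total_monthly = sum(monthly_savings.values())
annual = total_monthly * 12
# total_monthly lands near the $1,270/month in the table; the small
# difference comes from rounding each row before summing.
```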

Tool layer optimization often has the highest ROI because the savings are multiplicative. Fewer calls per task means less cost, less latency, and fewer failure points. A task that took 6 calls now takes 2.

Then do it again

Tool configurations drift just like prompts. New MCP servers get connected, new tools get exposed, descriptions get copy-pasted from docs without editing. The loop runs continuously:

1. Define success signal (selection accuracy + call efficiency)
2. Generate test queries from production tool call logs
3. Benchmark selection accuracy, call count, and token overhead
4. Generate optimization candidates (rewrite, prune, add skills/routing)
5. Test candidates: keep if selection improves at lower call count
6. Map to business outcomes: prioritize by call volume x savings
7. Re-audit quarterly, or when new tools/servers are added

The same discipline applies to any tool integration. MCP servers, function calling, API tools, custom skills. If the model has to choose between tools, the descriptions and routing determine whether it chooses right.

Selection: +29 pts (right tool, first try)
Calls: 2.6x fewer (less cost, less latency, fewer errors)
Cost: -53% per call
