Layer 4: Model Routing
How to Optimize Model Routing
Most teams pick one model and use it for everything. Opus for classification. Opus for generating test data. Opus for voting on options. Not every task needs your most expensive model. Here's a repeatable process to route the right model to the right task.
Step 1: Define your success signal
Model selection optimization has two things to measure:
1. Quality parity: does a cheaper model produce the same result? Run the same task on multiple models and compare outputs. If a $0.25/MTok model matches a $15/MTok model on this task, you are overpaying by 60x.
2. Cost per quality point: what are you paying for each unit of quality? Some tasks need the expensive model. The question is which ones. Map every task type to the cheapest model that meets the quality bar.
The goal is not to use the cheapest model everywhere. It is to stop using the most expensive model on tasks that do not need it. Most workflows spend 80% of tokens on tasks that run just as well on a smaller model.
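The two signals above can be sketched as a couple of helpers. This is a minimal illustration; the function names are ours, and the figures mirror the classification numbers from the benchmark in Step 3.

```python
# Sketch of the two success signals. Helper names and the 0.3-point
# parity tolerance are illustrative assumptions, not a fixed standard.

def cost_per_quality_point(cost_per_call: float, quality_score: float) -> float:
    """Dollars paid per unit of rubric quality; lower is better."""
    return cost_per_call / quality_score

def quality_parity(cheap_score: float, expensive_score: float,
                   max_delta: float = 0.3) -> bool:
    """True if the cheaper model lands within max_delta of the expensive one."""
    return (expensive_score - cheap_score) <= max_delta

# Classification task, per-call averages from the Step 3 benchmark
opus = {"cost": 0.012, "quality": 4.6}
haiku = {"cost": 0.0002, "quality": 4.3}

at_parity = quality_parity(haiku["quality"], opus["quality"])  # within 0.3
overpay_ratio = opus["cost"] / haiku["cost"]                   # roughly 60x
```

If `at_parity` holds and the overpay ratio is large, the task is a downgrade candidate.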
Step 2: Generate test cases
Catalog every distinct task type in your AI workflow. Group them by complexity. Pull representative examples of each.
Low complexity (likely Haiku-capable):
- Classification: "Is this a billing question or a feature request?"
- Tagging: "Extract the product name and issue type from this ticket"
- Routing: "Which team should handle this?"
- Validation: "Does this JSON match the expected schema?"

Medium complexity (likely Sonnet-capable):
- Summarization: "Summarize this 3-page contract"
- Research: "Find the relevant docs for this API error"
- Sub-agent votes: "Which of these 3 options is best?"
- Data extraction: "Pull all dates and dollar amounts from this email"

High complexity (may need Opus):
- Multi-step reasoning: "Debug this failing pipeline"
- Code generation: "Write a migration script for this schema change"
- Strategy: "Propose an architecture for this new feature"
- Ambiguous judgment: "Should we approve this edge-case refund?"
Aim for 10 to 15 representative examples per complexity tier. The medium tier is the most important to test, because that is where the biggest savings live: tasks that feel like they need Opus but actually run fine on Sonnet.
Step 3: Benchmark the baseline
Run every test case on all three model tiers. Score the output quality for each. Record the cost per call. This gives you a quality-to-cost ratio for every task type on every model.
Test cases: 45 (15 per complexity tier)
Current setup: all tasks on Opus ($15/MTok)

Quality scores by tier (rubric: correctness, completeness, format):
- Low complexity: Opus 4.6/5 | Sonnet 4.5/5 | Haiku 4.3/5
- Medium complexity: Opus 4.4/5 | Sonnet 4.2/5 | Haiku 3.1/5
- High complexity: Opus 4.5/5 | Sonnet 3.8/5 | Haiku 2.4/5

Cost per call (avg):
- Low complexity: Opus $0.012 | Sonnet $0.003 | Haiku $0.0002
- Medium complexity: Opus $0.038 | Sonnet $0.008 | Haiku $0.0006
- High complexity: Opus $0.065 | Sonnet $0.014 | Haiku $0.0011

Current monthly cost (all Opus): $2,840
Task distribution: 45% low, 35% medium, 20% high
This is your floor. Notice that Haiku scores 4.3/5 on low-complexity tasks where Opus scores 4.6/5. That 0.3 point difference costs 60x more per call. And Sonnet matches Opus within 0.2 points on medium tasks at 5x less cost.
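The benchmark loop itself is simple. Here is one sketch of it; `call_model`, `score_output`, and `cost_of_call` are placeholders for your inference client, grading rubric, and pricing math — none of these names come from a real SDK.

```python
# Sketch of the Step 3 benchmark: run every test case on every model tier
# and aggregate a quality/cost pair per (task type, model).
from collections import defaultdict
from statistics import mean

def benchmark(test_cases, models, call_model, score_output, cost_of_call):
    """Return {(task_type, model): (avg_quality, avg_cost)} over all cases."""
    runs = defaultdict(list)
    for case in test_cases:
        for model in models:
            output = call_model(model, case["prompt"])
            quality = score_output(case, output)      # rubric score, 0-5
            cost = cost_of_call(model, case, output)  # dollars for this call
            runs[(case["task_type"], model)].append((quality, cost))
    return {
        key: (mean(q for q, _ in pairs), mean(c for _, c in pairs))
        for key, pairs in runs.items()
    }
```

The resulting table is exactly what Steps 4 and 5 consume: a quality-to-cost ratio for every task type on every model.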
Step 4: Generate optimization candidates
Model selection has three optimization patterns. Use agents to propose a routing strategy:
Pattern 1: Task-level routing

Map every task type to the cheapest model that meets your quality threshold.

Task               Model    Quality   Cost
──────────────────────────────────────────────
Classification     Haiku    4.3/5     $0.0002
Tagging            Haiku    4.4/5     $0.0003
Summarization      Sonnet   4.2/5     $0.008
Sub-agent votes    Sonnet   4.1/5     $0.003
Code generation    Opus     4.5/5     $0.065
Complex reasoning  Opus     4.5/5     $0.065

Quality threshold: 4.0/5 minimum. Anything above that, use the cheapest model that clears the bar.
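The routing table translates directly into code. A minimal sketch — the task-type keys and model names here are illustrative labels, not real API model IDs:

```python
# Static task-level routing: map each task type to the cheapest model
# that cleared the 4.0/5 quality bar in the benchmark.

QUALITY_BAR = 4.0  # minimum rubric score the routed model must have scored

ROUTES = {
    "classification": "haiku",
    "tagging": "haiku",
    "summarization": "sonnet",
    "sub_agent_vote": "sonnet",
    "code_generation": "opus",
    "complex_reasoning": "opus",
}

def route(task_type: str) -> str:
    # Unknown task types fall back to the strongest model until benchmarked.
    return ROUTES.get(task_type, "opus")
```

The fallback matters: a new, unbenchmarked task type should default to the expensive model, not the cheap one.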
Pattern 2: Cascade with escalation

For tasks where quality varies by input complexity:

1. Try Haiku first
2. Check confidence or output quality
3. If below threshold, retry on Sonnet
4. If still below threshold, escalate to Opus

In practice: ~75% resolve at the Haiku tier, ~20% need Sonnet, ~5% escalate to Opus.

Best for: classification, routing, validation. The majority of inputs are straightforward.
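The cascade can be sketched in a few lines. `call_model` and `confidence` are placeholders you supply: the confidence check might be a schema validation, a verifier model, or a self-rating — this is an assumption, not a prescribed API.

```python
# Cascade with escalation: try the cheap tier first, escalate while the
# confidence check fails. Model names are illustrative labels.

CASCADE = ["haiku", "sonnet", "opus"]

def run_with_cascade(prompt, call_model, confidence, threshold=0.8):
    """Return (model_used, output), escalating through CASCADE as needed."""
    for model in CASCADE:
        output = call_model(model, prompt)
        if confidence(output) >= threshold:
            return model, output
    # Every tier fell short: keep the strongest model's attempt anyway.
    return CASCADE[-1], output
```

Note the cascade's cost model: the ~25% of inputs that escalate pay for two or three calls, which is why it only wins when most inputs resolve at the cheap tier.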
Pattern 3: Multi-agent consensus

For decisions and evaluations, 10 cheap agents beat 1 expensive one.

10x Sonnet agents with different framings:
- Cost: ~$0.40 total
- Result: consensus filters hallucinations, surfaces edge cases

1x Opus agent:
- Cost: ~$0.50
- Result: single perspective, no error correction

Cheaper AND more reliable. Stochastic variation across many cheap models is a better strategy than one expensive run.
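One way to sketch the consensus pattern: several cheap runs with varied framings, then a majority vote. `call_model` is a placeholder, and the framings are made-up examples; with a real API you would also vary temperature or the system prompt per run.

```python
# Multi-agent consensus: n cheap runs, majority vote on the answer.
from collections import Counter

FRAMINGS = [
    "Answer directly.",
    "Answer as a skeptical reviewer.",
    "Answer after listing one risk.",
]

def consensus(prompt, call_model, model="sonnet", n=9):
    """Return (winning_answer, agreement_ratio) across n varied runs."""
    votes = [
        call_model(model, f"{FRAMINGS[i % len(FRAMINGS)]}\n\n{prompt}")
        for i in range(n)
    ]
    answer, count = Counter(votes).most_common(1)[0]
    return answer, count / n
```

The agreement ratio is a free by-product: a low ratio flags the exact inputs worth escalating to the expensive model.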
Use the same agent approaches (consensus, debate, or single model) to propose which tasks to downgrade, where to add cascades, and which decisions should use multi-agent consensus instead of a single expensive call.
Step 5: Test candidates against the same baseline
Run the exact same 45 test cases through the proposed routing strategy. Compare quality scores and cost to the all-Opus baseline.
Test cases: 45

Routing strategy:
- Low complexity: Haiku (15 tasks)
- Medium complexity: Sonnet (12 tasks) + Cascade (3 tasks)
- High complexity: Opus (10 tasks) + Consensus (5 tasks)

Quality scores (post-routing):
- Low complexity: 4.3/5 (was 4.6 on Opus; delta -0.3, acceptable)
- Medium complexity: 4.2/5 (was 4.4 on Opus; delta -0.2, acceptable)
- High complexity: 4.5/5 (unchanged; Opus still handles these)

Cost per call (avg, post-routing):
- Low complexity: $0.0002 (was $0.012)
- Medium complexity: $0.009 (was $0.038)
- High complexity: $0.052 (was $0.065)

Monthly cost (routed): $486
Previous monthly cost: $2,840
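The accept/reject decision for a candidate is mechanical. A sketch of the gate, using this doc's post-routing figures as the example inputs:

```python
# Step 5 acceptance gate: keep a routing candidate only if every tier
# stays at or above the quality bar AND total cost drops.

QUALITY_BAR = 4.0

def accept(candidate_scores, candidate_cost, baseline_cost):
    """candidate_scores: {tier: avg_quality}; costs in $/month."""
    quality_ok = all(s >= QUALITY_BAR for s in candidate_scores.values())
    return quality_ok and candidate_cost < baseline_cost

routed = {"low": 4.3, "medium": 4.2, "high": 4.5}
keep = accept(routed, candidate_cost=486, baseline_cost=2840)
```

Both conditions are hard gates: a candidate that saves money but drops one tier below 4.0/5 is rejected outright, not traded off.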
Results:
- Cost: -83% ($2,840 → $486 per month)
- Quality: -0.2 pts avg (4.5 → 4.3; within threshold)
- Latency: 2.8x faster (smaller models respond faster)
Step 6: Map to business outcomes
Model selection optimization delivers the largest percentage cost reduction of any surface because the price difference between tiers is 5 to 60x. Even a small shift in task routing changes the economics dramatically.
Model tier          Tasks   Before    After   Savings/mo
────────────────────────────────────────────────────────
Low (→ Haiku)       45%     $1,278    $18     $1,260
Medium (→ Sonnet)   35%     $994      $284    $710
High (→ Opus)       20%     $568      $184    $384

Total monthly savings: $2,354
Annual savings: $28,248
Quality impact: -0.2 points average. No task dropped below the 4.0/5 minimum threshold.
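The table above reduces to a few lines of arithmetic, using this doc's monthly figures:

```python
# Savings math for the routed vs. all-Opus setup ($/month per tier).
before = {"low": 1278, "medium": 994, "high": 568}  # all tasks on Opus
after = {"low": 18, "medium": 284, "high": 184}     # routed

monthly_savings = sum(before[t] - after[t] for t in before)  # $2,354
annual_savings = monthly_savings * 12                        # $28,248
reduction = 1 - sum(after.values()) / sum(before.values())   # ~83%
```

Note where the savings concentrate: the low tier contributes more than half the total despite being the cheapest tasks, purely because of the 60x price gap.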
The insight is not that cheaper models are good enough. It is that expensive models are wasted on simple tasks. The quality difference between Opus and Haiku on a classification task is negligible. The cost difference is 60x.
Then do it again
Model capabilities change with every release. A task that needed Opus six months ago might run fine on Sonnet today. New models get added, pricing changes, quality thresholds shift. The loop runs continuously:
1. Define the success signal (quality parity + cost per quality point)
2. Catalog task types and pull representative examples
3. Benchmark every task on every model tier
4. Generate routing strategies (task-level, cascade, or consensus)
5. Test candidates: keep them if quality stays above threshold at lower cost
6. Map to business outcomes: prioritize by task volume x cost delta
7. Re-benchmark after every major model release
The same discipline applies to any model provider. Anthropic, OpenAI, Google, open-source. Every provider has a tier structure. The question is always the same: which tasks are you overpaying for?
The bottom line:
- Cost: -83% (right model for the right task)
- Quality: maintained (above threshold on every task)
- Tasks routed: 80% (shifted to cheaper models without quality loss)