Multi-Model Orchestration: When Opus, Sonnet, Haiku, and Grok Each Have a Role

Building an LLM-agnostic architecture where model selection is a feature, not an accident. Five models, each with a role, ready for whatever comes next.

The model landscape changes faster than my code. In the past year alone, I've watched Claude evolve through Sonnet 4, Haiku 4.5, and now Opus 4.5. Grok went from novelty to serious contender. New models emerge weekly from OpenAI, Google, Meta, Mistral, and dozens of others.

If I hardcoded model IDs into my framework, I'd be rewriting integrations constantly. Instead, I built for model turnover from day one.

The Multi-Model Reality

No single model dominates across all dimensions. The benchmarks tell a clear story:

  • Claude Opus 4.5: 80.9% SWE-bench (first model to exceed 80%), frontier reasoning
  • Claude Sonnet 4.5: 77.2% SWE-bench, balanced cost/capability
  • Claude Haiku 4.5: Near-Sonnet performance at 1/3 the cost, lowest misalignment rate
  • Grok 3/4: Real-time X integration, reasoning traces, different thinking style
  • Perplexity: Citation-heavy research, real-time web grounding

The cost differentials are significant: Grok Fast at $0.20/M input tokens versus Opus at $5/M input - a 25x difference on input alone. Using the wrong model for a task isn't a minor inefficiency; it can inflate the cost of routine work 25-fold.
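
A quick back-of-the-envelope calculation makes that concrete. The per-million-token input prices below are the ones quoted above; the task volume is hypothetical and output pricing is ignored for simplicity.

# Input-cost comparison using the input prices quoted above (USD per 1M tokens).
INPUT_PRICE_PER_M = {
    "grok-fast": 0.20,
    "claude-opus": 5.00,
}

def input_cost(model: str, input_tokens: int) -> float:
    """Dollar cost of the input side of a request."""
    return input_tokens / 1_000_000 * INPUT_PRICE_PER_M[model]

# A month of routine checks: ~2,000 tasks at ~5,000 input tokens each.
monthly_tokens = 2_000 * 5_000
print(f"Grok Fast: ${input_cost('grok-fast', monthly_tokens):.2f}")    # $2.00
print(f"Opus:      ${input_cost('claude-opus', monthly_tokens):.2f}")  # $50.00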

LLM-Agnostic by Architecture, Not Accident

Most multi-model discussions focus on "which model is best" comparisons. That's the wrong question. The architecture is the insight, not the model choice.

In my framework, Claude is the daily driver because Claude Code is my orchestrator. But every model reference uses dynamic discovery:

from tools.research.openrouter import get_latest_model

# Always gets the latest version - no manual updates
opus = get_latest_model("anthropic", prefer_keywords=["opus"])
sonnet = get_latest_model("anthropic", prefer_keywords=["sonnet"])
grok = get_latest_model("xai", prefer_keywords=["grok"])

No hardcoded model IDs in skills or agents. When Opus 5 releases, I update one config. When a new competitor emerges, I add it to the routing matrix.

OpenRouter provides the abstraction layer: 300+ models from 60+ providers through a single API. One key, one contract, one bill. The same patterns work for any model ecosystem.
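
A minimal sketch of what that single contract looks like, using OpenRouter's OpenAI-compatible chat completions endpoint. The model slugs and prompts here are illustrative, not the exact IDs my framework resolves.

import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def complete(model: str, prompt: str) -> str:
    """Send a chat completion to any OpenRouter-hosted model through one contract."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Same call shape whether the request lands on Anthropic, xAI, or anyone else.
summary = complete("anthropic/claude-sonnet-4.5", "Summarize this design doc in two sentences.")
critique = complete("x-ai/grok-4", f"Play adversarial reviewer for this summary:\n{summary}")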

The 5-Model Ecosystem

Each model has a defined role in my workflows:

Claude Opus 4.5 - Strategic Thinking

  • Novel architecture decisions
  • Complex multi-step reasoning
  • Long-horizon autonomous tasks
  • When getting it right matters more than speed

Claude Sonnet 4.5 - Daily Driver

  • Day-to-day coding assistance
  • Standard skill execution
  • Most writing and documentation
  • The default when no specialized need exists

Claude Haiku 4.5 - Fast Validation

  • Quick checks and QA validation passes where speed matters
  • Setup and scaffolding tasks
  • Near-Sonnet capability at a fraction of the cost

Grok - Adversarial Review

  • Adversarial QA review from a second perspective
  • Real-time data via X integration
  • A different thinking style that surfaces different issues

Perplexity - Research Grounding

  • Citation-heavy OSINT research
  • When I need sources, not opinions
  • Fact verification with provenance

The workflow pattern: "Haiku for setup, Sonnet for builds, Opus for reviews" - cheap and fast where stakes are low, deliberate where getting it right matters.
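
Codified, that pattern is a few lines of config rather than a habit. The sketch below reuses the get_latest_model helper shown earlier; the stage names are just labels for illustration.

from tools.research.openrouter import get_latest_model

# "Haiku for setup, Sonnet for builds, Opus for reviews" as data, resolved at runtime.
WORKFLOW_MODELS = {
    "setup":  get_latest_model("anthropic", prefer_keywords=["haiku"]),
    "build":  get_latest_model("anthropic", prefer_keywords=["sonnet"]),
    "review": get_latest_model("anthropic", prefer_keywords=["opus"]),
}

def model_for(stage: str) -> str:
    """Return the model ID for a workflow stage, defaulting to the daily driver."""
    return WORKFLOW_MODELS.get(stage, WORKFLOW_MODELS["build"])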

Future-Proofing Through Dynamic Discovery

Most articles say "be model agnostic" without showing how. Here's my implementation:

1. Dynamic Model Discovery

The fetch_models.py script queries OpenRouter for the latest models matching criteria. Preference keywords handle naming variations across versions:

# Handles "claude-3-opus", "claude-opus-4", "claude-opus-4-5", etc.
model = get_latest_model("anthropic", prefer_keywords=["opus"])

No manual updates when models change naming conventions.
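
A simplified sketch of how such a helper can work against OpenRouter's public model listing. It assumes the listing exposes a provider-prefixed id and a created timestamp; this is the core idea, not the framework's actual implementation.

import requests

def get_latest_model(provider: str, prefer_keywords: list[str]) -> str:
    """Pick the newest model whose ID carries the provider prefix and all keywords."""
    models = requests.get("https://openrouter.ai/api/v1/models", timeout=30).json()["data"]
    candidates = [
        m for m in models
        if m["id"].startswith(f"{provider}/")
        and all(kw in m["id"] for kw in prefer_keywords)
    ]
    if not candidates:
        raise LookupError(f"No {provider} model matching {prefer_keywords}")
    # Newest first, assuming the listing includes a creation timestamp.
    return max(candidates, key=lambda m: m.get("created", 0))["id"]

# Matches "claude-3-opus", "claude-opus-4", "claude-opus-4-5", ... without code changes.
opus = get_latest_model("anthropic", prefer_keywords=["opus"])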

2. Model Selection Matrix as Living Document

My model-selection-matrix.md isn't static documentation - it's a decision tree that guides selection:

Task Type               Primary        Fallback   Why
Architecture decisions  Opus           Sonnet     Reasoning depth
Standard coding         Sonnet         Haiku      Cost/capability balance
QA validation           Haiku + Grok   Sonnet     Speed + perspective
Real-time research      Perplexity     Grok       Citation quality

When benchmarks shift, I update the matrix. Not tribal knowledge - codified decisions.
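
One way to keep that matrix executable as well as readable is to mirror it in code, so the routing logic and the documentation can't drift apart. The task-type keys below are illustrative.

# The selection matrix as data: primary choice, fallback, and the reason on record.
SELECTION_MATRIX = {
    "architecture":      {"primary": ["opus"],          "fallback": "sonnet", "why": "reasoning depth"},
    "standard_coding":   {"primary": ["sonnet"],        "fallback": "haiku",  "why": "cost/capability balance"},
    "qa_validation":     {"primary": ["haiku", "grok"], "fallback": "sonnet", "why": "speed + perspective"},
    "realtime_research": {"primary": ["perplexity"],    "fallback": "grok",   "why": "citation quality"},
}

def route(task_type: str) -> dict:
    """Look up the routing decision for a task type; unknown types get the daily driver."""
    return SELECTION_MATRIX.get(task_type, SELECTION_MATRIX["standard_coding"])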

3. Hook-Based Routing

A PreToolUse hook scans task keywords and recommends models:

Task mentions "architecture" → Suggests Opus
Task mentions "quick check" → Suggests Haiku
Task mentions "current events" → Suggests Grok

Soft recommendations, not enforcement. Zero latency - no additional API call needed.
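
A minimal sketch of that hook. The payload shape on stdin is an assumption, and the matching is deliberately the same naive first-match-wins scan discussed under limitations below.

import json
import sys

# Ordered keyword -> model hints; the first match wins, later rules never fire.
ROUTING_RULES = [
    ("architecture",   "opus"),
    ("quick check",    "haiku"),
    ("current events", "grok"),
]

def recommend(task_text: str) -> str | None:
    """Return a soft model recommendation for the task, or None if nothing matches."""
    lowered = task_text.lower()
    for keyword, model in ROUTING_RULES:
        if keyword in lowered:
            return model
    return None

if __name__ == "__main__":
    payload = json.loads(sys.stdin.read() or "{}")   # assumed: JSON task payload on stdin
    suggestion = recommend(payload.get("prompt", ""))
    if suggestion:
        print(f"Routing hint: consider {suggestion} for this task.")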

Provider Diversity as Risk Management

Vendor lock-in is a real concern. My provider distribution:

  • Primary: Anthropic (Claude) - via Claude Code orchestration
  • Secondary: xAI (Grok) - adversarial review, real-time data
  • Tertiary: Perplexity - OSINT, citations
  • Infrastructure: OpenRouter - 300+ models, automatic failover across 50+ providers

If Anthropic's API goes down, Grok handles QA review. If pricing changes dramatically, the routing matrix adjusts. If a new model outperforms on specific tasks, I add it to the ecosystem.

No single provider failure stops work. Degraded capabilities, not broken workflows.
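
The degraded-not-broken behavior is easiest to see as an ordered fallback chain. A sketch, assuming the single-model complete() helper from earlier and illustrative model slugs:

import requests

# Ordered preference for QA review; later entries are the degraded path.
QA_REVIEW_CHAIN = ["anthropic/claude-haiku-4.5", "x-ai/grok-4", "anthropic/claude-sonnet-4.5"]

def first_available(models: list[str], run) -> str:
    """Try each model in order with the given call; return the first successful result."""
    last_error: Exception | None = None
    for model in models:
        try:
            return run(model)
        except requests.RequestException as err:
            last_error = err  # provider outage, rate limit, deprecated model, ...
    raise RuntimeError(f"All models in the chain failed: {last_error}")

# e.g. first_available(QA_REVIEW_CHAIN, lambda m: complete(m, review_prompt))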

The Evaluation Foundation

The framework doesn't claim optimal routing. It claims transparent routing that can be measured and improved.

Built-in Comparison Points

  • Cost per task type: Documented in the selection matrix
  • Capability matching: Which tasks route where and why
  • Failure modes: When each model struggles (documented)
  • Complementary strengths: Dual-model patterns that catch more issues

Ongoing Evaluation Hooks

  • Hook-based routing logs which models get recommended
  • QA review tracks which model caught which issues
  • Token usage tracked per model per task type (sketched below)
  • Foundation for A/B testing different routing strategies
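
The usage log is deliberately boring: one append-only JSONL record per task. The path and field names below are hypothetical, but this is all the structure a later A/B comparison of routing strategies needs.

import json
import time
from pathlib import Path

USAGE_LOG = Path("logs/model_usage.jsonl")  # hypothetical location

def log_usage(model: str, task_type: str, input_tokens: int, output_tokens: int) -> None:
    """Append one record per task: the raw material for comparing routing strategies."""
    USAGE_LOG.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": time.time(),
        "model": model,
        "task_type": task_type,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    with USAGE_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")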

IDC predicts that by 2028, 70% of top AI-driven enterprises will use advanced multi-tool architectures for dynamic model routing. The foundation needs to exist before optimization can happen.

What This Doesn't Solve (Yet)

I'm explicit about limitations:

  • Keyword matching is imprecise: "Quick architecture review" matches both "quick" (Haiku) and "architecture" (Opus)
  • First-match-wins has edge cases: Priority order matters, which means some valid recommendations get overridden
  • No learning from overrides: When I ignore a recommendation, that signal isn't captured
  • No automated benchmarking: Model comparisons are manual, not continuous

For a personal framework, simple-and-transparent beats sophisticated-and-opaque. ML-based routing adds latency and complexity. Optimization comes after I have baseline metrics.

The average AI model lifecycle is under 18 months. Building for perfect routing today means rebuilding when the landscape shifts tomorrow. Building for adaptability means the foundation stays solid while models come and go.

Getting Started

If you're building similar patterns:

  1. Abstract model references: Never hardcode claude-3-opus-20240229 - use dynamic lookup
  2. Document your routing logic: A decision matrix beats intuition
  3. Plan for failure: What happens when your primary model is unavailable?
  4. Measure before optimizing: Log which models handle which tasks before adding ML routing
  5. Stay provider-diverse: One API key doesn't mean one provider

The goal isn't finding the "best" model. The goal is building systems that adapt as the definition of "best" keeps changing.

