LLM Router: Use the Expensive Model Only Where It Earns Its Keep
Most teams run one premium LLM on everything — from classification to architectural decisions. A smart router classifies the request first, then dispatches to a cheap/mid/premium model — cutting cost 30-80% without a quality drop.
Intermediate · AI DevOps · 20 min · Claude Haiku/Sonnet/Opus, OpenAI o3/gpt-4o/gpt-4o-mini, Classifier model
1
One model for everything is the 2026 anti-pattern
Premium models (Opus, GPT-4o) cost $30-60 per million tokens; Haiku and gpt-4o-mini run $0.50-2 per million. A 30x gap. And yet 80% of real requests are classification, rephrasing, field extraction, short FAQ answers. At scale, you're paying 30x for the same answer.
Analogy: imagine a company where a senior engineer does everything — from bug reports to architecture. Junior tasks get billed at senior rates. A team with seniority levels is an order of magnitude cheaper and faster — because the junior isn't stuck in senior meetings. An LLM router is the same idea: the right level for the right task.
❌ One model — Opus for everything
- Predictable price, but expensive
- Simple tasks billed at premium rate
- Latency plateau — Opus is always slower than Haiku
- No lever for optimization
✅ Router — Haiku / Sonnet / Opus
- 30-80% savings on real workloads
- Faster on simple requests (Haiku <1s)
- Complex tasks still go to premium
- Balance can be tuned by metrics
Don't measure 'cost per request' — measure 'cost per request class'. The bulk of real requests are classification and rephrasing; they don't need Opus. A single average cost figure hides all your savings.
2
The classifier is itself a request
To route smartly, you need to know what kind of request this is. The trap: if the classifier is too smart (= expensive), the savings are eaten by the classifier itself. If too simple, it misroutes, and requests fly into the wrong model.
The right balance — a cheap model + structured output + a fallback policy. Haiku with JSON output classifies in <200ms for <$0.0001 per request. That's a tiny routing tax that pays for itself on the first correct dispatch to Haiku instead of Opus.
Signals to classify on: (1) task type — summarize, code, reason, classify; (2) input length — a cheap model wastes money on long context; (3) expected output length; (4) needs tools/function calls; (5) domain expertise level — legal advice and 'rephrase this' are different planets.
What the classifier should check
Task type (summarize / code / reason / classify)
Input length and expected output length
Tools / function calls required
User tier (free / paid) — for prioritization
Classification confidence — below threshold → escalate
Keyword blacklist to force premium (safety, legal)
The classifier must cost <1% of the expensive model. Otherwise you are not optimizing — you are just moving costs to a different line item.
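As a sketch of the fallback policy behind the checklist above — all field names, thresholds, and the blacklist here are illustrative assumptions, not a published API — the routing decision on top of the classifier's structured output might look like this:

```python
import json

# Illustrative routing policy over a cheap classifier's JSON output.
# Field names, thresholds, and FORCE_PREMIUM are assumptions.
FORCE_PREMIUM = {"legal", "medical", "safety"}  # domain blacklist -> always premium

def route(verdict_json: str, confidence_floor: float = 0.7) -> str:
    """Map the classifier's structured verdict to a model tier."""
    try:
        v = json.loads(verdict_json)
    except json.JSONDecodeError:
        return "opus"  # unparseable classifier output: play it safe
    if v.get("domain") in FORCE_PREMIUM:
        return "opus"  # blacklist forces premium regardless of confidence
    if v.get("confidence", 0.0) < confidence_floor:
        return "opus"  # classifier unsure: escalate rather than misroute
    if v.get("needs_tools") or v.get("task_type") in ("reason", "code"):
        return "sonnet"
    if v.get("input_tokens", 0) > 8000:
        return "sonnet"  # long context wastes a cheap model's quality
    return "haiku"  # classify / summarize / rephrase land here

print(route('{"task_type": "classify", "confidence": 0.95}'))  # haiku
```

The key design choice: every failure mode (bad JSON, low confidence, blacklisted domain) degrades toward the premium tier, never toward the cheap one.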
3
Cascade Haiku → Sonnet → Opus: the cheapest tries first
The cascade pattern: start with the cheapest model, check quality, escalate only on failure. Haiku goes first, Sonnet picks up harder cases, Opus kicks in on genuinely tough ones. Most traffic settles on Haiku — and so does the budget.
Why this beats parallel voting (three models answering at once): in voting you always pay for all three. In a cascade you only pay for the expensive path on failures. The difference for a typical 70/20/10 tier load: voting costs the sum of all three on every request; cascade costs Haiku + 30% Sonnet + 10% Opus. A 3-5x gap.
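The 70/20/10 arithmetic can be made concrete. The per-request prices below are placeholders chosen only to illustrate the shape of the comparison; real numbers depend on token counts and current pricing:

```python
# Illustrative per-request prices in $ (assumptions, not published rates)
haiku, sonnet, opus = 0.005, 0.015, 0.05

# Parallel voting: all three models run on every request
voting = haiku + sonnet + opus

# Cascade with a 70/20/10 split: every request hits Haiku,
# 30% escalate to Sonnet, 10% of all traffic reaches Opus
cascade = haiku + 0.30 * sonnet + 0.10 * opus

print(f"voting:  ${voting:.4f}/req")        # $0.0700/req
print(f"cascade: ${cascade:.4f}/req")       # $0.0145/req
print(f"gap:     {voting / cascade:.1f}x")  # 4.8x
```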
Key nuance: when the cascade escalates, pass the junior model's answer as a hint to the senior. Opus sees where Haiku stumbled — that's both task context and an explicit anti-example. The final answer quality beats what Opus would produce solving from scratch.
Request → Haiku → quality gate ──OK──→ Answer
                       │ fail — escalate
                       ▼
                  Sonnet → quality gate ──OK──→ Answer
                                │ fail — escalate
                                ▼
                              Opus → Answer
def cascade(request):
    answer_1 = haiku(request)
    if quality_gate(answer_1) == "ok":
        return answer_1
    answer_2 = sonnet(request, hint=answer_1)  # uses the junior's answer as context
    if quality_gate(answer_2) == "ok":
        return answer_2
    # Opus sees where the juniors went wrong: better odds of not repeating it
    return opus(request, hints=[answer_1, answer_2])
4
Quality gate: without it the cascade becomes a lottery
A quality gate is the function that decides whether the cheap model actually did the job. Without a gate, you return a bad answer and save $0.01 — that's not optimization, that's degrading your product for pennies.
Three types of gate, from simple to complex. First — programmatic: schema check, regex, length, JSON structure. Costs zero, catches gross failures. Good for structured output: classification, extraction, form-filling.
Second — self-critique: the cheap model evaluates its own answer against a stricter prompt. 'Is this answer complete? Any contradictions?' Costs one more Haiku call, catches semantic failures — answer is syntactically correct but factually empty.
Third — retrieval grounding: check that the answer is backed by real sources. For RAG pipelines, must-have. Cites a source not in the retrieved set — hallucinated, escalate.
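A minimal version of the first (programmatic) gate, assuming the cheap model was asked for JSON with label and confidence fields — both names and the length limit are illustrative:

```python
import json

REQUIRED_FIELDS = {"label", "confidence"}  # illustrative schema

def quality_gate(raw_answer: str) -> str:
    """Return "ok" if the cheap model's answer passes, else "escalate"."""
    if not raw_answer or len(raw_answer) > 2000:
        return "escalate"  # empty or suspiciously long output
    try:
        parsed = json.loads(raw_answer)
    except json.JSONDecodeError:
        return "escalate"  # not the JSON we asked for
    if not isinstance(parsed, dict) or not REQUIRED_FIELDS <= parsed.keys():
        return "escalate"  # schema violation
    if not 0.0 <= parsed.get("confidence", -1.0) <= 1.0:
        return "escalate"  # out-of-range confidence
    return "ok"

print(quality_gate('{"label": "spam", "confidence": 0.9}'))  # ok
```

It costs zero model calls and catches the gross failures; the self-critique and grounding gates stack on top for semantic checks.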
The health metric for your gate is the fallback rate — the share of escalated requests. Under 20% means you're saving well; 20-60% is normal; above 60% means the entry model is pitched too low, so start the cascade at Sonnet. A bad gate is worse than no router: Haiku quality at Opus prices.
Log pairs of (input → model_used → escalated?). After a week you will see which request classes consistently escalate and can route them straight to Sonnet, skipping Haiku — that kills one extra call and speeds them up.
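One way to mine that log — the pair format and class names below are hypothetical — is a simple escalation-rate aggregation:

```python
from collections import defaultdict

def classes_to_promote(log, threshold=0.6):
    """Return request classes whose escalation rate exceeds threshold.

    log is an iterable of (request_class, escalated) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # class -> [escalations, total]
    for req_class, escalated in log:
        totals[req_class][1] += 1
        totals[req_class][0] += int(escalated)
    return sorted(c for c, (esc, tot) in totals.items() if esc / tot > threshold)

# A week of (class, escalated?) pairs, compressed for the example
log = ([("faq", False)] * 9 + [("faq", True)]
       + [("contract_review", True)] * 8 + [("contract_review", False)] * 2)
print(classes_to_promote(log))  # ['contract_review'] -> route straight to Sonnet
```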
5
Router metrics: so you don't roll back to a single Opus
Without metrics the story is always the same: in two weeks someone says 'quality is down', management panics, and you roll back to a single premium. Router buried. With metrics, you see exactly which request class is suffering and fix it surgically without touching the rest.
The five metrics below are the minimum. Track them per request class, not as an average: an overall average will hide a problem in 10% of traffic that breaks a key scenario. And always keep a baseline — the same traffic on a single premium — otherwise there is nothing to compare against.
| Metric | What it measures | Target |
|---|---|---|
| Cost per request | Average request cost in $ | 30-80% drop vs baseline |
| Escalation rate | Share of requests escalated to premium | 10-30% — normal |
| Quality score (by class) | User rating or auto-eval per class | Not below single-model baseline |
| Routing accuracy | Did the classifier pick the right tier | >90% |
| p95 latency | Speed at the 95th percentile | No worse than single-model |
Run the router on 5% of traffic in parallel with the main model for 1-2 weeks. Compare quality by request class, not overall. Only then flip to 100%. A single prevented rollback pays for the whole exercise.
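A sketch of that shadow rollout — function and field names are hypothetical, and in shadow mode the router's answer is only logged, never shown to the user:

```python
import random

def handle(request, baseline_model, router_fn, log, shadow_rate=0.05):
    """Serve the baseline answer; mirror a sample of traffic through the router."""
    answer = baseline_model(request)   # the user always sees the baseline
    if random.random() < shadow_rate:  # ~5% of traffic also hits the router
        log.append((request["class"], answer, router_fn(request)))
    return answer

random.seed(42)
shadow_log = []
baseline = lambda r: "premium-answer"
router = lambda r: "router-answer"
for i in range(1000):
    served = handle({"class": "faq", "id": i}, baseline, router, shadow_log)
    assert served == "premium-answer"  # shadow mode never changes user output
print(f"{len(shadow_log)} of 1000 requests shadowed")
```

The quality comparison then happens offline, per request class, over the logged triples.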
Result
A working LLM router with a classifier and a Haiku → Sonnet → Opus cascade that cuts costs by 30-80% without quality loss. A quality gate filters out bad cheap-model answers, and per-class metrics protect against blindly rolling back to a single premium.