LLM Router: Use the Expensive Model Only Where It Earns Its Keep
Most teams run one premium LLM on everything — from classification to architectural decisions. A smart router classifies the request first, then dispatches to a cheap/mid/premium model — cutting cost 30-80% without a quality drop.
Intermediate · AI DevOps · 20 min · Claude Haiku/Sonnet/Opus, OpenAI o3/gpt-4o/gpt-4o-mini, Classifier model
1
One model for everything is the 2026 anti-pattern
Premium models (Opus, GPT-4o) cost $30-60 per million tokens; Haiku and gpt-4o-mini run $0.50-2 per million. A 30x gap. And yet 80% of real requests are classification, rephrasing, field extraction, short FAQ answers. At scale, you're paying 30x for the same answer.
Analogy: imagine a company where a senior engineer does everything — from bug reports to architecture. Junior tasks get billed at senior rates. A team with seniority levels is an order of magnitude cheaper and faster — because the junior isn't stuck in senior meetings. An LLM router is the same idea: the right level for the right task.
❌ One model — Opus for everything
- Predictable price, but expensive
- Simple tasks billed at premium rate
- Latency plateau — Opus is always slower than Haiku
- No lever for optimization
✅ Router — Haiku / Sonnet / Opus
- 30-80% savings on real workloads
- Faster on simple requests (Haiku <1s)
- Complex tasks still go to premium
- Balance can be tuned by metrics
Don't measure 'cost per request' — measure 'cost per request class'. The bulk of real requests are classification and rephrasing; they don't need Opus. A single average cost figure hides all your savings.
2
The classifier is itself a request
To route smartly, you need to know what kind of request this is. The trap: if the classifier is too smart (= expensive), the savings are eaten by the classifier itself. If too simple, it misroutes, and requests fly into the wrong model.
The right balance — a cheap model + structured output + a fallback policy. Haiku with JSON output classifies in <200ms for <$0.0001 per request. That's a tiny routing tax that pays for itself on the first correct dispatch to Haiku instead of Opus.
Signals to classify on: (1) task type — summarize, code, reason, classify; (2) input length — a cheap model wastes money on long context; (3) expected output length; (4) needs tools/function calls; (5) domain expertise level — legal advice and 'rephrase this' are different planets.
What the classifier should check
Task type (summarize / code / reason / classify)
Input length and expected output length
Tools / function calls required
User tier (free / paid) — for prioritization
Classification confidence — below threshold → escalate
Keyword blacklist to force premium (safety, legal)
The classifier must cost <1% of the expensive model. Otherwise you are not optimizing — you are just moving costs to a different line item.
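As a sketch of the fallback policy behind the checklist above — all field names, thresholds, and the blacklist here are illustrative assumptions, not a published API — the routing decision on top of the classifier's structured output might look like this:

```python
import json

# Illustrative routing policy over a cheap classifier's JSON output.
# Field names, thresholds, and FORCE_PREMIUM are assumptions.
FORCE_PREMIUM = {"legal", "medical", "safety"}  # domain blacklist -> always premium

def route(verdict_json: str, confidence_floor: float = 0.7) -> str:
    """Map the classifier's structured verdict to a model tier."""
    try:
        v = json.loads(verdict_json)
    except json.JSONDecodeError:
        return "opus"  # unparseable classifier output: play it safe
    if v.get("domain") in FORCE_PREMIUM:
        return "opus"  # blacklist forces premium regardless of confidence
    if v.get("confidence", 0.0) < confidence_floor:
        return "opus"  # classifier unsure: escalate rather than misroute
    if v.get("needs_tools") or v.get("task_type") in ("reason", "code"):
        return "sonnet"
    if v.get("input_tokens", 0) > 8000:
        return "sonnet"  # long context wastes a cheap model's quality
    return "haiku"  # classify / summarize / rephrase land here

print(route('{"task_type": "classify", "confidence": 0.95}'))  # haiku
```

The key design choice: every failure mode (bad JSON, low confidence, blacklisted domain) degrades toward the premium tier, never toward the cheap one.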
3
Cascade Haiku → Sonnet → Opus: the cheapest tries first
The cascade pattern: start with the cheapest model, check quality, escalate only on failure. Haiku goes first, Sonnet picks up harder cases, Opus kicks in on genuinely tough ones. Most traffic settles on Haiku — and so does the budget.
Why this beats parallel voting (three models answering at once): in voting you always pay for all three. In a cascade you only pay for the expensive path on failures. The difference for a typical 70/20/10 tier load: voting costs the sum of all three on every request; cascade costs Haiku + 30% Sonnet + 10% Opus. A 3-5x gap.
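The 70/20/10 arithmetic can be made concrete. The per-request prices below are placeholders chosen only to illustrate the shape of the comparison; real numbers depend on token counts and current pricing:

```python
# Illustrative per-request prices in $ (assumptions, not published rates)
haiku, sonnet, opus = 0.005, 0.015, 0.05

# Parallel voting: all three models run on every request
voting = haiku + sonnet + opus

# Cascade with a 70/20/10 split: every request hits Haiku,
# 30% escalate to Sonnet, 10% of all traffic reaches Opus
cascade = haiku + 0.30 * sonnet + 0.10 * opus

print(f"voting:  ${voting:.4f}/req")        # $0.0700/req
print(f"cascade: ${cascade:.4f}/req")       # $0.0145/req
print(f"gap:     {voting / cascade:.1f}x")  # 4.8x
```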
Key nuance: when the cascade escalates, pass the junior model's answer as a hint to the senior. Opus sees where Haiku stumbled — that's both task context and an explicit anti-example. The final answer quality beats what Opus would produce solving from scratch.
Request → Haiku → quality gate ──OK──→ Answer
                       │ fail — escalate
                       ▼
                  Sonnet → quality gate ──OK──→ Answer
                                │ fail — escalate
                                ▼
                              Opus → Answer
def cascade(request):
    answer_1 = haiku(request)
    if quality_gate(answer_1) == "ok":
        return answer_1
    answer_2 = sonnet(request, hint=answer_1)  # uses the junior's answer as context
    if quality_gate(answer_2) == "ok":
        return answer_2
    # Opus sees where the juniors went wrong: better odds of not repeating it
    return opus(request, hints=[answer_1, answer_2])
4
Quality gate: without it the cascade becomes a lottery
A quality gate is the function that decides whether the cheap model actually did the job. Without a gate, you return a bad answer and save $0.01 — that's not optimization, that's degrading your product for pennies.
Three types of gate, from simple to complex. First — programmatic: schema check, regex, length, JSON structure. Costs zero, catches gross failures. Good for structured output: classification, extraction, form-filling.
Second — self-critique: the cheap model evaluates its own answer against a stricter prompt. 'Is this answer complete? Any contradictions?' Costs one more Haiku call, catches semantic failures — answer is syntactically correct but factually empty.
Third — retrieval grounding: check that the answer is backed by real sources. For RAG pipelines, must-have. Cites a source not in the retrieved set — hallucinated, escalate.
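A minimal version of the first (programmatic) gate, assuming the cheap model was asked for JSON with label and confidence fields — both names and the length limit are illustrative:

```python
import json

REQUIRED_FIELDS = {"label", "confidence"}  # illustrative schema

def quality_gate(raw_answer: str) -> str:
    """Return "ok" if the cheap model's answer passes, else "escalate"."""
    if not raw_answer or len(raw_answer) > 2000:
        return "escalate"  # empty or suspiciously long output
    try:
        parsed = json.loads(raw_answer)
    except json.JSONDecodeError:
        return "escalate"  # not the JSON we asked for
    if not isinstance(parsed, dict) or not REQUIRED_FIELDS <= parsed.keys():
        return "escalate"  # schema violation
    if not 0.0 <= parsed.get("confidence", -1.0) <= 1.0:
        return "escalate"  # out-of-range confidence
    return "ok"

print(quality_gate('{"label": "spam", "confidence": 0.9}'))  # ok
```

It costs zero model calls and catches the gross failures; the self-critique and grounding gates stack on top for semantic checks.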
The health metric for your gate is the fallback rate — the share of escalated requests. Under 20% means you're saving well; 20-60% is normal; above 60% means the entry model is pitched too low, so start the cascade at Sonnet. A bad gate is worse than no router: Haiku quality at Opus prices.
Log pairs of (input → model_used → escalated?). After a week you will see which request classes consistently escalate and can route them straight to Sonnet, skipping Haiku — that kills one extra call and speeds them up.
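One way to mine that log — the pair format and class names below are hypothetical — is a simple escalation-rate aggregation:

```python
from collections import defaultdict

def classes_to_promote(log, threshold=0.6):
    """Return request classes whose escalation rate exceeds threshold.

    log is an iterable of (request_class, escalated) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # class -> [escalations, total]
    for req_class, escalated in log:
        totals[req_class][1] += 1
        totals[req_class][0] += int(escalated)
    return sorted(c for c, (esc, tot) in totals.items() if esc / tot > threshold)

# A week of (class, escalated?) pairs, compressed for the example
log = ([("faq", False)] * 9 + [("faq", True)]
       + [("contract_review", True)] * 8 + [("contract_review", False)] * 2)
print(classes_to_promote(log))  # ['contract_review'] -> route straight to Sonnet
```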
5
Router metrics: so you don't roll back to a single Opus
Without metrics the story is always the same: in two weeks someone says 'quality is down', management panics, and you roll back to a single premium. Router buried. With metrics, you see exactly which request class is suffering and fix it surgically without touching the rest.
The five metrics below are the minimum. Track them per request class, not as an average: an overall average will hide a problem in 10% of traffic that breaks a key scenario. And always keep a baseline — the same traffic on a single premium — otherwise there is nothing to compare against.
| Metric | What it measures | Target |
|---|---|---|
| Cost per request | Average request cost in $ | 30-80% drop vs baseline |
| Escalation rate | Share of requests escalated to premium | 10-30% — normal |
| Quality score (by class) | User rating or auto-eval per class | Not below single-model baseline |
| Routing accuracy | Did the classifier pick the right tier | >90% |
| p95 latency | Speed at the 95th percentile | No worse than single-model |
Run the router on 5% of traffic in parallel with the main model for 1-2 weeks. Compare quality by request class, not overall. Only then flip to 100%. A single prevented rollback pays for the whole exercise.
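A sketch of that shadow rollout — function and field names are hypothetical, and in shadow mode the router's answer is only logged, never shown to the user:

```python
import random

def handle(request, baseline_model, router_fn, log, shadow_rate=0.05):
    """Serve the baseline answer; mirror a sample of traffic through the router."""
    answer = baseline_model(request)   # the user always sees the baseline
    if random.random() < shadow_rate:  # ~5% of traffic also hits the router
        log.append((request["class"], answer, router_fn(request)))
    return answer

random.seed(42)
shadow_log = []
baseline = lambda r: "premium-answer"
router = lambda r: "router-answer"
for i in range(1000):
    served = handle({"class": "faq", "id": i}, baseline, router, shadow_log)
    assert served == "premium-answer"  # shadow mode never changes user output
print(f"{len(shadow_log)} of 1000 requests shadowed")
```

The quality comparison then happens offline, per request class, over the logged triples.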
Result
A working LLM router with a classifier and a Haiku → Sonnet → Opus cascade that cuts costs by 30-80% without quality loss. A quality gate filters out bad cheap-model answers, and per-class metrics protect against blindly rolling back to a single premium.